You’re receiving this newsletter because, during the download or purchase of Sawmill, you checked the box to join our mailing list. If you wish to be removed from this list, please send an email with the subject line “UNSUBSCRIBE” to newsletter@sawmill.net.
News
Sawmill 7.2.10 shipped on August 4, 2007. This is a minor "bug fix"
release, and it is free to existing Sawmill 7 users. It is not a
critical update, but it does fix a number of bugs, adds support for
many new log formats, and adds a few small features. It is recommended
for anyone who is experiencing problems with Sawmill 7.2.9 or earlier.
You can download it from http://sawmill.net/download.html.
This issue of the Sawmill Newsletter describes using database
merges to improve database build performance.
Get the Most out of Sawmill with Professional Services
Looking to get more out of your statistics from Sawmill? Running short
on time, but need the information now to make critical business
decisions? Our Professional Service Experts are available for just this
situation and many others. We will assist in the initial installation
of Sawmill using best practices; work with you to integrate and
configure Sawmill to generate reports in the shortest possible time. We
will tailor Sawmill to your environment, create a customized solution,
be sensitive to your requirements and stay focused on what your
business needs are. We will show you areas of Sawmill you may not even
be aware of, demonstrating streamlined methods to get you the
information more quickly. Often you'll find that Sawmill's deep
analysis can even provide you with information you've been after but
never knew how to reach, or possibly never realized was readily
available in reports. Sawmill is an
extremely powerful tool for your business, and most users only exercise
a fraction of this power. That's where our experts really can make the
difference. Our Sawmill experts have many years of experience with
Sawmill
and with a large cross section of devices and business sectors. Our
promise is to very quickly come up with a cost-effective solution that
fits your business, and greatly expand your ROI with only a few
hours of fee-based Sawmill Professional Services. For more information,
a quote, or to speak directly with a Professional Services expert,
contact consulting@flowerfire.com.
Tips & Techniques: Using Database Merges
Note: Database merge is available only with the internal database;
it is not available for profiles that use a MySQL database.
A default profile created in Sawmill uses a single processor (single
core) to parse log data and build the database. This is a good choice
for shared environments, where using all processors can bog down the
system, but for best performance, set "log processing threads" to the
number of processors, in the Log Processing options in the Config page
of the profile. That will split log processing across multiple
processors, improving the performance of database builds and updates
by using all processors on the system. This feature is available only
with Sawmill Enterprise; non-Enterprise versions of Sawmill can use
only one processor.
If the dataset is too large to process in an acceptable time on a
single computer, even with multiple processors, it is possible to split
the processing across multiple machines. This is accomplished by
building a separate database on each system, and then merging them to
form a single large database. For instance, this command line adds the
data from the database for profile2 to the database for profile1:
sawmill -p profile1 -a md -mdd Databases/profile2/main
or on Windows:
SawmillCL -p profile1 -a md -mdd Databases\profile2\main
After this command completes, profile1 will show the data it
showed before the command, and the data that profile2
showed before the command (profile2 will be unchanged).
This makes it possible to build a database twice as fast using this
sequence:
sawmill -p profile1 -a bd
sawmill -p profile2 -a bd
sawmill -p profile1 -a md -mdd Databases/profile2/main
(On Windows, use SawmillCL and backslashes, as shown above.)
The critical piece is that the first two commands must run
simultaneously; if you run them one after the other, they will take as
long as building the whole database in one pass. But on a
two-processor system, each build can use a full CPU, fully using both
CPUs and running nearly twice as fast as a single build. The merge
then takes some extra time, but overall this is still faster than a
single-process build.
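This sequence can be scripted. Here is a minimal Python sketch of the idea (it assumes the sawmill binary is on the PATH and that the named profiles exist): the builds run concurrently in a thread pool, and the merge runs only after both finish. The command runner is injectable, so the plan can be inspected without Sawmill installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run(cmd):
    # Run one Sawmill command line; raise if it exits non-zero.
    subprocess.run(cmd, check=True)

def parallel_build_then_merge(profiles, target, runner=run):
    """Build every profile's database concurrently, then merge each of
    the other profiles' databases into the target profile."""
    builds = [["sawmill", "-p", p, "-a", "bd"] for p in profiles]
    with ThreadPoolExecutor(max_workers=len(builds)) as pool:
        list(pool.map(runner, builds))  # block until all builds finish
    for p in profiles:
        if p != target:
            runner(["sawmill", "-p", target, "-a", "md",
                    "-mdd", f"Databases/{p}/main"])

# Example: the two-profile sequence from the text.
# parallel_build_then_merge(["profile1", "profile2"], "profile1")
```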
Running a series of builds simultaneously can be done by opening
multiple windows and running a separate build in each window, or by
"backgrounding" each command before starting the next (available on
UNIX and similar systems). But for a fully automated environment, this
is best done with a script. The attached Perl script, multisawmill.pl,
can be used to build multiple databases simultaneously. You will need
to modify the top of the script to match your environment, and set the
number of threads; then when you run it, it will spawn many database
builds simultaneously (the number you specified), and as each
completes, it will start another one. This script is provided as-is,
with no warranty, as a proof-of-concept of a
multiple-simultaneous-build script.
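In the same spirit (this is a sketch, not the attached Perl script itself), a bounded worker pool takes only a few lines of Python: at most max_parallel builds run at once, and as each completes, the pool starts the next.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def build_all(profiles, max_parallel, runner=None):
    """Run 'sawmill -p <profile> -a bd' for every profile, keeping at
    most max_parallel builds in flight at any moment."""
    if runner is None:
        runner = lambda cmd: subprocess.run(cmd, check=True)
    cmds = [["sawmill", "-p", p, "-a", "bd"] for p in profiles]
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() blocks until every build has finished.
        for _ in pool.map(runner, cmds):
            pass

# Example: build 365 daily profiles, 8 at a time.
# build_all([f"day{n}" for n in range(1, 366)], max_parallel=8)
```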
Using the attached script, or something like it, you can apply this
approach to much larger datasets, for instance to build a year of data:
1. Create a profile for each day in the year (it is probably easiest
to use Create Many Profiles to do this; see Setting
Up Multiple Users in the Sawmill documentation).
2. Build all profiles, 8 at a time (or however many cores you have
available). If you have multiple machines available, you can use
multiple installations of Sawmill, by partitioning the profiles into
multiple systems. For instance, if you have two 8-core nodes in the
Sawmill cluster, you could build 16 databases at a time; with four
4-core nodes, likewise 16 at a time. This portion of the build can
give a linear speedup: nearly 32x faster log processing than a single
process, using an 8-core 4-node cluster (32 cores in all).
3. Merge all the databases. The simplest way to do this, in a 365-day
example, is to run 364 merges, adding each day into the final one-year
database.
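The 364 sequential merge commands can be generated rather than typed. In this sketch, the profile names day1 through day365 are illustrative (they would come from however Create Many Profiles named your profiles); day1's database serves as the base that the other 364 are added into.

```python
def sequential_merge_commands(source_profiles, target):
    """One 'md' merge command per source database, each adding that
    database into the target profile's database."""
    return [["sawmill", "-p", target, "-a", "md",
             "-mdd", f"Databases/{p}/main"] for p in source_profiles]

# Example: merge day2..day365 into day1, which becomes the
# one-year database (364 merges, run one after another).
plan = sequential_merge_commands([f"day{n}" for n in range(2, 366)], "day1")
```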
When the merge is done, the one-year database will function as though
it had been built in a single "build database" step--but it will have
taken much less time to build.
Advanced Topic: Using Binary Merges
The example described above uses "sequential merges" for step 3--it
runs 364 separate merge steps, one after another, to create the final
database. Each of these merges uses only a single processor of a single
node, so this portion of the build does not use the cluster
efficiently; and this can cause step 3 to take longer than step 2: the
merge can be slower than the processing and building of data. To
improve this, a more sophisticated merge method can be scripted, using
a "binary tree" of merges to build the final database. Roughly, each
core on each node is assigned two one-day databases, which it merges,
forming two-day databases. Then each core of each node is assigned two
two-day databases, which they merge to form a four-day database. This
continues until a final merge combines two half-year databases into a
one-year database. The number of merge stages is much less than the
number of merges required if done sequentially.
For simplicity, let's assume we're merging 16 days, on a 4-core
cluster. On a 4-core cluster, we can do 4 merges at a time.
Step 1, core 1: Merge day1 with day2, creating day[1,2].
Step 1, core 2: Merge day3 with day4, creating day[3,4].
Step 1, core 3: Merge day5 with day6, creating day[5,6].
Step 1, core 4: Merge day7 with day8, creating day[7,8].
When those are complete, we would continue:
Step 2, core 1: Merge day9 with day10, creating day[9,10].
Step 2, core 2: Merge day11 with day12, creating day[11,12].
Step 2, core 3: Merge day13 with day14, creating day[13,14].
Step 2, core 4: Merge day15 with day16, creating day[15,16].
Now we have taken 16 databases and merged them in two steps into 8
databases. Now we merge them into four databases:
Step 3, core 1: Merge day[1,2] with day[3,4], creating day[1,2,3,4].
Step 3, core 2: Merge day[5,6] with day[7,8], creating day[5,6,7,8].
Step 3, core 3: Merge day[9,10] with day[11,12], creating
day[9,10,11,12].
Step 3, core 4: Merge day[13,14] with day[15,16], creating
day[13,14,15,16].
Now we merge into two databases:
Step 4, core 1: Merge day[1,2,3,4] with day[5,6,7,8], creating
day[1,2,3,4,5,6,7,8].
Step 4, core 2: Merge day[9,10,11,12] with day[13,14,15,16], creating
day[9,10,11,12,13,14,15,16].
And finally:
Step 5, core 1: Merge day[1,2,3,4,5,6,7,8] with
day[9,10,11,12,13,14,15,16], creating
day[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16].
So in 5 steps, we have built what would have required 15 steps using
sequential merges: a 16-day database. This approach can be used to
speed up much larger merges even more.
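This schedule can be computed mechanically. The sketch below (the day1 names and the "[a+b]" result labels are illustrative, not Sawmill database names) pairs databases round by round and splits each round into steps of at most `cores` concurrent merges; for 16 databases on 4 cores it reproduces the 5-step, 15-merge plan above.

```python
def binary_merge_plan(dbs, cores):
    """Return a list of steps; each step is a list of (left, right,
    result) merges that can run concurrently on the available cores."""
    steps = []
    while len(dbs) > 1:
        # Pair up adjacent databases; an odd one out waits a round.
        pairs = [(dbs[i], dbs[i + 1]) for i in range(0, len(dbs) - 1, 2)]
        leftover = [dbs[-1]] if len(dbs) % 2 else []
        merged = []
        # Run at most 'cores' of this round's merges per step.
        for i in range(0, len(pairs), cores):
            step = [(a, b, f"[{a}+{b}]") for a, b in pairs[i:i + cores]]
            steps.append(step)
            merged += [result for _, _, result in step]
        dbs = merged + leftover
    return steps

# Example: 16 one-day databases on a 4-core cluster.
plan = binary_merge_plan([f"day{n}" for n in range(1, 17)], 4)
```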
Advanced Topic: Re-using One-Day Databases
In the approach above, the one-day databases are not destroyed by the
merge, which reads data from them but does not write to them. This
makes it possible to keep the one-day databases for fast access to
reports from a particular day. By leaving the one-day databases in
place after the merge is complete, users will be able to select a
particular database from the Profiles list, to see fast reports for
just that day (a one-day database generates reports much faster than a
365-day database).
Advanced Topic: Using Different Merge Units
In the discussion above, we used one day as the unit of merge, but any
unit can be used. In particular, if you are generating a database
showing reports from 1000 sites, you could use a site as the
unit. After building the databases from 1000 sites, you could then
merge all 1000 databases to create an all-sites profile for
administrative overview, leaving each of the 1000 one-site profiles to
be accessed by its users.
Questions or suggestions? Contact support@sawmill.net. If you would
like a Sawmill Professional Services expert to implement this, or
another customization, contact consulting@sawmill.net.