Newsletters

Sawmill Newsletter

December 15, 2009

Welcome to the Sawmill Newsletter!

You’re receiving this newsletter because during the downloading or purchase of Sawmill, you checked the box to join our mailing list. If you wish to be removed from this list, please send an email, with the subject line of “UNSUBSCRIBE” to newsletter@sawmill.net (please include the entire message, as the identifying information is at the bottom).

News

Sawmill 8.1.2 shipped on December 10, 2009. This is an bug-fix release--it fixes a number of bugs. This release is free to existing Sawmill 8 users. It is recommended for anyone who is experiencing problems with Sawmill 8.1.1 or earlier. You can download it from http://sawmill.net/download.html .

Sawmill 7 users can upgrade to Sawmill 8 for half of the license price; or if you have Premium Support, the upgrade is free. Major features of Sawmill 8 include support for Oracle and Microsoft SQL Server databases, real-time reporting, a completely redesigned web interface, better multi-processor and multi-core support, and role-based authentication control.

This issue of the Sawmill Newsletter describes how to tune cross-reference tables to achieve the optimum trade-off between database build performance and reporting performance.

Get The Most Out Of Sawmill With Professional Services

Looking to get more out of your statistics from Sawmill? Running short on time, but need the information now to make critical business decisions? Our Professional Service Experts are available for just this situation and many others. We will assist in the initial installation of Sawmill using best practices; work with you to integrate and configure Sawmill to generate reports in the shortest possible time. We will tailor Sawmill to your environment, create a customized solution, be sensitive to your requirements and stay focused on what your business needs are. We will show you areas of Sawmill you may not even be aware of, demonstrating these methods will provide you with many streamlined methods to get you the information more quickly. Often you'll find that Sawmill's deep analysis can even provide you with information you've been after but never knew how to reach, or possibly never realized was readily available in reports. Sawmill is an extremely powerful tool for your business, and most users only exercise a fraction of this power. That's where our experts really can make the difference. Our Sawmill experts have many years of experience with Sawmill and with a large cross section of devices and business sectors. Our promise is to very quickly come up with a cost effective solution that fits your business, and greatly expand your ROI with only a few hours of fee based Sawmill Professional Services. For more information, a quote, or to speak directly with a Professional services expert contact consulting@flowerfire.com.

Tips & Techniques: Tuning Cross-References Tables

When dealing with very large datasets, the speed of database build, speed of database updates, and speed of report generation become important issues. As a general rule, you can speed database builds by slowing reports, or you can speed reports by slowing database builds, using various configuration options. This article discusses one of the key configuration categories that will be configured to tune the build/report tradeoff: cross-reference tables.

The Main Table

Suppose Sawmill is processing a 1-billion line log dataset. The first part of a database build is to parse these billion lines, and put them into the "main table" of the database, one row per line (some formats have multiple lines per event; in those cases it's one row per event).

In the simplest case, without the optimizations described below, Sawmill must query the entire 1-billion-line main table, in order to generate any report. So a report of events by day will require it to examine all 1 billion lines, group them into days, and display the results in the Days table. Sawmill is quite fast at running this sort of query against the main table, considering the work required; but it does take time proportional to the size of the main table, so if it takes 1 second to query a 1 million line table, it will take 1000 seconds to query a billion line table, and so on. This means that for very large datasets, unoptimized full-main-table-query reports will take minutes to run. That's OK for scheduled, emailed reports, but it is too long for truly interactive usage. To speed reports, Sawmill has several optimizations to prevent it from having to read the entire main table for a query; one of the major optimizations is Cross-Reference Tables.

Cross-Reference Tables Are Precomputed Report Tables

Cross-reference tables (sometimes called Cross-Reference Groups, or xrefs) are tables in Sawmill's database which contain pre-computed, pre-aggregated values to use in reports. For instance, continuing with the 1-billion line dataset and Days report example above, there could be a cross-reference table which contains one row per day, and the number of events (and other numerical fields) for each day. If there is such a cross-reference table in the database, Sawmill will use it directly to generate the Days report, instead of scanning the main table. If the dataset is 30 days, this table will be only 30 rows long, and can be queried almost instantly. A report which might have taken 10 seconds, or 100 seconds, will take less than 1 second if generated from an xref table. For a 30-day dataset, this table will be 30 rows regardless of the size of the dataset, so unlike main-table-query reports, xref reports will not be slower for large datasets.

Creating a cross-reference table like this take a while, because it does require a full scan of the main table; but it can be done during database build, rather than during report generation. So a single xref table may add several minutes to the database build time, but it makes the associated top-level report much faster. If reporting speed is critical, the extra build time is well worth the fast reporting time.

However, if reporting time is not critical, or if it's critical only for certain reports, and if database build time is critical, then the time spent building the cross-reference table will not be worth the benefit. In this case, the cross-reference table can be disabled or deleted from the profile, eliminating that part of the database build, and speeding the database build. Reporting for the associated report will then require a main table scan, and may take many seconds, but if it's the build speed that's important, this is an acceptable trade.

Cross-Reference Tables Are Only A Performance Optimization

Cross-reference tables do not affect, in any way, the appearance or availability of reports. They are a performance optimization only--all reports are available, and look the same, with or without cross-references. The only difference is the speed of report generation. So cross-references are never required to generate any report--they just make the report faster.

Managing Cross-Reference Tables In Config

Cross-reference tables can be configured in the Config section of the profile, in More Options -> Cross Reference Groups:

The Cross-Reference Groups Editor

This editor can be used to delete cross-reference tables (permanently), disable cross-reference tables (so they can later be re-enabled, if desired), or create new ones. After deleting or adding a cross-reference table, or editing an existing one, the database will need to be rebuilt (or at least the cross-reference tables, from the command line with "sawmill -p profile -a rcrt").

Flat vs. Hierarchical Cross-Reference Tables

In additional to having a list of fields, cross-reference tables also have one other option: they can be "flat" or "hierarchical" (i.e., not flat). In the case of the Days cross-reference report, a "flat" table would be one which contained one row per day. A hierarchical table would also contain one row per month, one row per year, and one row representing the entire dataset. Hierarchical tables take longer to generate (about twice as long, though it depends on the field), but they make certain queries faster than flat tables. In particular, the Overview can be computed from the single "entire dataset" line of a hierarchical cross-reference table; for a flat table, the Overview will have to be computed by combining all rows, which takes longer. Similarly, a hierarchical report like a Pages/Directories report in web log analysis can be generated directly from a hierarchical cross-reference table, with each row of the cross-reference table used directly as one row of the report (because each directory in the pages hierarchy has a single row in a hierarchical xref table); with a flat table, all files in a folder must be aggregated, to compute the row for that folder.

So, a "flat" cross-reference table is somewhere between no cross-reference table and a full hierarchical table--it speeds reports some, but not as much as a hierarchical table, and it takes some time to build, but not as long as a hierarchical table. If a report needs to be faster than it is without a cross-reference table, but the database build needs to be faster than it would be with a full hierarchical cross-reference table, a "flat" table may be a good compromise.

Sawmill's Default Behavior Favors Reporting Speed

Sawmill's default behavior is to add a hierarchical cross-reference table for every non-aggregating database field. Each table cross-references that field against the date/time field, and aggregates all numerical fields. This set of cross-reference tables provides high speed for all standard top-level unfiltered reports, and for all top-level reports filtered by date. In general, each of the standard reports will have a cross-reference table designed to speed that report, with an optional date filter. For a typical profile, there may be 20-30 reports, so this is 20-30 default cross-reference table, each of them requiring a full table scan during database build. In other words, Sawmill's default behavior is to optimize all reports for speed, by making the database build time longer.

The sum of all this pre-calculation can exceed the time to import the data to the main table, so the time to build all cross-reference tables may be the largest part of the database build time. Removing some or all cross-reference tables can greatly reduce build and update times.

Adding Cross-Reference Tables For A New Report

Though Sawmill optimizes the profile for speed when it is first created, by adding a cross-reference table for every report, it does not optimize custom reports added later. If you add, for instance, a "source IPs by destination IPs" report, it will be generated from the main table, because there is no cross-reference table by default which contains both source IP and destination IP. To speed up that report, add a new cross-reference table containing all columns of the report (source IP, destination IP, and all aggregating numerical columns). If the report is filtered, also add the filtered field(s) to the cross-reference table. Keep in mind, however, that the more non-aggregating columns there are in a cross-reference table, the larger the table is (each field multiplies the size of the table by the number of values in the new field, roughly). Cross-reference tables with two or three non-aggregating fields are usually reasonable--moving to four or five may make the database so large and slow that the reports would be best generated from the main table.

Turning Off All Cross-Reference Tables When Creating A Profile

Because cross-references have such a large impact on build performance, there is a "gang switch" in the Create Profile Wizard can be used to turn them all off, by default. For fastest database build performance, you can un-check the "Turn on cross reference groups" option in the New Profile Wizard (while creating a profile), and all cross-reference groups will be automatically disabled:

Turning Off Cross-References In The New Profile Wizard

As described above, this will greatly speed database builds and updates, but will make all top-level reports much slower, especially for very large datasets.

Advanced Topic: Cross-References And Real-Time

When using real-time importing in a profile, Sawmill "streams" the log data, typically from a command-line log source, adding it to the main table of the database immediately as it is sent by the log source. The database build never ends--it is always importing data. When a report is required, the build pauses temporarily, and the report is generated; then the build resumes.

If there is a cross-reference table which can deliver the requested report, the xref table must be updated with the latest data in the main table (only that table will be updated, not all of them), before the report can be generated. This introduces an extra step to report generation, which can be quite long if a large number of lines have been added since the last report request. Disabling cross-reference tables eliminates this delay, and allows real-time reports to begin generating immediately. However, with cross-reference tables disabled, the report will have to generate directly from the main table, so the query portion of the report will take longer. This means there is a performance cost during reporting either way, whether the cross-reference is on, or off. Generally, the more frequently reporting is expected, the more benefit there will be from cross-reference tables in real-time, because the update to the cross-reference table will be smaller and faster. It is also possible to reduce the xref portion of real-time reports by scheduling a report to be generated (to disk, or to email) periodically, so when an ad hoc report is requested, its xref will have been fairly recently updated by the scheduled task.

Advanced Topic: Two-Stage Database Building: Building Without Cross-References, Then Adding Cross-References

When a dataset is particularly difficult to build, whether because of issues with log filters, dataset size, build time, resources (memory or disk), or even bugs, it can be useful to build it in two stages: first building without cross-references, and when that succeeds, building the cross-references separately. Because the parsing of the log data, and import into the main table, takes only a fraction of the time required for a full database build with all cross-references, and because parsing and import generally requires less memory and disk space than the cross-references, it can be useful to first build the "core" of the database (the main table alone), and build the other pieces only if the core build succeeds. It is also possible to generate reports from the core database immediately when it's done building, while simultaneously building xrefs elsewhere (e.g., in a different installation or profile, using a copy of the core database).

This can be done by (1) disabling all cross-references (perhaps in the Create Profile Wizard), (2) building the database, (3) optionally backing up the successful "core" database, (4) enabling some or all cross-references, and (5) updating the database to force the newly-enabled cross-references to be built. It is also possible to build or rebuild cross-references from the command line with "sawmill -p profile -a rcrt".

If a particular xref table is using too much memory, or causing other problems, this approach can be extended into multiple steps: first build the core database, then add some cross-references, and build them, then add some more, etc., with database backups between each step. This ensures that if resource outage or some other issue caused the cross-reference build to fail, the database is still available to be used.

Two-stage database building is not usually required--it is generally only useful when the database is extremely large (several days to build the database), and reports are required soon. It reduces loss of time in the event of a failure, and allows some reports to be generated, albeit slowly, sooner than they would be available if the entire build had to complete. If there's no hurry, it's usually best to just let the build run, and if it runs out of memory or disk space, fix the problem then and restart it.

Advanced Topic: Cross-Reference Tables In External SQL Databases

Cross-reference tables work similarly in external SQL database (e.g., Microsoft SQL Server, Oracle, or MySQL), as in the internal database, but with one important difference. When a "unique" field is used (like "unique client IPs" or "visitors" or "sessions"), the internal database can represent the necessary list of unique values in a single row, but other SQL databases require multiple rows to represent the list. This means that a Days xref table does not have one row per day, in an external SQL database--it has one row per day per visitor on that day. This makes xref tables larger and slower in external SQL databases. They still greatly outperform direct queries of the main table, but they do not have the compact representation used in the internal database. Therefore, with external SQL databases, the performance benefits of cross-reference tables are reduced, versus the internal database. This is one of the major advantages of the internal database, over external SQL databases.

Related Topic: Database Indices

This newsletter has discussed cross-reference groups, which are a performance optimization which, while not required to generate reports, makes them faster at a cost in database build time. That description also exactly describes database indices. Indices are separate from cross-reference groups, but are similar in that they are computed during database build time, and speed up reports. Indices, however, are used only when filtering, and only when there is no cross-reference table available to deliver the report results. Indices work by pointing Sawmill to only the relevant rows of the main table, so it doesn't need to scan the entire main table to generated a filtered report; indices provide increasing performance benefit as more report filters are applied. Indices can be turned on or off in the Database Fields section of Config, separately for each database field; they also can be all turned off in the Create Profile Wizard, on the same page where all xrefs can be disabled. Much of the discussion above of trade-offs between database build time and report time, for xrefs, also applies to indices.

For the very best database build speed, turn off all cross-references and all indices (and possibly sessions; this can all be done in the Create Profile Wizard).

Professional Services

This newsletter describes customization of Cross-Reference Groups. If you need assistance with tuning the performance of your cross-reference tables, or other aspects of your profiles, or with any other Sawmill tasks, our Sawmill Experts can help. Contact sales@sawmill.net for more information.

[Article revision v1.0]
[ClientID: 43726]