You’re receiving this newsletter because during the
downloading or purchase of
Sawmill, you checked the box to join our mailing list. If you wish to
be removed from this list, please send an email, with the subject line
of “UNSUBSCRIBE” to newsletter@sawmill.net
(please include the entire message, as the identifying information is
at the bottom).
News
Sawmill 8.1.2 shipped on December 10, 2009. This is an bug-fix
release--it fixes a number of
bugs. This release is free to existing Sawmill 8 users. It is
recommended
for anyone who is experiencing problems with Sawmill 8.1.1 or earlier.
You can download it from http://sawmill.net/download.html
.
Sawmill 7 users can upgrade to Sawmill 8 for half of the license price;
or if you have Premium Support, the upgrade is free. Major features of
Sawmill 8
include
support for Oracle and Microsoft SQL Server databases, real-time
reporting, a completely redesigned web interface, better
multi-processor and multi-core support, and role-based authentication
control. This issue of the Sawmill Newsletter describes how to tune
cross-reference tables to achieve the optimum trade-off between
database build performance and reporting performance.
Get The Most Out Of Sawmill With Professional Services
Looking to get more out of your statistics from Sawmill? Running short
on time, but need the information now to make critical business
decisions? Our Professional Service Experts are available for just this
situation and many others. We will assist in the initial installation
of Sawmill using best practices; work with you to integrate and
configure Sawmill to generate reports in the shortest possible time. We
will tailor Sawmill to your environment, create a customized solution,
be sensitive to your requirements and stay focused on what your
business needs are. We will show you areas of Sawmill you may not even
be aware of, demonstrating these methods will provide you with
many streamlined methods to get you the information
more quickly. Often you'll find that Sawmill's deep analysis can even
provide you with information you've been after but never knew how to
reach, or
possibly never realized was readily available in reports. Sawmill is an
extremely powerful tool for your business, and most users only exercise
a fraction of this power. That's where our experts really can make the
difference. Our Sawmill experts have many years of experience with
Sawmill
and with a large cross section of devices and business sectors. Our
promise is to very quickly come up with a cost effective solution that
fits your business, and greatly expand your ROI with only a few
hours of fee based Sawmill Professional Services. For more information,
a quote, or to speak directly with a Professional services expert
contact
consulting@flowerfire.com.
Tips & Techniques: Tuning Cross-References Tables
When dealing with very large datasets, the speed of database build,
speed of database updates, and speed of report generation become
important issues. As a general rule, you can speed database builds by
slowing reports, or you can speed reports by slowing database builds,
using various configuration options. This article discusses one of the
key configuration categories that will be configured to tune the
build/report tradeoff: cross-reference tables.
The Main Table
Suppose Sawmill is processing a 1-billion line log dataset. The first
part of a database build is to parse these billion lines, and put them
into the "main table" of the database, one row per line (some formats
have multiple lines per event; in those cases it's one row per event).
In the simplest case, without the optimizations described below,
Sawmill must query the entire 1-billion-line main table, in order to
generate any report. So a report of events by day will require it to
examine all 1 billion lines, group them into days, and display the
results in the Days table. Sawmill is quite fast at running this sort
of query against the main table, considering the work required; but it
does take time proportional to the size of the main table, so if it
takes 1 second to query a 1 million line table, it will take 1000
seconds to query a billion line table, and so on. This means that for
very large datasets, unoptimized full-main-table-query reports will
take minutes to run. That's OK for scheduled, emailed reports, but it
is too long for truly interactive usage. To speed reports, Sawmill has
several optimizations to prevent it from having to read the entire main
table for a query; one of the major optimizations is Cross-Reference
Tables.
Cross-Reference Tables Are Precomputed Report Tables
Cross-reference tables (sometimes called Cross-Reference Groups, or
xrefs) are tables in Sawmill's database which contain pre-computed,
pre-aggregated values to use in reports. For instance, continuing with
the 1-billion line dataset and Days report example above, there could
be a cross-reference table which contains one row per day, and the
number of events (and other numerical fields) for each day. If there is
such a cross-reference table in the database, Sawmill will use it
directly to generate the Days report, instead of scanning the main
table. If the dataset is 30 days, this table will be only 30 rows long,
and can be queried almost instantly. A report which might have taken 10
seconds, or 100 seconds, will take less than 1 second if generated from
an xref table. For a 30-day dataset, this table will be 30 rows regardless
of the size of the dataset, so unlike main-table-query reports,
xref reports will not be slower for large datasets.
Creating a cross-reference table like this take a while, because it
does require a full scan of the main table; but it can be done during
database build, rather than during report generation. So a single xref
table may add several minutes to the database build time, but it makes
the associated top-level report much faster. If reporting speed is
critical, the extra build time is well worth the fast reporting time.
However, if reporting time is not critical, or if it's critical only
for certain reports, and if database build time is critical, then the
time spent building the cross-reference table will not be worth the
benefit.
In this case, the cross-reference table can be disabled or deleted from
the profile, eliminating that part of the database build, and speeding
the database build. Reporting for the associated report will then
require a main table scan, and may take many seconds, but if it's the
build speed that's important, this is an acceptable trade.
Cross-Reference Tables Are Only A Performance Optimization
Cross-reference tables do not affect, in any way, the appearance or
availability of reports. They are a performance optimization only--all
reports are available, and look the same, with or without
cross-references. The only difference is the speed of report
generation. So cross-references are never required to generate
any report--they just make the report faster.
Managing Cross-Reference Tables In Config
Cross-reference tables can be configured in the Config section of the
profile, in More Options -> Cross Reference Groups:
The Cross-Reference Groups Editor
This editor can be used to delete cross-reference tables (permanently),
disable cross-reference tables (so they can later be re-enabled, if
desired), or create new ones. After deleting or adding a
cross-reference table, or editing an existing one, the database will
need to be rebuilt (or at least the cross-reference tables, from the
command line with "sawmill -p profile -a rcrt").
Flat vs. Hierarchical Cross-Reference Tables
In additional to having a list of fields, cross-reference tables also
have one other option: they can be "flat" or "hierarchical" (i.e., not
flat). In the case of the Days cross-reference report, a "flat" table
would be one which contained one row per day. A hierarchical table
would also contain one row per month, one row per year, and one
row representing the entire dataset. Hierarchical tables take longer to
generate (about twice as long, though it depends on the field), but
they make certain queries faster than flat tables. In particular, the
Overview can be computed from the single "entire dataset" line of a
hierarchical cross-reference table; for a flat table, the Overview will
have to be computed by combining all rows, which takes longer.
Similarly, a hierarchical report like a Pages/Directories report in web
log analysis can be generated directly from a hierarchical
cross-reference table, with each row of the cross-reference table used
directly as one row of the report (because each directory in the pages
hierarchy has a single row in a hierarchical xref table); with a flat
table, all files in a folder must be aggregated, to compute the row for
that folder.
So, a "flat" cross-reference table is somewhere between no
cross-reference table and a full hierarchical table--it speeds reports
some, but not as much as a hierarchical table, and it takes some time
to build, but not as long as a hierarchical table. If a report needs to
be faster than it is without a cross-reference table, but the database
build needs to be faster than it would be with a full hierarchical
cross-reference table, a "flat" table may be a good compromise.
Sawmill's Default Behavior Favors Reporting Speed
Sawmill's default behavior is to add a hierarchical cross-reference
table for every
non-aggregating database field. Each table cross-references that field
against the date/time field, and aggregates all numerical fields. This
set of cross-reference tables provides high speed for all standard
top-level unfiltered reports, and for all top-level reports filtered by
date. In general, each of the standard reports will have a
cross-reference table designed to speed that report, with an optional
date filter. For a typical profile, there may be 20-30 reports, so this
is 20-30 default cross-reference table, each of them requiring a full
table scan during database build. In other words, Sawmill's default
behavior is to optimize all reports for speed, by making the database
build time longer.
The sum of all this pre-calculation can exceed the time to import the
data to the main table, so the time to build all cross-reference
tables may be the largest part of the database build time. Removing
some or all cross-reference tables can greatly reduce build and update
times.
Adding Cross-Reference Tables For A New Report
Though Sawmill optimizes the profile for speed when it is first
created, by adding a cross-reference table for every report, it does
not optimize custom reports added later. If you add, for instance, a
"source IPs by destination IPs" report, it will be generated from the
main table, because there is no cross-reference table by default which
contains both source IP and destination IP. To speed up that report,
add a new cross-reference table containing all columns of the report
(source IP, destination IP, and all aggregating numerical columns). If
the report is filtered, also add the filtered field(s) to the
cross-reference table. Keep in mind, however, that the more
non-aggregating columns there are in a cross-reference table, the
larger the table is (each field multiplies the size of the table by the
number of values in the new field, roughly). Cross-reference tables
with two or three non-aggregating fields are usually reasonable--moving
to four or five may make the database so large and slow that the
reports would be best generated from the main table.
Turning Off All Cross-Reference Tables When Creating A Profile
Because cross-references have such a large impact on build performance,
there is a "gang switch" in the Create Profile Wizard can be used to
turn them all off, by default. For fastest database build performance,
you can un-check the "Turn on cross reference groups" option in the New
Profile Wizard (while creating a profile), and all cross-reference
groups will be automatically disabled:
Turning Off Cross-References In The New Profile Wizard
As described above, this will greatly speed database builds and
updates, but will make all top-level reports much slower, especially
for very large datasets.
Advanced Topic: Cross-References And Real-Time
When using real-time importing in a profile, Sawmill "streams" the log
data, typically from a command-line log source, adding it to the main
table of the database immediately as it is sent by the log source. The
database build never ends--it is always importing data. When a report
is required, the build pauses temporarily, and the report is generated;
then the build resumes.
If there is a cross-reference table which can deliver the requested
report, the xref table must be updated with the latest data in the main
table (only that table will be updated, not all of them), before the
report can be generated. This introduces an extra step to report
generation, which can be quite long if a large number of lines have
been added since the last report request. Disabling cross-reference
tables eliminates this delay, and allows real-time reports to begin
generating immediately. However, with cross-reference tables disabled,
the report will have to generate directly from the main table, so the
query portion of the report will take longer. This means there is a
performance cost during reporting either way, whether the
cross-reference is on, or off. Generally, the more frequently reporting
is expected, the more benefit there will be from cross-reference tables
in real-time, because the update to the cross-reference table will be
smaller and faster. It is also possible to reduce the xref portion of
real-time reports by scheduling a report to be generated (to disk, or
to email) periodically, so when an ad hoc report is requested, its xref
will have been fairly recently updated by the scheduled task.
Advanced Topic: Two-Stage Database Building: Building
Without Cross-References, Then Adding Cross-References
When a dataset is particularly difficult to build, whether because
of issues with log filters, dataset size, build time, resources (memory
or disk), or even bugs, it can be useful to build it in two stages:
first building without cross-references, and when that succeeds,
building the cross-references separately. Because the parsing of the
log data, and import into the main table, takes only a fraction of the
time required for a full database build with all cross-references, and
because parsing and import generally requires less memory and disk
space than the cross-references, it can be useful to first build the
"core" of the database (the main table alone), and build the other
pieces only if the core build succeeds. It is also possible to generate
reports from the core database immediately when it's done building,
while simultaneously building xrefs elsewhere (e.g., in a different
installation or profile, using a copy of the core database).
This can be done by (1) disabling all cross-references (perhaps in the
Create Profile Wizard), (2) building the database, (3) optionally
backing up the successful "core" database, (4) enabling some or all
cross-references, and (5) updating the database to force the
newly-enabled cross-references to be built. It is also possible to
build or rebuild cross-references from the command line with "sawmill
-p profile -a rcrt".
If a particular xref table is using too much memory, or causing other
problems, this approach can be extended into multiple steps: first
build the core database, then add some cross-references, and
build them, then add some more, etc., with database backups between
each step. This ensures that if resource outage or some other issue
caused the cross-reference build to fail, the database is still
available to be used.
Two-stage database building is not usually required--it is
generally only useful when the database is extremely large (several
days to build the database), and reports are required soon. It reduces
loss of time in the event of a failure, and allows some reports
to be generated, albeit slowly, sooner than they would be available if
the entire build had to complete. If there's no hurry, it's usually
best to just let the build run, and if it runs out of memory or disk
space, fix the problem then and restart it.
Advanced Topic: Cross-Reference Tables In External SQL Databases
Cross-reference tables work similarly in external SQL database (e.g.,
Microsoft SQL Server, Oracle, or MySQL), as in the internal database,
but with one important difference. When a "unique" field is used (like
"unique client IPs" or "visitors" or "sessions"), the internal database
can represent the necessary list of unique values in a single row, but
other SQL databases require multiple rows to represent the list. This
means that a Days xref table does not have one row per day, in an
external SQL database--it has one row per day per visitor on that
day. This makes xref tables larger and slower in external SQL
databases. They still greatly outperform direct queries of the main
table, but they do not have the compact representation used in the
internal database. Therefore, with external SQL databases, the
performance benefits of cross-reference tables are reduced, versus the
internal database. This is one of the major advantages of the internal
database, over external SQL databases.
Related Topic: Database Indices
This newsletter has discussed cross-reference groups, which are a
performance optimization which, while not required to generate reports,
makes them faster at a cost in database build time. That description
also exactly describes database indices. Indices are separate from
cross-reference groups, but are similar in that they are computed
during database build time, and speed up reports. Indices, however, are
used only when filtering, and only when there is no cross-reference
table available to deliver the report results. Indices work by pointing
Sawmill to only the relevant rows of the main table, so it doesn't need
to scan the entire main table to generated a filtered report; indices
provide increasing performance benefit as more report filters are
applied. Indices can be turned on or off in the Database Fields section
of Config, separately for each database field; they also can be all
turned off in the Create Profile Wizard, on the same page where all
xrefs can be disabled. Much of the discussion above of trade-offs
between database build time and report time, for xrefs, also applies to
indices.
For the very best database build speed, turn off all cross-references
and all indices (and possibly sessions; this can all be done in the
Create Profile Wizard).
Professional Services
This newsletter describes customization of Cross-Reference Groups. If
you need
assistance with tuning the performance of your cross-reference tables,
or other aspects of your profiles, or
with
any other Sawmill tasks, our Sawmill Experts
can help. Contact sales@sawmill.net
for more
information.