You’re receiving this newsletter because, when you downloaded or
purchased Sawmill, you checked the box to join our mailing list. If you
wish to be removed from this list, please send an email with the
subject line “UNSUBSCRIBE” to newsletter@sawmill.net.
News
Sawmill 7.2.10 shipped on August 4, 2007. This is a minor "bug fix"
release, and it is free to existing Sawmill 7 users. It is not a
critical update, but it does fix a number of bugs, adds support for
many new log formats, and adds a few small features. It is recommended
for anyone who is experiencing problems with Sawmill 7.2.9 or earlier.
You can download it from http://sawmill.net/download.html. This issue
of the Sawmill Newsletter describes techniques for
improving the performance of database updates.
Get the Most out of Sawmill with Professional Services
Looking to get more out of your statistics from Sawmill? Running short
on time, but need the information now to make critical business
decisions? Our Professional Service Experts are available for just this
situation and many others. We will assist in the initial installation
of Sawmill using best practices; work with you to integrate and
configure Sawmill to generate reports in the shortest possible time. We
will tailor Sawmill to your environment, create a customized solution,
be sensitive to your requirements and stay focused on what your
business needs are. We will show you areas of Sawmill you may not even
be aware of, and demonstrate streamlined methods to get you the
information more quickly. Often you'll find that Sawmill's deep analysis can even
provide you with information you've been after but never knew how to
reach, or
possibly never realized was readily available in reports. Sawmill is an
extremely powerful tool for your business, and most users only exercise
a fraction of this power. That's where our experts really can make the
difference. Our Sawmill experts have many years of experience with
Sawmill
and with a large cross section of devices and business sectors. Our
promise is to quickly deliver a cost-effective solution that fits your
business, and to greatly expand your ROI with only a few hours of
fee-based Sawmill Professional Services. For more information, a quote,
or to speak directly with a Professional Services expert, contact
consulting@flowerfire.com.
Tips & Techniques: Improving the Performance of Database Updates
A typical installation of Sawmill updates the database for each profile
nightly. Each profile points its log source to the growing dataset it
is analyzing, and each night, a scheduled database update looks for new
data in the log source, and adds it to the database.
This works fine for most situations, but if the dataset gets very
large, or has a very large number of files in it, updates can become
too slow. In a typical "nightly update" environment, the updates are
too slow if they have not completed by the time reports are needed in
the morning. For instance, if updates start at 8pm each night, and take
12 hours, and live reports are needed between 7am and 8pm, then the
update is too slow, because it will not complete until 8am, and reports
will not be available from 7am to 8am. The downtime can be completely
eliminated by using separate installations of Sawmill (one for
reporting, and one for updating), but even then, updates can be too
slow, if they take more than 24 hours to update a day of data.
There are many ways of making database updates faster. This newsletter
lists common approaches, and discusses each.
1. Use a local file log source
If you're pulling your log data from an FTP site, an HTTP site, or a
command line log source, Sawmill has less information available for
efficiently skipping data in the log files. It will need to re-download
and re-examine more data than if the files are on a local disk, or a
locally mounted disk. Using a local file log
source will speed up updates, by allowing Sawmill to skip previously
seen files faster. A "local file log source" includes mounted or shared
drives, including mapped drive letters, UNC paths, NFS mounts, and
AppleShare mounts; these are more efficient than FTP, HTTP, or command
line log sources, for skipping previously seen data.
2. Use a local file log source on a local disk
As mentioned above (1), network drives are still "local file log
sources" to Sawmill, because it has full access to examine all their
attributes as though they were local drives on the Sawmill server. This
gives Sawmill a performance boost over FTP, HTTP, and command line log
sources. But with network drives, all the information still has to be
pulled over the network. For better performance, use a local drive, so
the network is not involved at all. For instance, on Windows, put the
logs on the C: drive, or some other drive physically inside the Sawmill
server. Local disk access is much faster than network access, so using
local files can significantly speed updates.
If the logs are initially generated on a system other than the Sawmill
server, they need to be transferred to the local disk before
processing, when using this approach. This can be done in a separate
step, using a third-party program. rsync is a good choice, and
works on all operating systems (on Windows, it can be installed as part
of Cygwin). On Windows, DeltaCopy
is also a good choice. Most high-end FTP clients also support
scheduling of transfers, and incremental transfers (transferring only
files which have not been transferred earlier). The file transfers can
be scheduled to run periodically during the day; unlike database
updates, they can run during periods when reports must be available.
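For instance, a scheduled transfer step might look like this minimal
sketch, which shells out to rsync to mirror a remote log directory to a
local disk; the hostname and paths are hypothetical, so adapt them to
your environment:

    # Sketch: mirror new log files from the logging server to a local
    # disk, so Sawmill's log source can point at a local directory.
    # Paths and hostname are hypothetical examples. Run this from a
    # scheduler (cron, Windows Task Scheduler).
    import subprocess

    SOURCE = "logserver:/var/log/apache/"  # remote log directory (hypothetical)
    DEST = "/logs/local/"                  # local directory used as the log source

    # -a preserves file attributes; -z compresses in transit. rsync
    # transfers only files (and portions of files) which have changed
    # since the last run, making repeated transfers incremental.
    subprocess.run(["rsync", "-az", SOURCE, DEST], check=True)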
3. Turn on "Skip processed files on update by pathname"
During a database update, Sawmill must determine which log data it has
already imported into the database, and import the rest. By default,
Sawmill does this by comparing the first few kilobytes of each file
with the first few kilobytes of files which have been imported (by
comparing checksums). When it encounters a file it has seen before, it
checks whether there is new data at the end, by skimming past the
previously seen data and resuming processing where that data ends. This
is a very robust
way of detecting previously seen data, as it allows files to be
renamed, compressed, or concatenated after processing; Sawmill will
still recognize them as previously seen. However, the algorithm
requires Sawmill to look briefly through all the log data to determine
what it has already seen. For very large datasets, especially datasets
with many files, this can become the longest part of the update
process.
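The general idea resembles the following sketch, which is a simplified
illustration of header-checksum matching and not Sawmill's actual
implementation; the 4 KB header size is an illustrative assumption:

    # Sketch: recognize a previously seen file by a checksum of its
    # first few kilobytes, rather than by its pathname. Simplified
    # illustration only, not Sawmill's actual implementation.
    import hashlib

    HEADER_SIZE = 4096  # "first few kilobytes" (illustrative value)

    def header_checksum(path):
        with open(path, "rb") as f:
            return hashlib.md5(f.read(HEADER_SIZE)).hexdigest()

    def is_previously_seen(path, seen_checksums):
        # A renamed or concatenated file still matches, because only
        # its leading bytes are hashed; a truly new file will not match.
        return header_checksum(path) in seen_checksums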
A solution is to skip files based on their pathnames, rather than their
contents. Under Config -> Log Data -> Log Processing, there is an
option "Skip processed files on update by pathname." If this option is
checked, Sawmill will look only at the pathname of a file when
determining if it has seen that data before. If the pathname matches
the pathname of a previously processed file, Sawmill will skip the
entire file. If the pathname does not match, Sawmill will process it.
Skipping based on pathnames takes almost no time, so turning this
option on can greatly speed updates, if the skipping step is taking a
lot of the time.
This will not work if any of the files in the log source are growing.
If the log source is a log file which is being continually appended to,
Sawmill will put that log file's data into the database, and will skip
that file on the next update, even though it now has new data at the
end, because the pathname matches (and with this option on, only the
pathname is used to determine what is new). So this option works best
for datasets which appear on the disk one complete file at a time, and
where files do not grow gradually during the time when Sawmill might be
updating.
Typically, this option can be used by processing log data on a local
disk, setting up file synchronization (see 2, above), and having it
synchronize only the complete files. It can also be used if logs are
compressed each day, to create daily compressed logs; then the
compressed logs are complete, and can be used as the log source, and
the uncompressed, growing log will be ignored because it does not end
with the compression extension (e.g., .zip, .gz, or .bz2). Finally,
there is another option, "Skip most recent file" (also under Config
-> Log Data -> Log Processing), which looks at the modification
date of each file in the log source (which works for "local file"
log sources only, but remember, that includes network drives), and
skips the file with the most recent modification date. This allows fast
analysis of servers, like IIS, which timestamp their logs, but do not
compress them or rotate them; only the most recent log is changing, and
all previous days' logs are fixed, so by skipping the most recent one,
we can safely skip based on pathnames.
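To picture the effect of "Skip most recent file", here is a minimal
sketch which emulates it outside Sawmill, assuming a hypothetical local
log directory:

    # Sketch: emulate "Skip most recent file" outside Sawmill, by
    # listing a log directory and dropping the file with the newest
    # modification date (the one still growing). The directory path is
    # a hypothetical example.
    import os
    import glob

    log_files = glob.glob("/logs/local/*.log")
    if log_files:
        newest = max(log_files, key=os.path.getmtime)
        complete_files = [f for f in log_files if f != newest]
        # complete_files can now be processed safely, and skipped by
        # pathname on later updates, because they will no longer change.
        print("Skipping still-growing file:", newest)
        print("Safe to process:", complete_files)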
4. Keep the new data in a separate directory
For fully automated Sawmill installations, there is often a scripting
environment built around Sawmill, which manages log rotation,
compression, import into the database, report generation, etc. In an
environment like this, it is usually simple to handle the "previously
seen data" algorithm at the master script level, by managing Sawmill's
log source so it only has data which has not been imported into the
database. This could be done by moving all processed logs to a
"processed" location (a separate directory or folder), after each
update; or it could be handled by copying the logs to be processed into
a "newlogs" folder, updating using that folder as the log source, and
then emptying "newlogs" before the next update. By
ensuring, at the master script level, that the log source contains only
the new data, you can bypass Sawmill's skipping algorithm entirely, and
get the best possible performance.
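As a sketch of the "newlogs" variant, the following master-script
fragment stages pending files, updates the database, and empties the
staging folder. The paths and profile name are hypothetical, and the
"-p ... -a ud" update command should be verified against your Sawmill
version's command line documentation:

    # Sketch of the "newlogs" approach: move pending logs into a
    # staging folder, update the database from it, then empty it.
    # Paths and profile name are hypothetical; verify the Sawmill
    # update command ("-a ud") against your version's documentation.
    import os
    import shutil
    import subprocess

    PENDING = "/logs/pending"   # where new logs arrive (hypothetical)
    NEWLOGS = "/logs/newlogs"   # the profile's log source (hypothetical)

    os.makedirs(NEWLOGS, exist_ok=True)

    # Stage every pending file for this update.
    for name in os.listdir(PENDING):
        shutil.move(os.path.join(PENDING, name), os.path.join(NEWLOGS, name))

    # Update the database from the staged files only.
    subprocess.run(["sawmill", "-p", "myprofile", "-a", "ud"], check=True)

    # Empty the staging folder so the next update sees only new data.
    for name in os.listdir(NEWLOGS):
        os.remove(os.path.join(NEWLOGS, name))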
5. Speed up the database build, or the merge
The choices above are about speeding up the "skip previously-seen data"
part of a database update. But database updates have three parts: they
skip the previously seen data, then build a separate database from the
new data, and then merge that database into the main database. Anything
that would normally speed up a database build will also speed up a
database update, and usually the merge as well. For instance,
deleting database fields, deleting cross-reference tables, rejecting
unneeded log entries, and simplifying database fields with log filters,
can all reduce the amount and complexity of data in the database,
speeding up database builds and updates.
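As one hedged illustration, rejecting unneeded entries can also happen
before Sawmill sees the data at all, by pre-filtering the raw logs with
a small script. This is an alternative to Sawmill's built-in log
filters, and the patterns and paths below are hypothetical:

    # Sketch: reduce the data Sawmill must process by pre-filtering
    # raw logs outside Sawmill. This is an alternative to Sawmill's
    # own log filters; patterns and paths are hypothetical examples.
    patterns_to_reject = ("/images/", "robots.txt")

    with open("/logs/raw/access.log") as src, \
         open("/logs/local/access.log", "w") as dst:
        for line in src:
            # Drop entries which would be rejected anyway, so the
            # update has less data to parse, store, and merge.
            if not any(p in line for p in patterns_to_reject):
                dst.write(line)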
With Enterprise licensing, on a system with multiple processors or
cores, it is also possible to set "Log processing threads" to 2, 4, or
more, in Config -> Log Data -> Log Processing. This tells Sawmill
to use multiple processors or cores during the "build" portion of the
database update (when it's building the separate database from the new
data), which can significantly improve the performance of that portion.
However, it increases the amount of work to be done in the "merge"
step, so using more threads does not always result in a speed increase
for updates.
Active-scanning anti-virus can severely affect the performance
of both builds and updates, by scanning Sawmill's database files
continually as it attempts to modify them. In extreme cases,
performance can be 10x slower when active scanning is enabled. This is
particularly marked on Windows systems. If you have an anti-virus
product which actively scans all file system modifications, you should
exclude Sawmill's installation directory, and its database directories
(if separate) from the active scanning.
Use of a MySQL database has its advantages, but performance is not one
of them--Sawmill's internal database is at least 2x faster than MySQL
for most operations, and much faster for some. Unless you need MySQL
for some other reason (like to query the imported data directly with
SQL, from another program; or to overcome the address space limitations
of a 32-bit server), use the internal database for best performance of
both rebuilds and updates.
Finally, everything speeds up when you have faster hardware. A faster
CPU will improve update times, and a faster disk may have an even
bigger effect. Switching from RAID 5 to RAID 10 will typically double
the speed of database builds and updates, and switching from 10Krpm to
15Krpm disks can give a 20% performance boost. Adding more memory can
help too, if the system is near its memory limit.
Questions or suggestions? Contact support@sawmill.net. If you would
like a Sawmill Professional Services expert to implement this, or
another customization, contact consulting@sawmill.net.