Memory, Disk, and Time Usage

Sawmill processes a huge amount of data while building a database or displaying statistics, so it can use a lot of disk space, memory, and processing time. You can configure Sawmill to use less of one resource at the cost of using more of another. You can also reduce its use of all three by reducing the amount of data in your database.

This section describes the options that let you manage memory, disk, and time usage. If you are thinking about running Sawmill on a public or shared server, check whether the server's operator has a resource policy, and whether that policy allows high-CPU programs to run there.

Building the database faster

A database is built or updated in three stages:

  1. The log data is processed, creating the main table.
  2. The cross-reference tables are built from the main table.
  3. The main table indices are created.

One way to speed up all three stages is to use multiple processors. Depending on the version of Sawmill you're using, it may be able to split a database build across multiple processes, with each processor building a separate database from part of the dataset, and then merge the results. This can provide a significant speedup -- a 65% speedup using two processors is fairly typical.
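The split-build-merge idea described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not Sawmill's implementation: the names `build_partial`, `split_chunks`, and `build_parallel` are hypothetical, and the "database" here is just a count of hits per client address.

```python
from collections import Counter


def build_partial(lines):
    """Build a partial result (hit counts keyed by the first field) from one chunk."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


def split_chunks(lines, parts):
    """Split the dataset into at most `parts` contiguous chunks of similar size."""
    size = max(1, -(-len(lines) // parts))  # ceiling division, never zero
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def build_parallel(lines, parts, map_fn=map):
    """Build one partial result per chunk, then merge the partial results."""
    merged = Counter()
    for partial in map_fn(build_partial, split_chunks(lines, parts)):
        merged.update(partial)  # merging adds the counts together
    return merged


# Hypothetical sample data, for illustration only.
log = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /about.html",
    "10.0.0.1 GET /contact.html",
    "10.0.0.3 GET /index.html",
]

# The merged result matches what a single-pass build would produce.
assert build_parallel(log, 2) == build_partial(log)
```

With the default `map_fn`, the chunks are processed one after another. Passing the `map` method of a `multiprocessing.Pool` as `map_fn` would process them in separate worker processes, which is where a speedup on a multi-processor machine would come from.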

A faster processor is also a very good way to speed up database building. Database builds are primarily CPU-bound, so disk speed, memory speed, and other factors matter less than the speed of the processor.

If you've enabled the Look up IP numbers using domain nameserver (DNS) option, database building will be slower than usual, because Sawmill looks up every IP number in your log file. You can speed things up by disabling that option, by decreasing the DNS timeout, and/or by improving Sawmill's bandwidth to the DNS server.

You can also speed up all three stages by simplifying the database structure, either by using log filters or by eliminating database fields. For instance, a log filter that reduces every IP address to its first two octets produces a much simpler field, with far fewer unique values, than full IPs do.
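To make the effect of such a filter concrete, here is a minimal Python sketch. It does not use Sawmill's log filter syntax; `truncate_ip` is a hypothetical helper and the addresses are made-up sample data.

```python
def truncate_ip(ip, keep_octets=2):
    """Keep only the first `keep_octets` octets of a dotted-quad IPv4 address."""
    parts = ip.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        return ip  # leave anything that is not a plain IPv4 address untouched
    return ".".join(parts[:keep_octets])


# Hypothetical sample data, for illustration only.
ips = ["192.168.1.10", "192.168.1.77", "192.168.200.3", "10.4.0.9"]

unique_full = len(set(ips))
unique_truncated = len({truncate_ip(ip) for ip in ips})

print(unique_full, unique_truncated)  # 4 unique values shrink to 2
```

Calling `truncate_ip(ip, keep_octets=3)` drops only the last octet, which keeps more detail while still collapsing many addresses into one value.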

Cross-reference tables can also be eliminated entirely to improve database build performance. The tradeoff is slower query performance for any query that would have used a cross-reference table. See Cross-Referencing and Simultaneous Filters for more details.
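The tradeoff can be pictured with a small conceptual sketch in Python. It assumes a cross-reference table behaves like a pre-aggregated summary of the main table; the data layout and the names `build_xref` and `count_by_scan` are hypothetical and do not reflect Sawmill's actual storage format.

```python
from collections import Counter

# Hypothetical main table: one row per log entry.
main_table = [
    {"page": "/index.html", "browser": "Firefox"},
    {"page": "/index.html", "browser": "Chrome"},
    {"page": "/about.html", "browser": "Firefox"},
    {"page": "/index.html", "browser": "Firefox"},
]


def build_xref(rows, fields):
    """Pre-aggregate row counts for every combination of `fields` (costs build time)."""
    return Counter(tuple(row[f] for f in fields) for row in rows)


def count_by_scan(rows, **wanted):
    """Answer the same question without a summary table by scanning every row."""
    return sum(1 for row in rows if all(row[k] == v for k, v in wanted.items()))


xref = build_xref(main_table, ("page", "browser"))

# Both approaches give the same answer; the lookup avoids a full scan at query time.
fast = xref[("/index.html", "Firefox")]
slow = count_by_scan(main_table, page="/index.html", browser="Firefox")
assert fast == slow == 2
```

Skipping `build_xref` saves work during the build, but every matching query then has to fall back to something like `count_by_scan`.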

Using less memory during database builds

Sawmill keeps most information on disk rather than in memory, so its memory usage is typically not very high (previous versions of Sawmill used memory more extensively, but recent ones are more frugal). However, Sawmill does cache some information in memory for better performance, and you can shrink these caches by setting custom values for Maximum memory usage type and Maximum memory usage size.

If you're still running into memory errors, make sure you're using a 64-bit operating system. This lifts the 2GB limit that applies on a 32-bit system, allowing Sawmill to use more RAM if it needs to. Even large datasets seldom actually reach 1GB per core, so 2GB per core is usually enough. But if you're running out of memory with 2GB of RAM per core, it may be time to move to a 64-bit OS.

If a 64-bit OS isn't an option, you will need to simplify your database fields using log filters. For example, a filter that chops off the last octet of each IP will greatly reduce the number of unique IPs, probably shrinking a huge 1GB item list to under 100MB. You may also want to simply eliminate the troublesome field if you don't need it -- for instance, the uri-query field in web logs is sometimes not needed, but tends to be very large. To determine which field is the problem, build the database until it runs out of memory, and then look at the database directory (typically in LogAnalysisInfo/Databases) to see which files are large. Pay particular attention to the 'items' directory -- if the files in the xyz directory are particularly huge, then the xyz field is a problem.
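Finding the large files by hand can be tedious, so a short script may help. The following Python sketch walks a directory tree and reports the biggest files; `largest_files` is a hypothetical helper, and the directory you pass in is whatever your own database directory happens to be.

```python
import os


def largest_files(root, top=10):
    """Return up to `top` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    sizes.sort(reverse=True)
    return sizes[:top]


def report(root, top=10):
    """Print the largest files under `root`, one per line, with sizes in MB."""
    for size, path in largest_files(root, top):
        print(f"{size / (1024 * 1024):10.1f} MB  {path}")
```

For example, `report("LogAnalysisInfo/Databases")` would list the ten largest files under that directory. Files that sit in a subdirectory of 'items' point to the field of the same name.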

Another option is to use an external database server, so that Sawmill runs on a low-memory system while the database server (MySQL, Oracle, or Microsoft SQL Server) runs on a separate system with more resources. Sawmill's memory usage is mostly due to its internal database, so it uses little memory when the database is handled elsewhere.

Finally, if you need to use less disk space or memory because of a quota on your web server, you may be able to get around the problem by running Sawmill on a local machine, where you control the disk space constraints, and setting it to fetch the log data by FTP.