{= include("docs.util"); start_docs_page(docs.technical_manual.page_titles.resources); =}
Manage your Memory, Disk, and Time Resources
$PRODUCT_NAME processes a huge amount of data while building a database or displaying statistics. Because of this, it uses a lot of resources: disk space, memory, and processing time. If you are considering running $PRODUCT_NAME on a public or shared server, you may want to check the host's resource policy to see whether high-CPU programs are allowed there.
However, you can customize $PRODUCT_NAME to use less of some of these resources by using more of others. You can also customize $PRODUCT_NAME to use less of all resources by reducing the amount of data in your database. This section describes the options that let you manage your memory, disk, and time resources.
A database is built (or updated) in three stages: first, the main table is created by processing the log data; then, the cross-reference tables are built from the main table; finally, the main table indices are created.
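To make the three stages concrete, here is a minimal sketch in Python of a toy build pipeline. The record format, field names, and data structures are illustrative assumptions; they are not $PRODUCT_NAME's actual internals.

    from collections import defaultdict

    def parse_entry(line):
        # Toy log format: whitespace-separated "ip page" records (an assumption).
        ip, page = line.split()
        return {"ip": ip, "page": page}

    def build_database(log_lines):
        # Stage 1: create the main table by processing the log data.
        main_table = [parse_entry(line) for line in log_lines]

        # Stage 2: build cross-reference tables from the main table
        # (here, precomputed hit counts per field value).
        xrefs = {}
        for field in ("ip", "page"):
            counts = defaultdict(int)
            for row in main_table:
                counts[row[field]] += 1
            xrefs[field] = dict(counts)

        # Stage 3: create the main table indices (value -> row numbers).
        indices = {}
        for field in ("ip", "page"):
            index = defaultdict(list)
            for i, row in enumerate(main_table):
                index[row[field]].append(i)
            indices[field] = dict(index)

        return main_table, xrefs, indices

    table, xrefs, indices = build_database(
        ["1.2.3.4 /index.html", "1.2.3.4 /about.html", "5.6.7.8 /index.html"])
    print(xrefs["ip"])  # {'1.2.3.4': 2, '5.6.7.8': 1}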
One way to speed up all of these stages is to use multiple processors. Depending on the version of $PRODUCT_NAME you're using, it may be able to split database builds across multiple processes, building a separate database with each processor from part of the dataset and then merging the results. This can provide a significant speedup -- a 65% improvement using two processors is fairly typical.
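Here is a minimal sketch of that split-and-merge approach using Python's multiprocessing module. The per-chunk "database" is reduced to a simple hit counter, and the chunking and merge logic are illustrative assumptions, not $PRODUCT_NAME's actual mechanism.

    from collections import Counter
    from multiprocessing import Pool

    def build_partial(chunk):
        # Build a partial database (here, hits per IP) from one slice of the logs.
        return Counter(line.split()[0] for line in chunk)

    def parallel_build(log_lines, workers=2):
        # Split the dataset into one slice per worker process.
        size = (len(log_lines) + workers - 1) // workers
        chunks = [log_lines[i:i + size] for i in range(0, len(log_lines), size)]
        with Pool(workers) as pool:
            partials = pool.map(build_partial, chunks)
        # Merge the per-process results into a single database.
        merged = Counter()
        for partial in partials:
            merged.update(partial)
        return merged

    if __name__ == "__main__":
        logs = ["1.2.3.4 /a", "5.6.7.8 /b", "1.2.3.4 /c", "9.9.9.9 /d"]
        print(parallel_build(logs))  # Counter({'1.2.3.4': 2, ...})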
A faster processor is also a very good way to speed up database building -- builds are primarily CPU-bound, so disk speed, memory speed, and other factors matter less than processor speed.
If you've configured $PRODUCT_NAME to look up your IP numbers (using {=docs_option_link('luin')=}), the database building process will be slower than usual, because $PRODUCT_NAME looks up every IP number in your log file. You can speed things up by not using {=docs_option_link('luin')=}, by decreasing the {=docs_option_link('dt')=}, and/or by improving the network connection between $PRODUCT_NAME and the DNS server.
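The sketch below shows why reverse DNS dominates build time and how the two mitigations help: a cache ensures each distinct IP is resolved at most once, and a short timeout (the idea behind {=docs_option_link('dt')=}) bounds the cost of slow lookups. The timeout mechanism here is an illustrative assumption, not $PRODUCT_NAME's implementation.

    import socket
    from concurrent.futures import ThreadPoolExecutor, TimeoutError
    from functools import lru_cache

    _resolver_pool = ThreadPoolExecutor(max_workers=8)

    @lru_cache(maxsize=100_000)
    def resolve(ip, timeout=2.0):
        # Each distinct IP is looked up at most once; repeats hit the cache.
        future = _resolver_pool.submit(socket.gethostbyaddr, ip)
        try:
            return future.result(timeout=timeout)[0]
        except (TimeoutError, socket.herror, socket.gaierror, OSError):
            return ip  # fall back to the bare IP if the lookup fails or is slow

    print(resolve("8.8.8.8"))  # typically "dns.google"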
You can also speed up all three stages by simplifying the database structure, either by eliminating database fields or by using log filters to simplify them. For instance, if you add a log filter which converts all IP addresses to just their first two octets, that field will be much simpler than if you store full IPs.
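For example, here is a minimal sketch of the effect of such a filter; the function mirrors what a two-octet log filter does to the field's values, not $PRODUCT_NAME's actual filter syntax.

    def truncate_ip(ip, octets=2):
        # Keep only the first `octets` dot-separated components of the address.
        return ".".join(ip.split(".")[:octets])

    ips = ["123.124.125.126", "123.124.9.1", "10.0.0.1"]
    print(sorted({truncate_ip(ip) for ip in ips}))  # ['10.0', '123.124']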
Cross-reference tables can be eliminated entirely to improve database build performance; however, eliminating them will slow any query that would otherwise have used a cross-reference table. See {=docs_chapter_link('xref')=} for more details.
For most large datasets, the major factor in memory usage during builds is the item lists. There is one list per field, and each list includes every value for that field. For large fields, these lists can be huge -- for instance, if there are 100 million unique IPs, and the average IP is 10 bytes long (a full dotted quad like "123.124.125.126" is 15 bytes), then the total memory required for that list will be 100M * 10 bytes, or about 1G of memory. $PRODUCT_NAME uses memory-mapped files for these lists, so depending on the operating system's implementation of memory-mapped files, they may appear as normal memory usage, virtual memory usage, or something else. However, most 32-bit operating systems restrict each process to 2G of address space, including mapped files, so it doesn't take too much to exceed that.
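The arithmetic is simple enough to check directly; this back-of-the-envelope estimator follows the calculation above, with the same illustrative figures.

    def item_list_bytes(unique_values, avg_value_bytes):
        # Estimated item list size: one entry per unique value.
        return unique_values * avg_value_bytes

    # 100 million unique IPs at an average of 10 bytes each:
    estimate = item_list_bytes(100_000_000, 10)
    print(f"{estimate / 2**30:.2f} GiB")  # 0.93 GiB -- roughly half of a 2G limit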
The most complete solution is to get more memory and to use a 64-bit operating system, which lifts the 2G limit. Even large datasets seldom reach 1G for the largest item list, and it is usually only a handful of fields which are large, so 2G is usually enough. But if you're running out of memory with 2G, it may be time to move to a 64-bit OS.
If a 64-bit OS isn't an option, you will need to simplify your database fields using log filters. For example, a filter which chops off the last octet of the IP will greatly reduce the number of unique IPs, probably dropping a huge 1G item list to under 100M. You may also want to simply eliminate the troublesome field if there is no need for it -- for instance, the uri-query field in web logs is sometimes not needed, but tends to be very large. To determine which field is the problem, build the database until it runs out of memory, and then look at the database directory (typically in LogAnalysisInfo/Databases) to see which files are large, as in the sketch below. Pay particular attention to the 'items' $lang_stats.directory -- if files in the xyz $lang_stats.directory are particularly huge, then the xyz field is the problem.
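A minimal sketch of that diagnostic step: walk the database directory and list the largest files, so the oversized field stands out. The path is the typical default mentioned above; substitute your actual database location.

    import os

    def largest_files(root, top=10):
        # Collect (size, path) for every file under root, largest first.
        sizes = []
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                sizes.append((os.path.getsize(path), path))
        return sorted(sizes, reverse=True)[:top]

    for size, path in largest_files("LogAnalysisInfo/Databases"):
        print(f"{size / 2**20:10.1f} MB  {path}")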
Finally, if you need to use less disk space or memory because of a quota on your web server, you may be able to work around the problem by running $PRODUCT_NAME on a local machine, where you control the disk space, and configuring it to fetch the log data by FTP.
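A minimal sketch of that local-fetch idea using Python's standard ftplib; the host, credentials, and paths are placeholders, and $PRODUCT_NAME's own FTP fetching is configured through its options rather than code like this.

    from ftplib import FTP

    def fetch_log(host, user, password, remote_path, local_path):
        with FTP(host) as ftp:
            ftp.login(user, password)
            with open(local_path, "wb") as f:
                # Download the remote log file in binary mode.
                ftp.retrbinary(f"RETR {remote_path}", f.write)

    fetch_log("ftp.example.com", "user", "secret",
              "/logs/access.log", "access.log")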
{= end_docs_page() =}