FAQ: Log Entry Ordering

Does the log data I feed to Sawmill need to be in chronological order?

Short Answer

It depends on the format, but in most cases, the log data can be in any order.

Long Answer

Sawmill usually doesn't care what order the log data is in. For most common formats, which have one event per line of log data, Sawmill will just read the log data in any order, and if a reordering is required for some analysis (like web server sessions), it will automatically sort it before doing the analysis. Similarly, when using multiprocessor parsing, Sawmill will split the log data into chunks to distribute to each parsing server, and may parse or import them out of chronological order, without this causing any problems for the reports.

However, there are exceptions. If a log format has dependencies between lines of log data, i.e., if a line refers to previous lines in any way, then the logs may need to be processed in chronological order to get consistent results. Otherwise, the first few lines of each block of log data may be impossible to interpret, because they depend on lines in other blocks that have not been processed yet, or that are being analyzed simultaneously in other threads.

Examples of this type of dependency are Postfix logs, and many other mail server logs, which log "recipient" events on separate lines from "sender" events; Wowza and Flash media server logs, which report incremental bandwidth on each line, so each line must be compared to previous lines to determine the actual bandwidth used by that event; and any log format plug-in that logs events across multiple lines (there are many, but they tend to be less frequently analyzed formats). Most common formats are not affected, including all web servers, all firewall, proxy, or gateway servers, and all media servers except Flash and Wowza.

This "boundary problem" is unavoidable to some degree, since every log dataset has at least two boundaries, at the first line of log data and the last one. But it is exacerbated by out-of-order log file processing, and multiprocessor parsing, both of which introduce additional boundaries into the analysis.
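The effect of extra boundaries can be illustrated with a minimal Python sketch. It is not Sawmill code: the log is a hypothetical list of running byte counters, loosely modeled on formats where each line must be compared to the previous one, and only the first line of each block is treated as an uninterpretable boundary line.

```python
# Hypothetical log: each entry is a running byte counter for one stream.
# The bytes used by an event are the difference from the previous line.
CUMULATIVE_BYTES = [100, 250, 400, 900, 1000, 1600]


def usage_per_event(lines):
    """Return per-line usage; the first line of a block has no predecessor."""
    usage = []
    previous = None
    for value in lines:
        if previous is None:
            usage.append(None)  # boundary line: cannot be interpreted
        else:
            usage.append(value - previous)
        previous = value
    return usage


def total_known_usage(chunks):
    """Parse each chunk independently; return (total usage, boundary lines)."""
    total = 0
    boundaries = 0
    for chunk in chunks:
        for used in usage_per_event(chunk):
            if used is None:
                boundaries += 1
            else:
                total += used
    return total, boundaries


# One block: a single boundary line at the start of the dataset.
whole = total_known_usage([CUMULATIVE_BYTES])

# Two blocks, as if split between two parsing processes: one extra boundary.
split = total_known_usage([CUMULATIVE_BYTES[:3], CUMULATIVE_BYTES[3:]])

print(whole)  # (1500, 1)
print(split)  # (1000, 2)
```

In this toy dataset the extra boundary hides the 500 bytes recorded on the first line of the second block, which is the kind of discrepancy that additional boundaries can introduce.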

A typical analysis has few boundaries relative to the number of "good" lines of log data, so this issue can usually be ignored. However, when multiprocessor parsing is used, it may cause slight differences in reported numbers from one build of the same dataset to the next. In rare cases, the differences can be large.

If the boundary problem needs to be eliminated in a profile, it can be mostly resolved by turning off multiprocessor parsing (with Parsing server distribution method); this eliminates all boundaries except those between files. If the remaining boundaries between files are an issue (which can happen if the profile uses log filters to keep information from previous lines and apply it to current lines), the logs can be manually imported in chronological order, for instance by concatenating them into a single file and importing that file.
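One way to prepare such a concatenated file outside Sawmill is sketched below in Python. The function names are hypothetical, and the sketch assumes that every log line begins with an ISO 8601 timestamp (for example 2024-01-01T00:00:00), so that comparing the first timestamp of each file as text gives chronological order, and that the files do not overlap in time. Other timestamp formats would need real date parsing.

```python
from pathlib import Path


def first_timestamp(path):
    """Return the leading token of the first non-empty line.

    Assumption: that token is an ISO 8601 timestamp, which sorts
    chronologically when compared as plain text.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip():
                return line.split(maxsplit=1)[0]
    return ""  # empty files sort first and contribute nothing


def concatenate_chronologically(log_paths, output_path):
    """Write the given log files to output_path, earliest file first."""
    ordered = sorted(log_paths, key=first_timestamp)
    with open(output_path, "wb") as out:
        for path in ordered:
            # Reads each file fully into memory; fine for a sketch.
            data = Path(path).read_bytes()
            out.write(data)
            # Keep the last line of one file from merging with the
            # first line of the next.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return ordered
```

The resulting single file can then be used as the log source, so that the only remaining boundaries are the first and last lines of the dataset.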

Database filters also provide a way of solving this problem in some cases. Since database filters, unlike log filters, operate on the database after it has been imported, and since they can sort the data before they operate, it is usually possible to process data in the required order, regardless of the log data order. The Sessions snap-on uses this technique to analyze the data chronologically, and in order of IP, without requiring the imported log data to be in any special order.
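The sort-before-processing idea can be sketched in Python as follows. This is not the Sessions snap-on's implementation or Sawmill's filter language: the event tuples, the function name, and the 30-minute session timeout are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

SESSION_TIMEOUT = 30 * 60  # seconds; hypothetical value for this sketch


def count_sessions(events):
    """Count sessions per IP from (ip, unix_time) tuples in any order."""
    # Sort by IP, then by time, so each visitor's events are adjacent
    # and chronological regardless of how they were imported.
    ordered = sorted(events, key=itemgetter(0, 1))
    sessions = {}
    for ip, group in groupby(ordered, key=itemgetter(0)):
        count = 0
        last_time = None
        for _, timestamp in group:
            # A new session starts at the first event, or after a long gap.
            if last_time is None or timestamp - last_time > SESSION_TIMEOUT:
                count += 1
            last_time = timestamp
        sessions[ip] = count
    return sessions


# Events deliberately listed out of chronological order.
events = [
    ("10.0.0.2", 5000),
    ("10.0.0.1", 100),
    ("10.0.0.1", 4000),
    ("10.0.0.2", 5100),
    ("10.0.0.1", 200),
]

print(count_sessions(events))  # {'10.0.0.1': 2, '10.0.0.2': 1}
```

Because the data is sorted before the session logic runs, the same counts come out whatever order the events arrive in.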