Using Log Filters


Sawmill offers a wide range of log filters, which let you selectively eliminate portions of your log data from the statistics, or convert values in log fields. Log filters are written in the configuration language (see Salang: The Sawmill Language). Log filters should not be confused with the filters that appear in reports; log filters affect how the log data is processed, while report filters affect which parts of the database data are displayed. There are many reasons you might want to filter the log data.

Sawmill's default filters automatically perform the most common filtering (they categorize image files as hits but not page views, strip off page parameters, and more) but you will probably end up adding or removing filters as you fine-tune your statistics.

How Filters Work

Filters are arranged in a sequence, like a computer program, starting with the first filter and continuing up through the last filter. Each time Sawmill processes a log entry, it runs the filters in order, starting with the first one. A filter may accept the log entry by returning "done", in which case it is immediately selected for inclusion in the statistics. If a filter accepts an entry, the other filters are not run; once a filter accepts, the acceptance is final. Alternately, the filter may reject the entry by returning "reject", in which case it is immediately discarded, without consulting any filters farther down the line. Finally, the filter may neither accept nor reject, but instead pass the entry on to the next filter (by returning nothing); in this case, and only in this case, the next filter is run.

In other words, every filter has complete power to pass or reject entries, provided the entries make their way to that filter. The first filter that accepts or rejects the entry ends the process, and the filtering is done for that entry. A filter gets to see an entry only when every filter before it in the sequence has neither accepted nor rejected that entry. So the first filter in the sequence is the most powerful, in the sense that it can accept or reject without consulting the others; the second filter is used if the first has no opinion on whether the entry should be accepted or rejected, etc.
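For example, here is a minimal two-filter sequence, written as advanced filter expressions (the directory name /docs/ is just a placeholder; file_type and page are standard web log fields):

  # Filter 1: reject GIF hits; later filters never see these entries
  if (file_type eq 'GIF') then "reject"

  # Filter 2: accept everything under /docs/ immediately, ending filtering for
  # that entry; for any other entry, return nothing so it passes on to the
  # next filter in the sequence
  if (starts_with(page, "/docs/")) then "done"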

Web Logs: Hits vs. Page Views

For web logs (web server and HTTP proxy), Sawmill distinguishes between "hits" and "page views" for most types of logs. A "hit" is one access to the web server; i.e. one request for a file (it may not actually result in the transfer of a file, for instance if it's a redirect or an error). A "page view" is an access to a page (rather than an image or a support file like a style sheet). For some web sites and some types of analysis, image files, .class files, .css files, and other files are not as important as HTML pages--the important number is how many pages were accessed, not how many images were downloaded. For other sites and other types of analysis, all accesses are important. Sawmill tracks both types of accesses. When a filter accepts an entry, it decides whether it is a hit or a page view by setting the "hits" and "page_views" fields to 1 or 0. Hits and page views are tallied separately, and the final statistics can show separate columns for hits and page views in tables, as well as separate pie charts and graphs. Both hits and page views contribute to bandwidth and visitor counts, but the page view count is not affected by hits on image files and other support files.
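For instance, a sketch of a filter along these lines (the default filters already handle this for image and other support files) could count GIF hits as hits but not as page views:

  # Count GIF accesses as hits, but not as page views
  if (file_type eq 'GIF') then (
    hits = 1;
    page_views = 0;
  );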

The Log Filter Editor

The easiest way to create log filters is in the Log Filter Editor, in the Log Filters section of the Config. To get to the Log Filter Editor, click Show Config in the Profiles list (or click Config in the reports), then click Log Data in the menu on the left, then click Log Filters. The Log Filter Editor lets you create new filters using a friendly graphical interface, without having to write advanced filter expressions. However, some filtering operations cannot be performed without advanced filter expressions, so the Log Filter Editor also provides an option to enter an expression.

Advanced Expression Examples

This section includes some examples of how filter expressions can be used, and how they are put together.

Example: Filtering out GIFs

The following filter rejects GIF files in web logs:

  if (file_type eq 'GIF') then "reject"

Example: Filtering out domains or hosts

The following filter rejects hits from your own domain in web logs:

  if (ends_with(hostname, ".mydomain.com")) then "reject"

You can use a similar filter to filter out hits from a particular hostname:

  if (hostname eq "badhost.somedomain.com") then "reject"

This type of filter can be used on any field, to accept or reject entries based on any criteria you wish.

Field names that appear in filters (like file_type or hostname above) should be exactly the field names as they appear in the profile (not the field label, which is used for display purposes only and might be something like "file type"). Field names never contain spaces, and are always lowercase with underbars between words.
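For instance, a hypothetical filter on a referrer field (assuming your profile includes a field named referrer; the hostname to match is just an example) could reject hits from an unwanted referring site:

  if (contains(referrer, "unwanted-referrer.example.com")) then "reject"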

Example: Filtering out pages or directories

The host filter above can be modified slightly to filter out entries based on any field. One common example is if you want to filter out hits on particular pages, for instance to discard hits from worm attacks. A filter like this:

  if (starts_with(page, "/default.ida?")) then "reject"

rejects all hits on /default.ida, which eliminates many of the hits from the Code Red worm.

A filter like this:

  if (!starts_with(page, "/directory1/")) then "reject"

rejects all hits except those on /directory1/, which can be useful if you want to create a database which focuses on only one directory (sometimes useful for ISPs).
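The same approach extends to several directories; for example, this sketch (the directory names are placeholders) keeps only hits on /directory1/ or /directory2/:

  if (!(starts_with(page, "/directory1/") or starts_with(page, "/directory2/"))) then "reject"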

Example: Filtering out events before a particular date

The following filter rejects entries before 2004:

  if (date_time_to_epoc(date_time) < date_time_to_epoc('01/Jan/2004 00:00:00')) then "reject"

Example: Filtering out events older than a particular age

The following filter rejects entries older than 30 days (60*60*24*30 is the number of seconds in 30 days):

  if (date_time_to_epoc(date_time) < (now() - 60*60*24*30)) then "reject"

Example: Filtering out events outside a particular date range

The following filter rejects all entries except those in 2003:

  if ((date_time_to_epoc(date_time) < date_time_to_epoc('01/Jan/2003 00:00:00')) or (date_time_to_epoc(date_time) >= date_time_to_epoc('01/Jan/2004 00:00:00'))) then "reject"

Advanced Example: Converting the page field to strip off parameters

The parameters on the page field (the part after the ?) are often of little value, and increase the size of the database substantially. Because of that, Sawmill includes a default filter that strips off everything after the ? in a page field (hint: if you need the parameters, delete that filter). Sawmill uses a special "replace everything after" filter for this purpose, but for the sake of this example, here's another filter that does the same thing (more slowly, because pattern matching is a fairly slow operation):

  if (contains(page, "?")) then 
    if (matches_regular_expression(page, "^(.*?\?).*$")) then 
      page = $1 . "(parameters)"

This checks if the page contains a question mark; if it does, it matches the page to a regular expression with a parenthesized subexpression which is set to just the part before and including the question mark. The variable $1 is set automatically to the first parenthesized section, and this variable is used to set the page field to the part before and including the question mark, with "(parameters)" appended. For example, if the original value was /index.html?param1+param2, the result will be /index.html?(parameters). That is exactly what we wanted--the parameters have been stripped off, so all hits on index.html with parameters will have the same value, regardless of the parameters--and that will reduce the size of the database.

The filters look the same in profile files, so you can also edit a filter in the profile file using a text editor. You will need to use a backslash (\) to escape quotes, dollar signs, backslashes, and other special characters if you edit the profile file directly.

Adjusting date/time values for Daylight Savings Time

By default, Sawmill reports the date and time values exactly as they appear in the log data, so if the log data shows 15:00, it will report that hit as 15:00. But if you're logging in GMT, and using Date offset to offset the time to your local time, the offset is constant year-round, and will not take daylight savings time into account. This filter adds an extra hour to the date/time field, between two dates in each year:

  v.date_time_year = substr(date_time, 7, 4);
  if ((date_time_to_epoc(date_time) >=
         date_time_to_epoc("11/Mar/" . v.date_time_year . " 00:00:00")) and
       (date_time_to_epoc(date_time) <
         date_time_to_epoc("04/Nov/" . v.date_time_year . " 00:00:00"))) then
    date_time = epoc_to_date_time(date_time_to_epoc(date_time) + 60*60);

Change 11/Mar and 04/Nov to the correct dates, in the expression above, and rebuild the database to compensate for daylight savings time.

Advanced Example: Adjusting date/time for Daylight Savings Time

If your log data does not automatically adjust its date/time for Daylight Savings Time, and if you need the hours to be exactly right in your log data, you can do the adjustment manually using a log filter. The filter will compute the range of Daylight Savings Time for each year, and use it to determine whether it should subtract an hour from the date/time value in the log data.

First, add filter initialization (to log.filter_initialization in the profile .cfg file), to ensure that some necessary utility functions are loaded:

  # Initialize the filters
  filter_initialization = `
    include 'templates.shared.util.date_time.get_weekday';
    include 'templates.shared.util.date_time.get_month_as_number';
  `

Now add a filter (to log.filters in the profile .cfg file) to compute DST ranges, and subtract when necessary:

  log.filter.dst_adjust = `

    # Make sure v.dst_info exists; that's where we'll store the start and end dates of DST for each year
    v.dst_info = "";
    
    # Extract the year from the date/time
    int year;
    if (matches_regular_expression(date_time, "^([0-9]+)/([A-Za-z]+)/([0-9]+) ")) then
      year = $3;
    
    # Get the subnode for this year, with DST range for this year
    node dst_info_year = subnode_by_name('v.dst_info', year);
    
    # If we haven't computed the start and end of DST for this year, do it now
    if (!subnode_exists(dst_info_year, 'start_date_time_epoc')) then (
    
      # Iterate through the year, starting on Jan 1, looking for the daylight savings range.
      # This could start on Mar 7 more efficiently, if we assume DST starts on the first Sunday in March,
      # but this way it can support any daylight savings range.
      # Iterate until we've found the endpoints, or until we've gone 365 iterations (no infinite loops).
      int first_second_of_this_day_epoc = date_time_to_epoc('01/Jan/' . year . ' 00:00:00');
      int second_sunday_of_march = 0;
      int first_sunday_of_november = 0;
      int sundays_found_in_march = 0;
      int sundays_found_in_november = 0;
      int iterations = 0;
      while (((second_sunday_of_march == 0) or (first_sunday_of_november == 0)) and !(iterations > 365)) (
    
        # Compute the first second of this day in date/time format (dd/mmm/yyyy hh:mm:ss)
        string first_second_of_this_day_date_time = epoc_to_date_time(first_second_of_this_day_epoc);
    
        # Break it into day, month, and year
        if (matches_regular_expression(first_second_of_this_day_date_time, "^([0-9]+)/([A-Za-z]+)/([0-9]+) ")) then (
    
          int day = $1;
          string monthname = $2;
          int monthnum = get_month_as_number(first_second_of_this_day_date_time);
          int year = $3;
    
          # Compute the day of week
          string weekday = get_weekday(year, monthnum, day);
    
          # Count Sundays in March; the second Sunday is the starting point of DST.
          if ((monthname eq "Mar") and (weekday eq "Su")) then (
            sundays_found_in_march++;
            if (sundays_found_in_march == 2) then
              second_sunday_of_march = first_second_of_this_day_epoc;
          );
    
          # Count Sundays in November; the first Sunday is the ending point of DST
          if ((monthname eq "Nov") and (weekday eq "Su")) then (
            sundays_found_in_november++;
            if (sundays_found_in_november == 1) then
              first_sunday_of_november = first_second_of_this_day_epoc;
          );
    
        ); # if valid date_time
    
        # Go to the next day
        first_second_of_this_day_epoc += 24*60*60;
        iterations++;
    
      ); # while haven't found start and end
    
      # Compute the first and last second of the DST range, in date_time format
      string first_second_of_second_sunday_of_march = epoc_to_date_time(second_sunday_of_march);
      string last_second_of_first_sunday_of_november = substr(epoc_to_date_time(first_sunday_of_november), 0, 11) . ' 23:59:59';
    
      # Remember the endpoints of the range in the node for this year
      set_subnode_value(dst_info_year, 'start_date_time_epoc', date_time_to_epoc(first_second_of_second_sunday_of_march));
      set_subnode_value(dst_info_year, 'end_date_time_epoc', date_time_to_epoc(last_second_of_first_sunday_of_november));
    
    ); # if this year not computed
    
    # Get the endpoints of the range from this year's subnode
    int start_date_time_epoc = node_value(subnode_by_name(dst_info_year, 'start_date_time_epoc'));
    int end_date_time_epoc = node_value(subnode_by_name(dst_info_year, 'end_date_time_epoc'));
    
    # If this date is within the DST range, subtract an hour
    if ((date_time_to_epoc(date_time) >= start_date_time_epoc) and (date_time_to_epoc(date_time) <= end_date_time_epoc)) then
      date_time = epoc_to_date_time(date_time_to_epoc(date_time) - 60*60);

  ` # <- closing quote of log.filter.dst_adjust

This filter is not optimally fast--it would be better to move the variable declarations to filter_initialization, since they take some time. But it should be reasonably fast written this way (most of the code runs only once per log data year), and it is easier to read.
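For instance, the declarations could be moved into log.filter_initialization along these lines (a sketch of the suggestion above, assuming variables declared in filter_initialization remain visible to the log filters; the corresponding declarations would then be removed from the filter body):

  filter_initialization = `
    include 'templates.shared.util.date_time.get_weekday';
    include 'templates.shared.util.date_time.get_month_as_number';

    # Declare work variables once here, rather than on every log entry
    int year;
    int first_second_of_this_day_epoc;
    int second_sunday_of_march;
    int first_sunday_of_november;
    int sundays_found_in_march;
    int sundays_found_in_november;
    int iterations;
    int start_date_time_epoc;
    int end_date_time_epoc;
  `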

Advanced Example: Rewriting usernames to full names using a CFG file

If your log data contains usernames like bob and jill, and you want to list their full names in reports instead, you can do so by creating a CFG file which maps usernames to full names, and referring to it from the log filter. For instance, you could call the file usernames_to_full_names.cfg, and put it in LogAnalysisInfo, and it could contain this:

  usernames_to_full_names = {
    bob = "Bob Smith"
    jill = "Jill Teti"
    george = "George Jones"
  } # usernames_to_full_names

The first line must match the filename (without the .cfg). The final line ends with a # followed by a comment; this is optional, but it is a good idea to use it so you always know which opening curly bracket each closing curly bracket matches.

This file can be as long as you wish; it can have one entry, or millions.

Now the log filter can be this:

  if (subnode_exists("usernames_to_full_names", username)) then
    username = node_value(subnode_by_name("usernames_to_full_names", username));

This filter uses subnode_exists() to check if there is an entry in usernames_to_full_names for the current username (the value of the "username" log field; this assumes that the field is called "username"). If there is, it uses subnode_by_name() to get the node corresponding to that user, in usernames_to_full_names, and then uses node_value() to get the value of the node (the full name). It assigns that value to the username field, overwriting it with the full name.

Advanced Example: Adding external metadata to a profile

An extension of the above technique can be used to integrate external metadata with the log data in reports. For example, if you have a database which maps usernames to full names, departments, and states, and if you can convert that database to CFG format, then you can create a profile in Sawmill which reads log data containing only usernames, but reports (and aggregates data by) full names, departments, and states. The CFG file might be called user_records.cfg (in LogAnalysisInfo), and could look like this:

  user_records = {
    bob = {
      full_name = "Bob Smith"
      department = "Mathematics"
      state = "Texas"
    } # bob
    jill = {
      full_name = "Jill Teti"
      department = "Art"
      state = "Ohio"
    } # jill
    george = {
      full_name = "George Jones"
      department = "Literature"
      state = "Ohio"
    } # george
  } # user_records

And the log filter could look like this:

  if (subnode_exists("user_records", username)) then (
    node user_record = subnode_by_name("user_records", username);
    full_name = node_value(subnode_by_name(user_record, "full_name"));
    department = node_value(subnode_by_name(user_record, "department"));
    state = node_value(subnode_by_name(user_record, "state"));
  );

This filter uses subnode_exists() to check for the existence of a user record for the username; if it exists, it uses subnode_by_name() to get a pointer to that record (node). Then it uses subnode_by_name() and node_value() on the user record node, to extract the full_name, department, and state values. It puts them in custom fields (full_name, department, and state), so these fields must also be created (see Creating Custom Fields).

This technique is extremely powerful; it can be used to effectively "join" log data to an existing database table, by exporting that table to a CFG file and then looking up values from that table using a common key found both in the table and the log data.

This method is quite efficient; it does not iterate through the entire list, but looks it up using a "binary search" approach which is usually fast enough that the performance of the lookup is not a significant factor in database build time.

Advanced Example: Tracking new sessions in a web log

This is an example which uses an in-memory configuration node to store a list of all IP addresses seen so far, so when a new one is encountered, the filter can set a custom new_sessions field to 1. This causes the new_sessions field to be 1 for all new-IP events, and 0 otherwise, so date/time reports will show how many new visitors are coming each year/month/day/hour/etc.

  v.c_ip = replace_all(c_ip, ".", "_");
  if (!node_exists("previously_seen_ips")) then
    previously_seen_ips = "";
  if (!subnode_exists("previously_seen_ips", v.c_ip)) then (
    new_sessions = 1;
    set_subnode_value("previously_seen_ips", v.c_ip, true);
  );

The first line uses replace_all() to compute a variable v.c_ip from the c_ip field, replacing dots (.) with underbars (_). This is necessary because configuration nodes use dots as the separator, so dots cannot be used in configuration node names; we're about to add nodes with names like previously_seen_ips.{c_ip}, so we need to make sure c_ip does not contain dots.

The next two lines use node_exists() to check whether the node previously_seen_ips exists; if it doesn't, they create it (assigning a value to an undefined node defines it). Without this step, the next line, which checks for subnodes of previously_seen_ips, would fail with an error that previously_seen_ips does not exist.

The next part uses subnode_exists() to check if there is a subnode of previously_seen_ips which matches the c_ip field (with dots converted to underbars). For instance, if the log data contains 12.34.56.78 as the IP, this will check for the existence of a node previously_seen_ips.12_34_56_78. If this node exists, then we know that this IP has appeared previously, and this is not a new session. If this node does not exist, then we set new_sessions to 1, so this event will appear as a new session event in the reports, and create the node using set_subnode_value().

The code above uses an in-memory node for previously_seen_ips, so it won't remember which IPs it has seen for the next database update; that's fine as long as you rebuild, but if you want to update you'll want to save the node to disk, and reload it for each update. This can be done using the log.filter_finalization node in the profile (or plug-in), which specifies code to run at the end of log processing. The following example saves previously_seen_ips to disk:

  log.filter_finalization = `save_node('previously_seen_ips')`

The node will be saved to the file previously_seen_ips.cfg, in the LogAnalysisInfo directory. Since nodes are automatically loaded from disk when they are accessed, future updates will automatically load this node into memory when the log filter first accesses it, so there is no need for a "load_node()" operation.

Advanced Example: Rejecting spiders based on JS and /robots.txt access

NOTE: This example has been packaged up conveniently in the "Two-Stage Spider Analysis" snapon. It is included here as an example, but there is no need to implement it manually; just attach "Two-Stage Spider Analysis".

Sawmill detects spiders automatically based on the spiders.cfg file in LogAnalysisInfo, but not all spiders are in that file, and some spiders deliberately hide themselves (they do not declare themselves as spiders in their user-agent field). This section describes a more sophisticated way of identifying spiders, based on whether they hit JavaScript pages or /robots.txt.

The /robots.txt file is a file which describes what a spider may do on the site; all well-behaved spiders must access this file before accessing anything else. Therefore, a straightforward way of rejecting spiders is to collect a list of all accesses to /robots.txt, and to reject all IPs which accessed /robots.txt. But that assumes the spider is well-behaved, which may not be the case. So another level of detection is useful, if your site uses JavaScript or CSS: look for all visitors who never accessed a JavaScript or CSS file (a .js or .css file). Spiders typically do not access JavaScript or CSS files, so if you see an IP which accessed the site but never accessed any .js or .css file, it is likely a spider (if your site uses JavaScript or CSS!). The following algorithm rejects all hits from all IPs which either never hit a JS or CSS file, or did hit the /robots.txt file.

First, add filter initialization and finalization to the profile, to log.filter_initialization and log.filter_finalization:

  # Initialize the filters
  filter_initialization = `
    spider_rejection.accessed_js_css_file = '';
    spider_rejection.accessed_robots_txt = '';
  `

  # Finalize the filters
  filter_finalization = `
    save_node('spider_rejection.accessed_js_css_file');
    save_node('spider_rejection.accessed_robots_txt');
  `

The initialization code makes sure the spider_rejection.accessed_js_css_file and spider_rejection.accessed_robots_txt nodes exist. Note: you will need to create a directory called spider_rejection, in the LogAnalysisInfo directory, to hold these nodes. The finalization step saves these nodes, to LogAnalysisInfo/spider_rejection/accessed_js_css_file.cfg and LogAnalysisInfo/spider_rejection/accessed_robots_txt.cfg.

Now add a filter (to log.filters in the profile .cfg file) to detect CSS/JS hits, and to reject spiders.

  reject_spiders = `

    # Convert . to _ in the IP
    v.converted_ip = replace_all(c_ip, '.', '_');

    # If this is a JS or CSS file, remember that this IP accessed it.
    if ((file_type eq 'JS') or (file_type eq 'CSS')) then (
      set_node_value(subnode_by_name('spider_rejection.accessed_js_css_file', v.converted_ip), true);
    );

    # If this is /robots.txt, remember that this IP accessed it.
    if (cs_uri_stem eq '/robots.txt') then (
      set_node_value(subnode_by_name('spider_rejection.accessed_robots_txt', v.converted_ip), true);
    );

    # Reject as a spider any hit from an IP which did not access JS/CSS files, or did access /robots.txt
    if (subnode_exists('spider_rejection.accessed_robots_txt', v.converted_ip) or
        !subnode_exists('spider_rejection.accessed_js_css_file', v.converted_ip)) then (
      'reject';
    );

  `

This filter does the following:

  1. It creates a variable v.converted_ip and sets it to the IP address (which is called c_ip in this IIS profile; it may be called something else for other log formats), with dots converted to underbars. This is necessary because node names cannot contain dots.
  2. It checks if the hit is a JS or CSS hit; if it is, it sets spider_rejection.accessed_js_css_file.ip to true, where ip is the converted IP address.
  3. It checks if the hit is on /robots.txt (again using IIS field names; this would be called "page" for Apache); if it is, it sets spider_rejection.accessed_robots_txt.ip to true, where ip is the converted IP address.
  4. It checks if the current IP address is in the accessed_robots_txt list, or is not in the accessed_js_css_file list; if so, it rejects the hit.

The first time a dataset is processed with these log filters, this will cause the accessed_robots_txt and accessed_js_css_file lists to be populated. The data then must be processed a second time, since it is not possible to know, the first time an IP address is encountered, whether that IP will ever hit a JS file. The second time the data is processed, all of it has been processed once and the lists are complete, so the spider rejection will work properly. So the database needs to be built twice to remove all spiders.