Creating Log Format Plug-ins (Custom Log Formats)


Sawmill supports most common log formats, but if you're using a log format which is not recognized by Sawmill, you can create a custom log format plug-in for it. Creating a custom log format is a fairly technical task in most cases, and it may prove challenging for a novice user. A technical user, especially one with programming and regular expression experience, should be able to create a plug-in by following the instructions below. It is particularly important to be at least familiar with regular expressions; several aspects of log format plug-ins involve creating regular expressions to match the log data.

The log formats supported by Sawmill are described in small text files found in the log_formats subdirectory of the LogAnalysisInfo directory. Each file in the log_formats directory defines a separate log format. To add a new log format to the list of formats supported by Sawmill, you need to create your own format file and put it in the log_formats directory.

To simplify the process, use the existing format files as templates. Look over a few of the built-in formats before you start creating your own. Most of the files look similar, and you will probably want to copy one of them and modify it to create your own format. However, as you compare the existing plug-ins to the examples below, you will often find that the format is not quite the same; in particular, you may find that the existing plug-ins are much more verbose than the syntax described below. You will also notice differences in the log.fields, database.fields, and parsing filters, which in some plug-ins have many unnecessary and redundant parameters. We have been improving the syntax over the past few years to make it easier to create plug-ins, but many of the old plug-ins have not been updated to the new syntax. The old syntax is still supported, but you should use the simpler syntax, as shown below, when creating new plug-ins.

If you would like us to create a plug-in for you, submit your request to us at support@sawmill.net. If the format is publicly available, support may be provided at no charge; if so, it will be put into our log format queue. There is a continuous demand for new log file formats, so there may be a significant delay before the plug-in is complete. For a faster response, you may pay us to create the plug-in, which we can usually turn around quickly. All plug-ins we create are the property of Flowerfire, Inc., and we retain all rights, including unlimited distribution rights. Plug-ins we create are typically included in future versions of Sawmill.

If you have created a plug-in yourself, we encourage you to contribute it to our plug-in library, for inclusion in future versions of Sawmill. All contributed plug-ins become property of Flowerfire.

Steps to Create a Log Format Plug-in

The easiest way to create a plug-in is to start from a template, and edit each of the sections in order. Log format plug-ins contain the following major sections:

  1. The log format name and label
  2. The autodetection regular expression
  3. Parsing the log data
  4. Add the log fields
  5. Add the database fields
  6. Add the numerical database fields
  7. Add the log filters
  8. Add the database field associations
  9. Add the report groups (automatic or manual approach)
  10. Debug the plug-in

Additional topics:

  1. Creating plug-ins for syslog servers or syslogging devices

The Log Format Name and Label

The log format name and the label are closely related. The log format label is the "human readable" name of the format; for instance, it might be "Common Access Log Format." The log format name is the "internal" version of that name, used internally to refer to the format; it is typically a simpler version of the label, containing only letters, numbers, and underscores. For instance, if the format label is "Common Access Log Format," the name might be common_access. The name appears on the first line of the file, and the label appears in the log.format.format_label line of the file:

common_access = {

  # The name of the log format
  log.format.format_label = "Common Access Log Format"
  ...

The filename of the plug-in is also the name, with a ".cfg" extension. So the first thing you need to do when you create a plug-in is decide on the label and name, so you can name the file appropriately. Then copy another plug-in from the log_formats directory, naming the copy after your new plug-in. For instance, you might copy the common_access.cfg file, and rename the copy to my_log_format.cfg.
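For instance, on a Unix-like command line, the copy might look like this (paths are relative to your Sawmill installation; adjust them as needed):

  cd LogAnalysisInfo/log_formats
  cp common_access.cfg my_log_format.cfg

Then open my_log_format.cfg and change the name on its first line (common_access = { becomes my_log_format = {) and its log.format.format_label value.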

The Autodetection Regular Expression

The autodetection regular expression is used by the Create Profile Wizard to determine if a particular log format plug-in matches the log data being analyzed. The expression might look something like this:

  log.format.autodetect_regular_expression =
    "^[0-9]+/[0-9]+/[0-9]+ [0-9]+:[0-9]+:[0-9]+,[0-9.]+,[^,]*,[^,]*,[^,]*"

The regular expression specified here will be matched against the first few lines of log data (or against more lines, if you specify a larger number in the log.format.autodetect_lines option). If any of those lines matches this expression, this log format will be listed as a matching format in the Create Profile Wizard.
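If the line which identifies the format may not appear near the top of the file, you can tell Sawmill to check more lines; the value 50 here is just an illustration:

  log.format.autodetect_lines = 50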

When creating a log format plug-in, you'll typically want to set up this expression first, before you do anything else (other than create the plug-in file and choose the name and label). Once you've created the plug-in file, and set the expression, you can do the first testing of the plug-in by starting to create a new profile in the Create Profile Wizard, and seeing if the plug-in is listed as a matching format. If it's not listed, then the autodetect expression isn't right. It is best to get the autodetection working before you continue to the next step.

Because the expression appears in double-quotes (") in the plug-in, you need to escape any literal double-quotes you use in the regular expression, by entering them as \" . Because \ is used as an escape character, you also need to escape any \ you use as \\. For instance, if you want to match a literal [, that would be \[ in the regular expression, so you should escape the \ and enter it as \\[ in the quoted string.
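For instance, to autodetect a (hypothetical) format whose lines begin with a bracketed timestamp like [12/Jan/2005 00:00:00], the literal brackets would be \[ and \] in the regular expression, and would be entered like this in the plug-in:

  log.format.autodetect_regular_expression =
    "^\\[[0-9]+/[A-Z][a-z]+/[0-9]+ [0-9:]+\\]"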

Parsing the Log Data: Parsing with the Parsing Regular Expression

There are three ways of extracting field values from the log data: a parsing regular expression, a delimited approach (a.k.a., "index/subindex"), and parsing filters. Only one of these three approaches is typically used in a plug-in. This section describes parsing using a parsing regular expression. For a description of parsing with delimiters, see "Parsing With Delimited Fields". For a description of parsing with parsing filters, see "Parsing With Parsing Filters: Lines With Variable Layout" and "Parsing With Parsing Filters: Events Spanning Multiple Lines".

The parsing regular expression is matched against each line of log data, as the data is processed. It is easily confused with the autodetection regular expression, but the purposes of the two are very different. The autodetection regular expression is used only during the Create Profile Wizard, to show a list of formats which match the log data. The parsing regular expression is not used at all during profile creation; it is used only when log data is processed (e.g., during a database build), to extract field values from each line of the log data.

The expression should include subexpressions in parentheses to specify the field values. The subexpressions will be extracted from each matching line of data, and put into the log fields. For instance, this expression:

  log.format.parsing_regular_expression =
    "^([0-9]+/[0-9]+/[0-9]+) ([0-9]+:[0-9]+:[0-9]+),([0-9.]+),([^,]*),([^,]*),([^,]*)"

extracts a date field (three integers separated by slashes), then after a space, a time field (three integers separated by colons), then a series of four comma-separated fields; the first of the comma-separated fields is an IP address (any series of digits and dots), and the rest of the fields can be anything, as long as they don't contain commas.

Notice how similar the layout of the parsing regular expression is to the autodetection regular expression. It is often possible to derive the parsing regular expression simply by copying the autodetection expression and adding parentheses.

As each line of log data is processed, it is matched against this expression; if the expression matches, the log fields are populated from the subexpressions, in order. So in this case, the log fields must be date, time, ip_address, fieldx, fieldy, and duration, in that order.

If the parsing regular expression is the one listed above, and this line of log data is processed:

  2005/03/22 01:23:45,127.0.0.1,A,B,30

Then the log fields will be populated as follows:

  date: 2005/03/22
  time: 01:23:45
  ip_address: 127.0.0.1
  fieldx: A
  fieldy: B
  duration: 30

Parsing With Delimited Fields

If your log data is extremely simple in its layout, you can use a faster parsing method, sometimes called delimited parsing or "index/subindex" parsing. In order to use this method, the log data must have the following characteristics:

  1. Every line has the same layout, with the same fields appearing in the same order.
  2. All fields are separated from each other by the same delimiter character.

If the log data matches these criteria, you can use delimited parsing. The example we've been using above:

  2005/03/22 01:23:45,127.0.0.1,A,B,30

meets the first condition, but not the second, because the date and time fields are separated from each other by a space, but other fields are separated from each other by commas. But let's suppose the format was this:

  2005/03/22,01:23:45,127.0.0.1,A,B,30

This is now a format which can be handled by delimited parsing. To use delimited parsing, you must omit the log.format.parsing_regular_expression option from the plug-in; if that option is present, regular expression parsing will be used instead. You must also specify the delimiter:

  log.format.field_separator = ","

If the delimiter (the comma, in this case) is not specified, whitespace will delimit fields: any space or tab will be considered the end of one field and the beginning of the next.

When using delimited parsing, you also need to set the "index" value of each log field. The index value tells the parser where that field appears in the line. Contrast this with regular expression parsing, where the position of the field in the log.fields list specifies its position in the line; delimited parsing pays no attention to where a field appears in the log.fields list, and populates fields based solely on their index (and subindex; see below). So for the comma-separated format described above, the log fields would look like this:

  log.fields = {
    date = {
      index = 1
    }
    time = {
      index = 2
    }
    ip_address = {
      type = "host"
      index = 3
    }
    fieldx = {
      index = 4
    }
    fieldy = {
      index = 5
    }
    duration = {
      index = 6
    }
  }

For brevity, this is usually represented with the following equivalent syntax; any group G which has only one parameter p, with value v, can be represented as "G.p = v":

  log.fields = {
    date.index = 1
    time.index = 2
    ip_address = {
      type = "host"
      index = 3
    }
    fieldx.index = 4
    fieldy.index = 5
    duration.index = 6
  }

For instance, fieldx will be populated from the fourth comma-separated field, since its index is 4.

When using whitespace as the delimiter, quotation marks can be used to allow whitespace in fields. For instance, this line:

  2005/03/22 01:23:45 127.0.0.1 A "B C" 30

would extract the value "B C" into the fieldy field, even though there is a space between B and C, because the quotes set it off as a single field value. This is where the "subindex" parameter comes in: subindex tells the parser to split the quoted field one level deeper, splitting B and C apart into subindex 1 and subindex 2. So for the line above, if the fieldy field also had a "subindex = 1" parameter, fieldy would be set to "B"; if instead it had "subindex = 2", fieldy would be set to "C". If it has no subindex, or if subindex is 0, it will be set to the whole quoted field, "B C". This is used in Common Log Format (a web log format) to split apart the quoted request field:

 ... [timestamp] "GET /some_page.html HTTP/1.0" 200 123 ...

In this case, the operation (GET) is extracted as subindex 1 of the quoted field, the page (/some_page.html) as subindex 2, and the protocol (HTTP/1.0) as subindex 3.
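For instance (as a sketch only; the index value and field names here are illustrative, not necessarily those used by the real common_access.cfg plug-in), the three parts of the quoted request field could be extracted into three separate log fields like this:

  log.fields = {
    ...
    # the quoted request field, e.g. "GET /some_page.html HTTP/1.0"
    operation = {
      index = 6
      subindex = 1
    }
    page = {
      index = 6
      subindex = 2
      type = "page"
    }
    protocol = {
      index = 6
      subindex = 3
    }
    ...
  } # log.fields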

Parsing With Parsing Filters: Lines With Variable Layout

Delimited parsing and regular expression parsing are both limited to log formats where each line has the same layout. But some log formats have a variable layout: some lines have one format, and some have another. Consider this bit of firewall log data:

  2005/03/22,01:23:45,ICMP,127.0.0.1,12.34.56.78
  2005/03/22,01:23:46,TCP,127.0.0.1,3416,23.45.67.78,80

This firewall log shows two events. The first is an ICMP packet sent from 127.0.0.1 to 12.34.56.78; the second is a TCP packet sent from port 3416 of 127.0.0.1 to port 80 of 23.45.67.78. Because ICMP does not use ports, the first line is missing the port numbers which appear in the second line. This means that the format of the second line is different from the format of the first line, and the index positions of the fields vary from line to line; on the first line, the destination_ip field is at index 5, and on the second line, that same field is at index 6. This makes it impossible to parse this log format using delimited parsing. Similarly, it is difficult to create a single regular expression which can handle both layouts (it is possible here, by making the port fields and one of the commas optional, but for the sake of the example, let's assume there is no regular expression which can handle both). Certainly, there are some formats whose line layouts are so different that a single regular expression cannot handle all of them.

The solution is to use parsing filters. Parsing filters are code, written in Salang: The Sawmill Language, which extracts data from the log line and puts it into the log fields. Writing parsing filters is similar to writing a script or a program; it helps to have some programming experience, but simple parsing filters can be written by anyone.

Here is an example of a parsing filter which can parse this format:

  log.parsing_filters.parse = `
if (matches_regular_expression(current_log_line(),
      '^([0-9/]+),([0-9:]+),([A-Z]+),([0-9.]+),([0-9.]+)$')) then (
  date = $1;
  time = $2;
  protocol = $3;
  source_ip = $4;
  destination_ip = $5;
)
else if (matches_regular_expression(current_log_line(),
   '^([0-9/]+),([0-9:]+),([A-Z]+),([0-9.]+),([0-9]+),([0-9.]+),([0-9]+)$')) then (
  date = $1;
  time = $2;
  protocol = $3;
  source_ip = $4;
  source_port = $5;
  destination_ip = $6;
  destination_port = $7;
)
`

Note the backtick quotes (`) surrounding the entire expression; a log parsing filter is a single string value within backtick quotes.

This filter uses the matches_regular_expression() function to match the current line of log data (returned by the current_log_line() function) against the regular expression in single quotes. The first regular expression checks for the ICMP line layout, with just two IP addresses; the second checks for the TCP/UDP layout, with ports. If the current line matches the ICMP layout, this filter populates the date, time, protocol, source_ip, and destination_ip fields. It does this using the $N variables, which are automatically set by matches_regular_expression() when it matches; $1 is the first parenthesized subexpression (date), $2 is the second (time), and so on. This parsing filter handles either line layout, and populates the appropriate fields from each line.

Backslash escaping is compounded in parsing filters, compared to the parsing regular expression. You may recall that a literal backslash in the parsing regular expression had to be represented as \\, because the expression is inside double quotes. Parsing filters are within nested quotes: typically, the entire parsing filter is in backtick (`) quotes, and the regular expression is within that, in single or double quotes. Because of this, if you want to use a backslash to escape something in a regular expression, it has to be quadrupled; e.g., you need to use \\\\[ to get \[ (a literal left square bracket) in the regular expression. The final extreme is a literal backslash in a regular expression: that is represented as \\ in regular expression syntax, and each of those backslashes has to be quadrupled, so to match a literal backslash in a regular expression in a parsing filter, you need to use \\\\\\\\. For instance, to match this line:

  [12/Jan/2005 00:00:00] Group\User

You would need to use this parsing filter:

  if (matches_regular_expression(current_log_line(),
     '^\\\\[([0-9]+/[A-Z][a-z]+/[0-9]+) ([0-9:]+)\\\\] ([^\\\\]+)\\\\\\\\(.*)')) then
    ... 

Parsing With Parsing Filters: Events Spanning Multiple Lines

There is one major limitation to all of the parsing methods discussed so far: they assume that there is one event per line. However, many log formats split events across lines, so one field of an event may appear on the first line, and other fields may appear on later lines. To parse this type of log, you need to use the collect/accept method with parsing filters. In this method, fields are "collected" into virtual log entries, and when all fields have been collected, the virtual log entry is "accepted" into the database.

Consider the following mail log:

  2005/03/22 01:23:45: connection from 12.34.56.78; sessionid=<156>
  2005/03/22 01:23:46: <156> sender: <bob@somewhere.com>
  2005/03/22 01:23:46: <156> recipient: <sue@elsewhere.com>
  2005/03/22 01:23:50: <156> mail delivered

This represents a single event: the mail client at 12.34.56.78 connected to the server, and sent mail from bob@somewhere.com to sue@elsewhere.com. Note the session ID <156>, which appears on every line; log data where events span lines almost always has this type of "key" field, so you can tell which lines belong together. This is essential, because otherwise the information from two simultaneous connections, with their lines interleaved, could not be separated.

A parsing filter to handle this format could look like this:

  log.parsing_filters.parse = `
if (matches_regular_expression(current_log_line(),
      '^[0-9/]+ [0-9:]+: connection from ([0-9.]+); sessionid=<([0-9]+)>')) then
  set_collected_field($2, 'source_ip', $1);
else if (matches_regular_expression(current_log_line(),
           '^[0-9/]+ [0-9:]+: <([0-9]+)> sender: <([^>]*)>')) then
  set_collected_field($1, 'sender', $2);
else if (matches_regular_expression(current_log_line(),
           '^[0-9/]+ [0-9:]+: <([0-9]+)> recipient: <([^>]*)>')) then
  set_collected_field($1, 'recipient', $2);
else if (matches_regular_expression(current_log_line(),
           '^([0-9/]+) ([0-9:]+): <([0-9]+)> mail delivered')) then (
  set_collected_field($3, 'date', $1);
  set_collected_field($3, 'time', $2);
  accept_collected_entry($3, false);
)
`

This works similarly to the variable-layout example above, in that it checks the current line of log data against four regular expressions to determine which of the line types it is. But instead of simply assigning 12.34.56.78 to source_ip, it uses the function set_collected_field(), with the session ID as the "key" parameter, to set the source_ip field of the collected entry with key 156 to 12.34.56.78. Effectively, there is now a virtual log entry, which can be referenced by the key 156, and which has a source_ip field of 12.34.56.78. If another interleaved connection occurred at the same time, a second "connection from" line could be next, with a different key (because it's a different connection), and that would result in another virtual log entry, with a different key and a different source_ip field. This allows both events to be built up, a field at a time, without any conflict between them.

Nothing is added to the database when a "connection from" line is seen; it just files away the source_ip for later use, in log entry 156 (contrast this to all other parsing methods discussed so far, where every line results in an entry added to the database). Now it continues to the next line, which is the "sender" line. The "connection from" regular expression doesn't match, so it checks the "sender" regular expression, which does match; so it sets the value of the sender field to bob@somewhere.com, for log entry 156. On the next line, the third regular expression matches, and it sets the value of the recipient field to sue@elsewhere.com, for log entry 156. On the fourth line, the fourth regular expression matches, and it sets the date and time fields for log entry 156.

At this point, log entry 156 looks like this:

  source_ip: 12.34.56.78
  sender: bob@somewhere.com
  recipient: sue@elsewhere.com
  date: 2005/03/22
  time: 01:23:50

We have now populated all the fields, so it's time to put this entry in the database. That's what accept_collected_entry() does: it puts entry 156 into the database. From there, things proceed just as they would have if this had been a log format with all five fields on one line, extracted with a regular expression or with delimited parsing. So by using five "collect" operations, we have effectively put together a single virtual log entry which can now be put into the database in the usual way.

In order to use parsing filters with accept/collect, you must also set this option in the plug-in:

  log.format.parse_only_with_filters = "true"

If you don't include this line, the parser will use delimited or regular expression parsing first, and then run the parsing filters; typically, when using collect/accept, you want the parsing filters alone to extract the data from each line.

The parsing filter language is a fully general language; it supports variables, nested if/then/else constructs, loops, subroutines, recursion, and anything else you would expect to find in a language. Therefore there is no limit to what you can do with parsing filters; any log format can be parsed with a properly constructed parsing filter.

The Log Fields

The log fields are variables which are populated during parsing (by the parsing regular expression, delimited parsing, or parsing filters); they hold the values which are extracted from the log data. They are used later as the main variables manipulated by log filters, and if a log entry is not rejected by the log filters, the log field values are copied to the database fields and inserted into the database.

The log fields are listed in the log.fields section of the profile. For the example above, the log fields would look like this:

  log.fields = {

    date = ""
    time = ""
    ip_address = ""
    fieldx = ""
    fieldy = ""
    duration = ""

  } # log.fields

The name of the date field must be "date"; the name of the time field must be "time."

The log.format.date_format and log.format.time_format options describe the format of the date and time fields in the log data. For instance, a value of dd/mmm/yyyy for log.format.date_format means that the date will look something like this: 01/Jan/2006. If these options are omitted from the plug-in, they will default to "auto", which will do its best to determine the format automatically. This almost always works for time, but some date formats cannot be automatically determined; for instance, there is no way to guess whether 3/4/2005 means March 4, 2005 or April 3, 2005 (in this case, auto will assume the month is first). In cases where "auto" cannot determine the format, it is necessary to specify the format in the log.format.date_format option in the plug-in. Available formats are listed in Date format and Time format.
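For instance, if the dates in your log look like 01/Jan/2006 and you want to state the format explicitly rather than rely on automatic detection, you could set the date format and leave time detection automatic:

  log.format.date_format = "dd/mmm/yyyy"
  log.format.time_format = "auto"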

Usually, the date and time fields are listed separately, but sometimes they cannot be separated. For instance, if the date/time format is "seconds since January 1, 1970" (a fairly common format), then the date and time information is integrated into a single integer, and the date and time fields cannot be extracted separately. In this case, you will need to use a single date_time log field:

  log.fields = {

    date_time = ""
    ip_address = ""
    ...

  } # log.fields

and both the log.format.date_format option and the log.format.time_format options should be set to the same value:

  log.format.date_format = "seconds_since_jan1_1970"
  log.format.time_format = "seconds_since_jan1_1970"

Fields which are listed without any parameters, like the ones above, will have default values assigned for their various parameters. Most importantly, the label of the field will be set to "$lang_stats.field_labels.fieldname", where fieldname is the name of the field; for instance, the label of the fieldx field will be set to "$lang_stats.field_labels.fieldx". This allows you to create plug-ins which are easily translated into other languages, but it also means that when you create a plug-in with no label values specified, you also need to edit the file LogAnalysisInfo/languages/english/lang_stats.cfg to add field labels for any new fields you have created. That file contains a large section called field_labels; add your own values to that section. For instance, in the case above, you would need to add these:

  fieldx = "field x"
  fieldy = "field y"

The other fields (date, time, ip_address, and duration) are already in the standard field_labels list. The standard list is very large, so you may find that all your fields are already there. If they aren't, you either need to add them, or explicitly override the default label by defining the log field like this:

  log.fields = {
    ...
    fieldx = {
      label = "field x"
    }
    ...
  } # log.fields

This extended syntax for the log field specifies the label in the plug-in, which means you don't need to edit lang_stats.cfg, but also means that the plug-in will not be translated into other languages. Plug-ins created for distribution with Sawmill should always use default field labels, and the field names should always be added to lang_stats, to allow for localization (translation to the local language).

The Database Fields

Sawmill stores information primarily in a single large database table called the Main Table. This table has one row per accepted log entry (which usually means one row per line of log data), and one column per database field. The database.fields section of the plug-in lists these database fields.

The list of database fields is often similar to the list of log fields, but they aren't usually quite the same. Log fields list the fields which are actually present in the log data; database fields list the fields which are put into the database. These lists may differ because: 1) some of the log fields are not interesting, and are therefore not tracked in the database; 2) a database field is based on a derived field (see below), which is not actually in the log data; or 3) a log field is numerical, and needs to be aggregated using a numerical database field (see the next section).

Derived fields are "virtual" log fields which are not actually in the log data, but which are computed from fields which are in the log data. The following derived fields are available:

  date and time (both fields) -> date_time
    date_time is a field of the format dd/mmm/yyyy hh:mm:ss, e.g., 12/Feb/2006 12:34:50, which is computed from the date and time values in the log data.

  date_time -> hour_of_day
    The hour of the day, e.g., 2AM-3AM.

  date_time -> day_of_week
    The day of the week, e.g., Tuesday.

  date_time -> day_of_year
    The day of the year, e.g., 1 for January 1, through 365 for December 31.

  date_time -> week_of_year
    The week of the year, e.g., 1 for January 1 through January 8.

  host -> domain_description
    A description of the domain for the host, e.g., "Commercial" for .com addresses. See the "host" note below.

  host -> location
    The geographic location of the IP address, computed by GeoIP database. See the "host" note below.

  agent -> web_browser
    The web browser type, e.g., "Internet Explorer/6.0". See the "agent" note below.

  agent -> operating_system
    The operating system, e.g., "Windows 2003". See the "agent" note below.

  agent -> spider
    The spider name, or "(not a spider)" if it's not a spider. See the "agent" note below.

  page -> file_type
    The file type, e.g., "GIF". See the "page" note below.

  page -> worm
    The worm name, or "(not a worm)" if it's not a worm. See the "page" note below.

Note: "Host" derived fields are not necessarily derived from fields called "host"--it can be called anything. These fields are derived from the log field whose "type" parameter is "host". So to derive a "location" field from the ip_address field in the example above, the log field would have to look like this:

  ip_address = {
    type = "host"
  } # ip_address

You will sometimes see this shortened to this equivalent syntax:

  ip_address.type = "host"

Note: "Agent" derived fields are similar to "host", the "agent" field can have any name as long as the type is "agent". However, if you name the field "agent", it will automatically have type "agent", so it is not necessary to list it explicitly unless you use a field name other than "agent".

Note: "Page" derived fields: similar to "host", the "page" field can have any name as long as the type is "page". However, if you name the field "page" or "url", it will automatically have type "page", so it is not necessary to list it explicitly unless you use a field name other than "page" or "url".

Never include a date or time field in the database fields list! The database field should always be the date_time field, even if the log fields are separate date and time fields.

When creating the database fields list, it is often convenient to start from the log fields list. Then remove any log fields you don't want to track, add any derived fields you do want to track, and remove any numerical fields (like bandwidth, duration, or other "counting" fields), which will be tracked in the numerical fields section (next section). For the example above, a reasonable set of database fields is:

  database.fields = {

    date_time = ""
    ip_address = ""
    location = ""
    fieldx = ""
    fieldy = ""

  } # database.fields

Numerical Database Fields

Normal database fields cannot be summed or aggregated numerically; values like "Monday" and "Tuesday" cannot be combined into an aggregate value, nor can values like "/index.html" and "/pricing.html". Fields like these are tracked by normal, non-numerical database fields (previous section). However, some fields lend themselves to aggregation: if one log entry has a bytes field with a value of 5, and another has a bytes value of 6, it is reasonable to add them to get a total of 11 bytes. Similarly, if you have a 30-second duration and a 15-second duration, you can total them to compute a total duration of 45 seconds. Log fields which can be summed to get total values are listed in the database.numerical_fields section of the plug-in.

For the example above, the numerical fields could be this:

  database.numerical_fields = {

    events = {
      label = "$lang_stats.field_labels.events"
      default = true
      requires_log_field = false
      type = "int"
      display_format_type = "integer"
      entries_field = true
    } # events

    duration = {
      label = "$lang_stats.field_labels.duration"
      default = true
      requires_log_field = true
      type = "int"
      display_format_type = "duration_compact"
    } # duration

  } # database.numerical_fields

This example lists two numerical fields, "events" and "duration". The duration field comes straight from the log data; it is just tracking and summing the duration values in the log data. The "events" field is a special field whose purpose is to count the total number of events, which typically means the total number of lines of log data; see below for details.

Most log formats have an "events" field which counts the number of events; its name varies by log format (for instance, web access log plug-ins typically call it "hits"). The name should be the best name for one "event" in the log; when in doubt, just use "events". This field functions like any other numerical field, but its value is set to 1 for every line, using a log filter. Therefore, almost all plug-ins have at least one log filter (see below for more about log filters):

  log.filters = {

    mark_entry = {
      label = '$lang_admin.log_filters.mark_entry_label'
      comment = '$lang_admin.log_filters.mark_entry_comment'
      value = 'events = 1'
    } # mark_entry

  } # log.filters

This log filter has a label and a comment so it will appear nicely in the log filter editor, but the real value of the filter is 'events = 1'; all the filter really does is set the events field to 1. Many plug-ins do not require any other log filters, but this one is almost always present. Make sure you always set the events field to 1! If you omit it, some or all entries will be rejected because they have no non-zero field values.

Since the events field is always 1, when it is summed, it counts the number of events. So if you have 5,000 lines in your dataset, the Overview will show 5,000 for events, or the sum of events=1 over all log entries.

The parameters for numerical fields are:

  label
    This is how the fields will appear in the reports, e.g., this will be the name of the columns in tables. This is typically $lang_stats.field_labels.fieldname; if it is, this must be defined in the field_labels section of LogAnalysisInfo/languages/english/lang_stats.cfg, or it will cause an error when creating a profile.

  default
    "true" if this should be checked in the Numerical Fields page of the Create Profile Wizard.

  requires_log_field
    "true" if this should only be included in the database if the corresponding log field exists. If this is "false", the log field does not have to exist in the log.fields list; it will be automatically added. If this is "true", and the field does not exist in log.fields, it will not be automatically added; instead, the numerical field will be deleted, and will not appear in the database or in reports.

  type
    "int" if this is an integer field (signed, maximum value of about 2 billion on 32-bit systems); "float" if this is a floating point field (fractional values permitted; effectively no limit to size).

  display_format_type
    This specifies how a numerical quantity should be formatted. Options include:

      integer                display as an integer, e.g., "14526554"
      duration_compact       display as a compact duration (e.g., 1y11m5d 12:34:56)
      duration_milliseconds  display as a duration in milliseconds (e.g., 1y11m5d 12:34:56.789)
      duration_microseconds  display as a duration in microseconds (e.g., 1y11m5d 12:34:56.789012)
      duration_hhmmss        display as a duration in h:m:s format (e.g., "134:56:12" for 134 hours, 56 minutes, 12 seconds)
      duration               display as a fully expanded duration (e.g., "1 year, 11 months, 5 days, 12:34:56")
      bandwidth              display as bandwidth (e.g., 5.4MB, or 22kb)
      float                  display as a floating point number, e.g., "123.4567"

    For a full list of options see the documentation on format() at Salang: The Sawmill Language; this parameter is the "T" parameter of format().

  aggregation_method
    "sum" if this field should be aggregated by summing all values; "average" if it should be aggregated by averaging all values; "max" if it should be aggregated by computing the maximum of all values; "min" if it should be aggregated by computing the minimum of all values. If not specified, this defaults to "sum". See below for more about aggregation.

  average_denominator_field
    The name of the numerical database field to use as the denominator when performing the "average" calculation, when aggregation_method is "average". This is typically the "entries" (events) field. See below for more information about aggregation.

  entries_field
    "true" if this is the "entries" (events) field; omit it if it is not.

The numbers which appear in Sawmill reports are usually aggregated. For instance, in the Overview of a firewall analysis, you may see the number of bytes outbound, which is an aggregation of the "outbound bytes" fields in every log entry. Similarly, if you look at the Weekdays report, you will see a number of outbound bytes for each day of the week; each of these numbers is an aggregation of the "outbound bytes" field of each log entry for that day. Usually, aggregation is done by summing the values, and that's what happens in the "outbound bytes" example. Suppose you have 5 lines of log data with the following values for the outbound bytes: 0, 5, 7, 9, 20. In this case, if the aggregation method is "sum" (or if it's not specified), the Overview will sum the outbound bytes to show 41 as the total bytes.

In some cases it is useful to do other types of aggregation. The aggregation_method parameter provides three other types, in addition to "sum": "min", "max", and "average". "Min" and "max" aggregate by computing the minimum or maximum single value across all field values. In the example above, the Overview would show 0 as the value if aggregation method was "min", because 0 is the minimum value of the five. Similarly, it would show 20 as the value if the aggregation method was "max". If the aggregation method was "average", it would sum them to get 41, and then divide by the average_denominator_field value; typically this would be an "entries" (events) field which counts log entries, so its value would be 5, and the average value shown in the Overview would be 41/5, or 8.2 (or just 8, if the type is "int").
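For instance, here is a sketch of how the duration field from the example above could be changed to report an average duration per event, rather than a total (this assumes the events field defined earlier is the entries field):

  duration = {
    label = "$lang_stats.field_labels.duration"
    default = true
    requires_log_field = true
    # "float" so the computed average can be fractional
    type = "float"
    display_format_type = "duration_compact"
    # aggregate by averaging, instead of the default "sum"
    aggregation_method = "average"
    # divide the summed durations by the number of events
    average_denominator_field = "events"
  } # duration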

Log Filters

In the log filters section, you can include one or more log filters. Log filters are extremely powerful, and can be used to convert field values (e.g., to convert destination port numbers to service names), to clean up values (e.g., to truncate the pathname portion of a URL to keep the field simple), to reject entries (e.g., to discard uninteresting error entries), or just about anything else. There aren't many general rules about when to use log filters; in general, you should create the "events" filter (see above) and leave it at that, unless you see something in the reports that doesn't look quite right. If the reports look fine without additional log filters, the events filter is enough; if they don't, a log filter may be able to modify the data to fix the problem.
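For example, here is a sketch of a log.filters section with a second filter which discards uninteresting entries; the protocol field name and the use of "reject" here are illustrative assumptions, not taken from a real plug-in:

  log.filters = {

    mark_entry = {
      label = '$lang_admin.log_filters.mark_entry_label'
      comment = '$lang_admin.log_filters.mark_entry_comment'
      value = 'events = 1'
    } # mark_entry

    # Hypothetical: discard ICMP events so they are not added to the database
    reject_icmp = {
      value = 'if (matches_regular_expression(protocol, "^ICMP$")) then "reject"'
    } # reject_icmp

  } # log.filters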

Database Field Associations

Database fields in most log formats do not need associations: every numerical field is relevant to every non-numerical field. For instance, the number of hits is a relevant number for the day of the week, URL, file type, worm, and every other non-numerical field. For log formats like that (including the example above), there is no need for database field associations, and you can skip this section. For some plug-ins, especially plug-ins which combine multiple disparate formats into a single set of reports, there may be some numerical fields which are not relevant to some non-numerical fields. For instance, if a single plug-in analyzes both error logs (with date, time, and error_message fields, and a number_of_errors numerical field) and access logs (with date, time, hostname, and page fields, and hits and page_views numerical fields), it does not make sense to ask how many errors there were for a particular hostname, or for a particular page, because the error entries contain only date, time, and error_message fields. In this case, you can use database field associations to associate non-numerical fields with the numerical fields that track them. This makes reports more compact and more relevant, makes cross-reference tables smaller, and eliminates all-zero columns from reports.

The database_field_associations node, like the report_groups node, goes inside create_profile_wizard_options. Here is an example, based on the integrated access/error log plug-in described above:

  create_profile_wizard_options = {

    # This shows which numerical fields are related to which non-numerical fields.
    database_field_associations = {

      hostname = {
        hits = true
        page_views = true
      }
      page = {
        hits = true
        page_views = true
      }
      error_message = {
        number_of_errors = true
      }      

    } # database_field_associations

    report_groups = {
      ...
    }

  } # create_profile_wizard_options

Field associations affect two parts of the final profile: the reports, and the cross-reference groups. In the example above, the default hostname and page reports will include only hits and page_views columns, and not a number_of_errors column, and the default error_message report will include only a number_of_errors column, and not a hits or page_views column. Since cross-reference groups are tables which optimize the generation of particular reports, the associated cross-reference groups for those reports will also not have the non-associated fields. Fields which are missing from database_field_associations are assumed to be associated with all numerical fields, so the date_time reports would show all three numerical fields in this case (which is correct, because all log entries, from both formats, include date_time information). If the database_field_associations node itself is missing, all non-numerical fields are assumed to be associated with all numerical fields.

Report Groups: Automatic Report Creation

The report_groups section of a plug-in lists the reports and groups them. This can be done in a simple format, described in this section, or in a more complex and much more flexible format, described in the next section. In the simple approach, Sawmill automatically creates an Overview report, a report for each database field listed, session reports if session information is available, a Log Detail report, and a Single-page Summary report which contains all other reports; in this approach, the purpose of the report_groups description is to specify where these reports appear in the reports menu. In the complex (manual) approach, report_groups specifies not just the position of the reports, but also which reports are created, and what options they use.

Using the simpler automatic format, here is a good choice for the example discussed above:

  create_profile_wizard_options = {

    # How the reports should be grouped in the report menu
    report_groups = {
      date_time_group = ""
      ip_address = true
      location = true
      fieldx = true
      fieldy = true
    } # report_groups

  } # create_profile_wizard_options

The outer group is called create_profile_wizard_options, and it has only one node in it: report_groups, which describes the reports and where they should appear in the left menu. In general, the list of groups will start with date_time_group = "", which specifies that there should be a group in the menu with date/time reports like years/months/days, day of week, and hour of day; this is a special option which is automatically expanded to this:

  date_time_group = {
    date_time = true
    day_of_week = true
    hour_of_day = true
  }

The rest of the database fields can be listed below the date_time_group; you can just grab them from the database.fields section, and change "" to "true". That will put all the reports at the top level of the menu, below the date/time group.

If there are too many reports to fit in a simple flat menu, you can add groups. For instance, you could group fieldx and fieldy like this:

  report_groups = {
    date_time_group = ""
    ip_address = true
    location = true
    fieldxy_group = {
      fieldx = true
      fieldy = true
    }
  } # report_groups

This will put the fieldx and fieldy reports in a separate menu group called fieldxy_group. When you create a new group, you must also define it in LogAnalysisInfo/languages/english/lang_stats.cfg, in the menu.groups section (search for "menu = {"), so the web interface knows what to call the menu group, e.g.:

  menu = {
    groups = {
      fieldxy_group = "Field X/Y"
      ...

Report Groups: Manual Report Creation

The automatic report creation approach described above is sufficient for simple plug-ins, but it has a number of significant drawbacks:

  1. You cannot control which reports are created; for instance, the Log Detail and Single-page Summary reports are always included.
  2. You cannot control report options such as labels, columns, graphs, sort order, or the number of rows displayed.
  3. You cannot create reports which combine several fields (for instance, a "FieldX by FieldY" report).

The manual report creation approach described in this section overcomes all these limitations, because all reports, and their options, are specified manually. However, reports use default values for many options, so it is not necessary to specify very much information per report; in general, you only need to specify the non-default options for each report.

Here's an example of a very basic manual report creation, for the example above:

  create_profile_wizard_options = {

    # How the reports should be grouped in the report menu
    manual_reports_menu = true
    report_groups = {
 
      overview.type = "overview"

      date_time_group = {
        items = {
          date_time = {
            label = "Years/months/days"
            graph_field = "events"
            only_bottom_level_items = false
          }
          days = {
            label = "Days"
            database_field_name = "date_time"
            graph_field = "events"
          }
          day_of_week = {
            graph_field = "events"
          }
          hour_of_day = {
            graph_field = "events"
          }
        }
      } # date_time_group

      ip_address = true
      location = true
      fieldxy_group = {
        items = {
          fieldx = true
          fieldy = true
        }
      }

      log_detail = true
      single_page_summary = true

    } # report_groups

  } # create_profile_wizard_options

This has the same effect as the automatic report grouping described above.

Note:

  1. The option "manual_reports_menu = true" specifies that manual report generation is being used.
  2. The date_time group has been fully specified, as a four-report group.
  3. The first report in the date_time group is the "Years/months/days" report, with label specified by lang_stats.miscellaneous.years_months_days (i.e., in LogAnalysisInfo/languages/english/lang_stats.cfg, in the miscellaneous group, the parameter years_months_days), which graphs the "events" field and shows a hierarchical report (in other words, a normal "Years/months/days" report).
  4. The second report in the date_time group is the "Days" report, with label specified by lang_stats.miscellaneous.days, which graphs the "events" field and shows a hierarchical report (in other words, a normal "Days" report).
  5. The third report in the date_time group is a day_of_week report, which graphs the "events" field.
  6. The fourth report in the date_time group is the hour_of_day report, which graphs the "events" field.
  7. The ip_address, location, fieldx, and fieldy reports are specified the same as in automatic report creation, except for the addition of an "items" group within each group, which contains the reports in the group. Nothing is specified within the reports, so all values are default.
  8. The log_detail and single_page_summary reports are specified manually (they will not be included if they are not specified here).

To simplify manual report creation, there are many default values selected when nothing is specified:

  1. If no label is specified for a group, the label in lang_stats.menu.group.groupname is used (i.e., the value of the groupname node in the "group" node in the "menu" node of the lang_stats.cfg file, which is in LogAnalysisInfo/languages/english). If no label is specified and the group name does not exist in lang_stats, the group name is used as the label.
  2. If no label is specified for a report, and the report name matches a database field name, then the database field label is used as the report label. Otherwise, if lang_stats.menu.reports.reportname exists, that is used as the label. Otherwise, if lang_stats.field_labels.reportname exists, that is used as the label. Otherwise, reportname is used as the label.
  3. If a "columns" group is specified in the report, that is used to determine the columns; the column field names are taken from the field_name value in each listed column. If it contains both numerical and non-numerical columns, it completely determines the columns in the report. If it contains only non-numerical columns, it determines the non-numerical columns, and the numerical columns are those associated with the database_field_name parameter (which must be specified explicitly in the report, unless the report name matches a database field name, in which case that database field is used as the database_field_name); see Database Field Associations. If no columns node is specified, the database_field_parameter is used as the only non-numerical column, and all associated numerical fields are used as the numerical columns.
  4. If a report_menu_label is specified for a report, that value is used as the label in the reports menu; otherwise, the report label is used as the label in the reports menu.
  5. If a "filter" is specified for a report, that filter expression is used as the report filter; otherwise, the report filter is used.
  6. If only_bottom_level_items is specified for a report, the report shows only bottom level items if the value is true, or a hierarchical report if it is false. If it is not specified, the report shows only bottom level items.
  7. If graph_field is specified for a report, a graph is included which graphs that field.
  8. Any other options specified for a report will be copied over to the final report. For instance, graphs.graph_type can be set to "pie" to make the graph a pie chart (instead of the default bar chart), or "ending_row" can be set to change the number of rows from the default 20.

For instance, here is an advanced example of manual report grouping, again using the example above:

  create_profile_wizard_options = {

    # How the reports should be grouped in the report menu
    manual_reports_menu = true
    report_groups = {
 
      overview.type = "overview"

      date_time_group = {
        items = {
          days = {
            label = "Days"
            database_field_name = "date_time"
            graph_field = "duration"
          }
          day_of_week = {
            graph_field = "duration"
          }
        }
      } # date_time_group

      ip_address = true
      location = true
      fieldxy_group = {
        label = "XY"
        items = {

          fieldx = {
            label = "Field X Report (XY Group)"
            report_menu_label = "Field X Report"
          }

          fieldy = {
            sort_by = "events"
            sort_direction = "ascending"
            graph_field = "events"
            graphs.graph_type = "pie"
          } # fieldy

          fieldx_by_fieldy = {
            label = "FieldX by FieldY"
            ending_row = 50
            columns = {
              0.field_name = "fieldx"
              1.field_name = "fieldy"
              2.field_name = "events"
              3.field_name = "duration"
            } # columns
          } # fieldx_by_fieldy

        } # items

      } # fieldxy_group

      log_detail = true

    } # report_groups

  } # create_profile_wizard_options

Notes About the Example Above:

  1. The date_time group contains only the Days and Day of Week reports, and both graph the "duration" field rather than "events".
  2. The fieldxy_group specifies its label ("XY") directly, overriding the default lang_stats lookup.
  3. The fieldx report has a label ("Field X Report (XY Group)") which differs from the label shown in the reports menu ("Field X Report"), via report_menu_label.
  4. The fieldy report is sorted by the events field, in ascending order, and its graph is a pie chart rather than the default bar chart.
  5. The fieldx_by_fieldy report specifies its columns explicitly, making it a two-field report showing fieldx, fieldy, events, and duration; ending_row = 50 shows 50 rows instead of the default 20.
  6. There is no single_page_summary report listed, so no Single-page Summary will be created.

Debugging

Log format plug-ins are almost always too complex to get right the first time; there is almost always a period of debugging after you've created one, where you fix the errors. By far the most useful debugging tool available is the command-line database build with the -v option. Once you've created a profile from the plug-in, build the database from the command line like this:

  sawmill -p profilename -a bd -v egblpfdD | more

(use Sawmill.exe on Windows). That will build the database, and while it builds, it will print detailed information about what it's doing. Look for the lines that start with "Processing" to see Sawmill examining each line of log data. Look for the lines that start with "Marking" to see where it's putting data into the database, and check the values it's putting into the database to see if they look right. In between, look for the values it's extracting from the log data into the log fields, to be sure the field values are what they should be. If you're using regular expressions to parse, Sawmill will show you what the expression is, what it's matching it against, whether it actually matched, and if it did, what the subexpressions were. Careful examination of this output will turn up any problems in the plug-in.

When you've found a problem, fix it in the plug-in, then run this to recreate your profile:

  sawmill -p profilename -a rp

Then rerun the database build with the command above, and repeat until everything seems to be going smoothly. If the data seems to be populating into the database properly, switch to a normal database build, without debugging output:

  sawmill -p profilename -a bd

When the build is done, look at the reports in the web interface; if you see any problems, you can return to the debugging output build to see how the data got in there.

When you have your log format plug-in complete, please send it to us! We'd be delighted to include your plug-in as part of Sawmill.

Creating Plug-ins for Syslog Servers and Syslogging Devices

To analyze log data logged to a syslog server with Sawmill, you must have two plug-ins: one to analyze the syslog header, and one to analyze the device's message. Since Sawmill already supports many syslog formats and syslogging devices, it is not usually necessary to create both; usually, you'll find that the syslog format is already supported, so you only need to create the logging device plug-in, or vice versa. But if neither is supported, you will need to create both plug-ins to process the data.

A syslog plug-in is slightly different from a normal plug-in. First, the log.miscellaneous.log_data_type option is set to "syslog":

  log.miscellaneous.log_data_type = "syslog"

Secondly, the plug-in should only define log fields and database fields which are in the syslog header. These always include date and time; often they include the logging device IP address, and sometimes they include the syslog priority or other fields. Syslog plug-ins must always use log parsing filters to collect field values into the collected entry with the empty key, for example:

  set_collected_field('', 'date', $1)

See above for information on using set_collected_field() to collect fields into entries. Syslog plug-ins must set the variable volatile.syslog_message to the message field. Syslog plug-ins should not accept entries; that is the responsibility of the syslogging device plug-in. Syslog plug-ins should always use "auto" as the date and time format; if the actual format is something else, they must use normalize_date() and normalize_time() to normalize the date and time into a format accepted by "auto". Other than that, syslog plug-ins are the same as other plug-ins.
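For example, here is a minimal sketch of a syslog plug-in's parsing filter, assuming a hypothetical syslog header of the form "date time device-ip message" (the device_ip field name is illustrative):

  log.parsing_filters.parse = `
if (matches_regular_expression(current_log_line(),
      '^([0-9/]+) ([0-9:]+) ([0-9.]+) (.*)$')) then (
  set_collected_field('', 'date', $1);
  set_collected_field('', 'time', $2);
  set_collected_field('', 'device_ip', $3);
  volatile.syslog_message = $4;
)
`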

A syslogging device plug-in (a.k.a., a syslog_required plug-in) is also slightly different from a normal plug-in. First, the log.miscellaneous.log_data_type option is set to "syslog_required":

  log.miscellaneous.log_data_type = "syslog_required"

Secondly, the plug-in should only define log fields and database fields which are in the syslog message. These vary by format, but do not include date, time, logging device IP, or syslog priority. Syslogging device plug-ins must always use log parsing filters. Since the syslog plug-in collected date and time into the empty-key entry, syslogging device plug-ins must copy those over to another key if they use keyed collected entries; if they do not use keys, they can just collect all their fields into the empty-key collected entry. The syslogging device plug-in should accept collected entries.
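As a sketch (assuming, as described above, that the device plug-in parses the message text which the syslog plug-in placed in volatile.syslog_message, and that it uses the empty-key collected entry; the message layout and field names here are invented for illustration), a syslog_required plug-in's parsing filter might look like this:

  log.parsing_filters.parse = `
if (matches_regular_expression(volatile.syslog_message,
      '^user ([^ ]+) logged in from ([0-9.]+)$')) then (
  set_collected_field('', 'username', $1);
  set_collected_field('', 'source_ip', $2);
  accept_collected_entry('', false);
)
`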

Getting Help

If you have any problems creating a custom format, please contact support@sawmill.net -- we've created a lot of formats, and we can help you create yours. If you create a log format file for a popular format, we would appreciate it if you could email it to us, for inclusion in a later version of Sawmill.