{= include("docs.util"); start_docs_page(docs.technical_manual.page_titles.filters); =}
$PRODUCT_NAME offers a wide range of log filters, which let you selectively eliminate portions of your log data from the statistics, or convert values in log fields. Log filters are written in the configuration language (see {=docs_chapter_link('salang')=}). Log filters should not be confused with the Filters that appear in reports; log filters affect how the log data is processed, and report Filters affect which parts of the database data are displayed. There are many reasons you might want to filter the log data, including:
You may not be interested in seeing the hits on files of a particular type (e.g. image files, in web logs).
You may not be interested in seeing the events from a particular host or domain (e.g. web log hits from your own domain, or email from your own domain for mail logs).
For web logs, you may not be interested in seeing hits which did not result in separate page views, like 404 errors (file not found) or redirects.
$PRODUCT_NAME's default filters automatically perform the most common filtering (they categorize image files as hits but not page views, strip off page parameters, and more) but you will probably end up adding or removing filters as you fine-tune your statistics.
Filters are arranged in a sequence, like the statements of a computer program. Each time $PRODUCT_NAME processes a log entry, it runs the filters in order, starting with the first one. A filter may accept the log entry by returning "done", in which case the entry is immediately selected for inclusion in the statistics; no further filters are run, and the acceptance is final. Alternatively, the filter may reject the entry by returning "reject", in which case the entry is immediately discarded, without consulting any filters farther down the line. Finally, the filter may neither accept nor reject, but instead pass the entry on to the next filter (by returning nothing); in this case, and only in this case, the next filter is run.
In other words, every filter has complete power to accept or reject entries, provided the entries make their way to that filter. The first filter that accepts or rejects the entry ends the process, and the filtering is done for that entry. A filter gets to see an entry only when every filter before it in the sequence has neither accepted nor rejected that entry. So the first filter in the sequence is the most powerful, in the sense that it can accept or reject without consulting the others; the second filter is used only if the first has no opinion on whether the entry should be accepted or rejected, and so on.
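The evaluation order described above can be modeled in a few lines of Python. This is only an illustrative sketch, not $PRODUCT_NAME's implementation: the names run_filters, accept_home_page, and reject_gif are hypothetical, and the rule that an entry is kept when no filter accepts or rejects it is an assumption that the text above does not state.

```python
from typing import Callable, Dict, List, Optional

LogEntry = Dict[str, str]
# A filter returns "done" (accept), "reject", or None (no opinion).
LogFilter = Callable[[LogEntry], Optional[str]]


def run_filters(entry: LogEntry, filters: List[LogFilter]) -> bool:
    """Return True if the entry is kept, False if it is discarded."""
    for log_filter in filters:
        verdict = log_filter(entry)
        if verdict == "done":
            return True   # accepted: final, later filters never run
        if verdict == "reject":
            return False  # rejected: discarded immediately
        # None: no opinion, so the next filter gets to see the entry
    # Assumption: an entry that no filter accepts or rejects is kept.
    return True


def accept_home_page(entry: LogEntry) -> Optional[str]:
    """Hypothetical filter: always keep hits on the home page."""
    return "done" if entry["page"] == "/index.html" else None


def reject_gif(entry: LogEntry) -> Optional[str]:
    """Hypothetical filter: discard GIF hits."""
    return "reject" if entry["file_type"] == "GIF" else None


entry = {"page": "/index.html", "file_type": "GIF"}
kept_when_accept_first = run_filters(entry, [accept_home_page, reject_gif])
kept_when_reject_first = run_filters(entry, [reject_gif, accept_home_page])
```

The same entry gets a different outcome depending on which filter comes first, which is the sense in which earlier filters are more powerful.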
The easiest way to create log filters is in the Log Filter Editor, in the Log Filters section of the Config. To get to the Log Filter Editor, click Show Config in the Profiles list (or click Config in the reports), then click Log Data on the left, then click Log Filters. The Log Filter Editor lets you create new filters using a friendly graphical interface, without having to write advanced filter expressions. However, some filtering operations cannot be performed without advanced filter expressions, so the Log Filter Editor also provides an option to enter an expression.
The following filter rejects GIF files in web logs:
if (file_type eq 'GIF') then "reject"
The following filter rejects web log hits that come from your own domain:
if (ends_with(hostname, ".mydomain.com")) then "reject"
You can use a similar filter to filter out hits from a particular hostname:
if (hostname eq "badhost.somedomain.com") then "reject"
This type of filter can be used on any field, to accept and reject based on any criteria you wish.
Field names that appear in filters (like file_type or hostname above) must match the field names exactly as they appear in the profile (not the field label, which is used for display purposes only and might be something like "file type"). Field names never contain spaces, and are always lowercase with underscores between words.
The host filter above can be modified slightly to filter out entries based on any field. One common example is if you want to filter out hits on particular pages, for instance to discard hits from worm attacks. A filter like this:
if (starts_with(page, "/default.ida?")) then "reject"
rejects all hits on /default.ida, which eliminates many of the hits from the Code Red worm.
A filter like this:
if (!starts_with(page, "/directory1/")) then "reject"
rejects all hits except those on /directory1/, which can be useful if you want to create a database which focuses on only one directory (sometimes useful for ISPs).
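To see which values the prefix and suffix tests above would catch, here is a rough Python model. It assumes that starts_with and ends_with are plain string prefix and suffix comparisons, and the function names and sample values are hypothetical.

```python
def rejects_code_red(page: str) -> bool:
    """Models: if (starts_with(page, "/default.ida?")) then "reject"."""
    return page.startswith("/default.ida?")


def rejects_outside_directory1(page: str) -> bool:
    """Models: if (!starts_with(page, "/directory1/")) then "reject"."""
    return not page.startswith("/directory1/")


def rejects_own_domain(hostname: str) -> bool:
    """Models: if (ends_with(hostname, ".mydomain.com")) then "reject"."""
    return hostname.endswith(".mydomain.com")


# Hypothetical sample values, for illustration only.
pages = ["/default.ida?NNNN", "/directory1/a.html", "/directory2/b.html"]
kept_pages = [p for p in pages if not rejects_outside_directory1(p)]
```

Under that assumption, a suffix test on ".mydomain.com" (with the leading dot) matches www.mydomain.com but not the bare hostname mydomain.com, so a second test may be needed if the bare name appears in your logs.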
The following filter rejects entries before 2004:
if (date_time_to_epoch(date_time) < date_time_to_epoch('01/Jan/2004 00:00:00')) then "reject"
The following filter rejects entries older than 30 days (60*60*24*30 is the number of seconds in 30 days):
if (date_time_to_epoch(date_time) < (now() - 60*60*24*30)) then "reject"
The following filter rejects all entries except those in 2003:
if ((date_time_to_epoch(date_time) < date_time_to_epoch('01/Jan/2003 00:00:00')) or (date_time_to_epoch(date_time) >= date_time_to_epoch('01/Jan/2004 00:00:00'))) then "reject"
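The date filters above all work by comparing epoch seconds rather than date strings. The Python sketch below mirrors that arithmetic so the boundary cases can be checked. It is a model only: to_epoch is a hypothetical stand-in for the filter language's date conversion, and treating timestamps as UTC is an assumption.

```python
import calendar
import time

LOG_DATE_FORMAT = "%d/%b/%Y %H:%M:%S"   # e.g. 01/Jan/2004 00:00:00
THIRTY_DAYS = 60 * 60 * 24 * 30         # seconds in 30 days


def to_epoch(date_time: str) -> int:
    """Convert a log timestamp to epoch seconds (assumes UTC)."""
    return calendar.timegm(time.strptime(date_time, LOG_DATE_FORMAT))


def before_2004(date_time: str) -> bool:
    """True if the entry would be rejected by the 'before 2004' filter."""
    return to_epoch(date_time) < to_epoch("01/Jan/2004 00:00:00")


def older_than_30_days(date_time: str, now: float = None) -> bool:
    """True if the entry is more than 30 days older than 'now'."""
    if now is None:
        now = time.time()
    return to_epoch(date_time) < now - THIRTY_DAYS


def outside_2003(date_time: str) -> bool:
    """True if the entry falls before or after calendar year 2003."""
    t = to_epoch(date_time)
    return (t < to_epoch("01/Jan/2003 00:00:00")
            or t >= to_epoch("01/Jan/2004 00:00:00"))
```

Note that the upper bound in the 2003 filter uses >= so that an entry stamped exactly at midnight on 01/Jan/2004 is rejected, while the lower bound uses < so that midnight on 01/Jan/2003 is kept.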
The parameters on the page field (the part after the ?) are often of little value, and increase the size of the database substantially. Because of that, $PRODUCT_NAME includes a default filter that strips off everything after the ? in a page field (hint: if you need the parameters, delete the filter). $PRODUCT_NAME uses a special "replace everything after" filter for this use, but for the purpose of this example, here's another filter that does the same thing (but slower, because pattern matching is a fairly slow operation):
if (contains(page, "?")) then if (matches_regular_expression(page, "^([^?]*[?]).*\$")) then page = \$1 . "(parameters)"
This checks if the page contains a question mark; if it does, it matches the page against a regular expression with a parenthesized subexpression, which captures just the part before and including the question mark. The variable \$1 is set automatically to the first parenthesized section, and this variable is used to set the page field to the part before and including the question mark, with "(parameters)" appended. For example, if the original value was /index.html?param1+param2, the result will be /index.html?(parameters). That is exactly what we wanted: the parameters have been stripped off, so all hits on index.html with parameters will have the same value, regardless of the parameters, and that will reduce the size of the database.
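The transformation described above can be tried out in Python. This is a sketch of the same idea rather than the filter language itself: Python's re module stands in for matches_regular_expression, the pattern shown is one way to capture everything up to and including the first question mark, and strip_parameters is a hypothetical name.

```python
import re

# Group 1 captures everything up to and including the first "?".
PARAMS_PATTERN = re.compile(r"^([^?]*\?).*$")


def strip_parameters(page: str) -> str:
    """Replace everything after the first '?' with '(parameters)'."""
    if "?" not in page:
        return page  # cheap test first; skip the slower regex match
    match = PARAMS_PATTERN.match(page)
    if match is None:
        return page
    return match.group(1) + "(parameters)"


result = strip_parameters("/index.html?param1+param2")
# result == "/index.html?(parameters)"
```

A lazy capture such as (.*?) followed directly by .* would capture an empty string, so the question mark needs to be matched literally inside the group.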
The filters look the same in profile files, so you can also edit a filter in the profile file using a text editor. You will need to use a backslash (\\) to escape quotes, dollar signs, backslashes, and other special characters if you edit the profile file directly.
{= end_docs_page() =}