{=
  include("docs.util");
  start_docs_page(docs.technical_manual.page_titles.filters);
=}

<p>$PRODUCT_NAME offers a wide range of log filters, which let you selectively
eliminate portions of your log data from the statistics,
or convert values in log fields.  Log filters are written in the configuration language (see {=docs_chapter_link('salang')=}).
Log filters should not be confused with the Filters that appear in reports; 
log filters affect how the log data is processed, and report Filters affect 
which parts of the database data are displayed. 
There are many reasons you might want to filter the log data, including:</p>
<ul>
  <li> <p>You may not be interested in seeing the hits on files of a particular
       type (e.g. image files, in web logs).</p>
  <li> <p>You may not be interested in seeing the events from a particular host or
       domain (e.g. web log hits from your own domain,
       or email from your own domain for mail logs).</p>
  <li> <p>For web logs, you may not be interested in seeing hits which did not result
       in separate page views, like 404 errors (file not found) or
       redirects.</p>
</ul>

<p>$PRODUCT_NAME's default filters automatically perform the most common filtering: they categorize image files as hits but not page views, strip off page parameters, and more. You will probably end up adding or removing filters as you fine-tune your statistics.</p>

<h3>How Filters Work</h3>

<p>Filters are arranged in a sequence, like the statements of a computer program.  Each time $PRODUCT_NAME processes a log entry, it runs the filters in order, starting with the first one.  A filter may accept the log entry by returning "done", in which case the entry is immediately selected for inclusion in the statistics and no further filters are run; once a filter accepts, the acceptance is final.  Alternatively, the filter may reject the entry by returning "reject", in which case the entry is immediately discarded, without consulting any filters farther down the line.  Finally, the filter may neither accept nor reject, but instead pass the entry on to the next filter (by returning nothing); in this case, and <i>only</i> in this case, the next filter is run.</p>

<p>In other words, every filter has complete power to accept or reject entries, provided the entries make their way to that filter.  The first filter that accepts or rejects the entry ends the process, and the filtering is done for that entry.  A filter gets to see an entry <i>only</i> when every filter before it in the sequence has neither accepted nor rejected that entry.  So the first filter in the sequence is the most powerful, in the sense that it can accept or reject without consulting the others; the second filter is consulted only if the first has no opinion on whether the entry should be accepted or rejected, and so on.</p>
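
<p>As an illustration of this ordering, consider a hypothetical two-filter sequence built from the expressions used in the examples later in this chapter. The directory name is a placeholder, and the use of "done" follows the description above. The first filter is:</p>
<pre>  if (starts_with(page, "/directory1/")) then "done"</pre>
<p>and the second filter is:</p>
<pre>  if (file_type eq 'GIF') then "reject"</pre>
<p>With the filters in this order, a GIF under /directory1/ should be accepted, because the first filter returns "done" and the second filter never sees the entry. A GIF elsewhere gets no verdict from the first filter, reaches the second, and is rejected. Swapping the two filters would reject every GIF, including those under /directory1/.</p>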

<h3>Web Logs: Hits vs. Page Views</h3>
<p>For web logs (web server and HTTP proxy), $PRODUCT_NAME distinguishes between "hits" and "page views" for most types of logs.  A "hit" is one access to the web server; i.e. one request for a file (it may not actually result in the transfer of a file, for instance if it's a redirect or an error). A "page view" is an access to a page (rather than an image or a support file like a style sheet).  For some web sites and some types of analysis, image files, .class files, .css files, and other support files are not as important as HTML pages--the important number is how many pages were accessed, not how many images were downloaded.  For other sites and other types of analysis, all accesses are important. $PRODUCT_NAME tracks both types of accesses.  When a filter accepts an entry, it decides whether it is a hit or a page view by setting the "hits" and "page_views" fields to 1 or 0.  Hits and page views are tallied separately, and the final statistics can show separate columns for hits and page views in tables, as well as separate pie charts and graphs. Both hits and page views contribute to bandwidth and visitor counts, but the page view count is not affected by hits on image files and other support files.</p>
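
<p>As a minimal sketch of that mechanism, a filter along the following lines would keep image requests as hits while excluding them from the page view count. It assumes the "page_views" field can be assigned in the same way as the page field in the parameter-stripping example later in this chapter, and the 'JPG' value is illustrative; check the default filters in your profile for the exact form $PRODUCT_NAME uses:</p>
<pre>  if ((file_type eq 'GIF') or (file_type eq 'JPG')) then page_views = 0</pre>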

<h3>The Log Filter Editor</h3>
<p>The easiest way to create log filters is in the Log Filter Editor, in the Log Filters section of the Config.
To get to the Log Filter Editor, click Show Config in the Profiles list (or click Config in the reports),
then click Log Data down the left, then click Log Filters.  The Log Filter Editor lets you create new filters
using a friendly graphical interface, without having to write advanced filter expressions.  However, some
filtering operations cannot be performed without advanced filter expressions, so the Log Filter Editor
also provides an option to enter an expression.</p>

<h3>Advanced Expression Examples</h3>
<p>This section includes some examples of how filter expressions can be used, and how they are put together.</p>

<h4>Example: Filtering out GIFs</h4>
<p>The following filter rejects GIF files in web logs:</p>
<pre>  if (file_type eq 'GIF') then "reject"</pre>

<h4>Example: Filtering out domains or hosts</h4>
<p>The following filter rejects web log hits from your own domain:</p>
<pre>  if (ends_with(hostname, ".mydomain.com")) then "reject"</pre>
<p>You can use a similar filter to filter out hits from a particular hostname:</p>
<pre>  if (hostname eq "badhost.somedomain.com") then "reject"</pre>

<p>This type of filter can be used on any field, to accept and reject 
based on any criteria you wish.</p>
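
<p>Conditions can also be combined. For example, the following sketch rejects hits from either of two hosts; the second hostname is a placeholder, and the "or" operator is used as in the date range example later in this section:</p>
<pre>  if ((hostname eq "badhost.somedomain.com") or (hostname eq "otherhost.somedomain.com")) then "reject"</pre>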

<p>Field names that appear in filters (like file_type or hostname above) should be exactly
the field names as they appear in the profile (not the field label, which is used for
display purposes only and might be something like "file type").  Field names never contain
spaces, and are always lowercase with underbars between words.</p>

<h4>Example: Filtering out pages or directories</h4>
<p>The host filter above can be modified slightly to filter out entries based on any field. A common use is to filter out hits on particular pages, for instance to discard hits from worm attacks.  A filter like this:</p>
<pre>  if (starts_with(page, "/default.ida?")) then "reject"</pre>
<p>rejects all hits on pages beginning with /default.ida?, which eliminates many of the hits from the Code Red worm.</p>
<p>A filter like this:</p>
<pre>  if (!starts_with(page, "/directory1/")) then "reject"</pre>
<p>rejects all hits <i>except</i> those on /directory1/, which can be 
useful if you want to create a database which focuses on only 
one directory (sometimes useful for ISPs).</p>

<h4>Example: Filtering out events before a particular date</h4>
<p>The following filter rejects entries before 2004:</p>
<pre>  if (date_time_to_epoch(date_time) < date_time_to_epoch('01/Jan/2004 00:00:00')) then "reject"</pre>

<h4>Example: Filtering out events older than a particular age</h4>
<p>The following filter rejects entries older than 30 days (60*60*24*30 is the number of seconds in 30 days):</p>
<pre>  if (date_time_to_epoch(date_time) < (now() - 60*60*24*30)) then "reject"</pre>

<h4>Example: Filtering out events outside a particular date range</h4>
<p>The following filter rejects all entries except those in 2003:</p>
<pre>  if ((date_time_to_epoch(date_time) < date_time_to_epoch('01/Jan/2003 00:00:00')) or (date_time_to_epoch(date_time) >= date_time_to_epoch('01/Jan/2004 00:00:00'))) then "reject"</pre>

<h4>Advanced Example: Converting the page field to strip off parameters</h4>
<p>The parameters on the page field (the part after the ?) are often of little value, and increase the size of the database substantially. Because of that, $PRODUCT_NAME includes a default filter that strips off everything after the ? in a page field (hint: if you need the parameters, delete the filter). $PRODUCT_NAME uses a special "replace everything after" filter for this use, but for the purpose of this example, here's another filter that does the same thing (but slower, because pattern matching is a fairly slow operation):</p>
<pre>  if (contains(page, "?")) then 
    if (matches_regular_expression(page, "^([^?]*[?]).*\$")) then 
      page = \$1 . "(parameters)"</pre>
<p>This checks if the page contains a question mark; if it does, it matches the page to a regular expression with a parenthesized subexpression which is set to just the part before and including the question mark.  The variable \$1 is set automatically to the first parenthesized section, and this variable is used to set the page field to the part before and including the question mark, with "(parameters)" appended. For example, if the original value was /index.html?param1+param2, the result will be /index.html?(parameters).  That is exactly what we wanted--the parameters have been stripped off, so all hits on index.html with parameters will have the same value, regardless of the parameters--and that will reduce the size of the database.</p>

<p>The filters look the same in profile files, so you can also edit a filter in the profile file using a text editor.  You will need to use a backslash (\\) to escape quotes, dollar signs, backslashes, and other special characters if you edit the profile file directly.</p>

{= end_docs_page() =}