Log Fields
Log fields are containers which hold particular values in the log data, or which act as variables to hold other values. In general, each log entry contains multiple fields; Sawmill extracts those field values from the log data and populates the log fields with them. The log fields are then processed by log filters, and if the entry is accepted by the filters, the log fields are copied to database fields to be included in the database. The database fields are then used to generate reports.
For instance, if a log file contains three comma-separated fields per line (date, time, and page), then the log fields would be date, time, and page, plus any derived fields, like hour_of_day; see Derived Fields below. Log fields can have any names, and the log fields in a profile depend on the log format.
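As a rough illustration of the example above, the following Python sketch splits one comma-separated line into named log fields and adds a derived hour_of_day value. It is not Sawmill code: the sample line, the date and time formats, and the names SAMPLE_LOG, LOG_FIELD_NAMES, and parse_entry are all assumptions made for this sketch.

```python
import csv
from io import StringIO

# Hypothetical log data: three comma-separated fields per line (date, time, page).
SAMPLE_LOG = "1999-03-03,09:34:56,/one/sample/page.html\n"

# Field order assumed for this sketch; in a real profile it depends on the log format.
LOG_FIELD_NAMES = ["date", "time", "page"]


def parse_entry(line):
    """Split one log line into named log fields and add one derived field."""
    values = next(csv.reader(StringIO(line)))
    entry = dict(zip(LOG_FIELD_NAMES, values))
    # Derived field: computed from the time field, not present in the log line.
    entry["hour_of_day"] = int(entry["time"].split(":")[0])
    return entry


entry = parse_entry(SAMPLE_LOG)
print(entry)
```

The resulting dictionary holds the three actual fields read from the line plus the one derived field, which mirrors the distinction the rest of this page draws between the two kinds of log field.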
A log field may be either an actual field or a derived field. An actual field is present in each log entry of the file; for instance, the "page" field and the "hostname" field are actual fields. A derived field is not present in log entries, but is derived from the fields which are present, and from other information. Derived fields include fields like "domain description," which is a textual description of the host domain (e.g. "France" for .fr) and is derived from the hostname field; the "day of week" field, which is derived from the date/time field; and the "operating system" field, which is derived from the user-agent field. Derived fields are present when the fields they get their value from are present; they are created automatically when their source fields are created. See below for more information about specific derived fields.
For actual (non-derived) fields, you can specify a number of parameters. Here is an example node describing a "page" field of a web log:
page = {
  type = "page"
  index = "5"
  subindex = "2"
  hierarchy_dividers = "/"
  leading_divider = true
  left_to_right = true
  label = "$lang_stats.field_labels.page"
} # page
The possible parameters (subnodes) are as follows:
- name: The name of this field. This is the name of the node which describes the field; i.e., it is the part before the first equal sign (=). In the example above, this is page. This is the internal name, which is used in command lines and advanced expressions.
- label: The label of this field. This is used to refer to the field in the web interface, in reports, the Log Filter Editor, and more.
- type: The type of the log field. The type describes the format of the field, and sometimes also describes the purpose of the field. Allowable field types are:
  - page: The "page hit" field of a web log, or any field in a /-divided pathname format; e.g. /mypages/dir/file.html. This also acts like a "hierarchical" field.
  - host: The browsing hostname, or any field in hostname format; e.g. my.host.com. This also acts like a "hierarchical" field. The domain description, country, region, city, location, and organization fields are derived from this field.
  - url: Any field in URL format (e.g. http://hostname/page.html). The search phrase field is derived from this field.
  - date_time: A combined date/time field. The format of this field depends on the setting of Date format and Time format. If this field is not present in the log, but date and time are, this field will be derived from those fields. The day of week, hour of day, week of year, and day of year fields are derived from this field.
  - date: A date field: year, month, and day. The format of this field depends on the setting of Date format.
  - time: A time of day: hour, minute, and second. The format of this field depends on the setting of Time format.
  - agent: An agent, or browser, field. The browser OS, browser type, and browser version fields are derived from this field. This also acts like a "flat" field.
  - size: The size field, as in bytes transferred. This field will be used to compute bandwidth information. This also acts like an "integer" field. The size range field is derived from this field.
  - integer: Any field whose value is an integer (e.g. 67). This also acts like a "flat" field.
  - response: The "server response" field, containing the numeric HTTP server response code (e.g. 200, 404).
  - hierarchical: Any field which is multi-level hierarchical. The hierarchy divider and other parameters can be specified below. See Hierarchies and Fields.
  - flat: Any field which is hierarchically flat; all items are directly below the root of the hierarchy. See Hierarchies and Fields.
- index: This specifies the index of this log field in the log data. For instance, if this is the first log field in the entry, this should be 1. If this is the fifth log field in the line, it should be 5. This can be left 0 if the log field is being filled in by a parsing regular expression or by parsing filters.
- subindex: This specifies the subindex of this log field in the log data. This can usually be left at 0. A subindex is required only when the log field is contained inside another quoted field. In that case, the position of the quoted field is specified using the index, and the subindex indicates the position within the quoted field, counting space-separated subfields. This can be left 0 if the log field is being filled in by a parsing regular expression or by parsing filters.
- hierarchy_dividers: This specifies the character(s) which divide hierarchy levels in this field, if this field is hierarchical. Up to three characters may be specified. For instance, in a standard "page" field (e.g. /one/sample/page.html), the divider would be / or /? or /?& . Include the ? if you want the page field to be split on the URL parameters divider; include the & if you want it split between parameters. See Hierarchies and Fields.
- left_to_right: This specifies whether the hierarchy is left-to-right, i.e. with the higher hierarchy levels at the left, like /one/sample/page.html, or right-to-left, i.e. with the higher hierarchy levels at the right, like some.hostname.com. See Hierarchies and Fields. The value is true or false.
- leading_divider: This specifies whether the field has a hierarchy divider at its highest end, e.g. /one/sample/page.html, which starts with /, or not, e.g. one/sample/page.html or some.hostname.com, which have hierarchy dividers only inside the field. The value is true or false.
- case_sensitive: This controls whether the field is case-sensitive. When this is false, Sawmill treats items as the same if they differ only in case (uppercase/lowercase); for instance, index.html and Index.html are treated as the same page. In that case, the case of the item as it appears in statistics is determined by the case of the first item Sawmill sees while processing the log data; so if the first hit is on index.html and the second is on Index.html, it will appear in the statistics as two hits on index.html. When this option is true, the two items are treated as different; in the example above, the statistics would list one hit on index.html, and one hit on Index.html. The value is true or false.
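To make the interplay of hierarchy_dividers, left_to_right, and leading_divider more concrete, here is a small Python sketch that breaks a field value into its components, ordered from the highest hierarchy level to the lowest. This is only an illustration of the parameters as described above, not Sawmill's actual algorithm; the function name hierarchy_levels and its return format are assumptions.

```python
def hierarchy_levels(value, dividers="/", left_to_right=True, leading_divider=True):
    """Return the components of value, ordered from highest hierarchy level to lowest.

    Illustrative only: models the three parameters described above, not
    Sawmill's internal representation of a hierarchy.
    """
    # A leading divider sits at the highest end of the value: the left end for
    # left-to-right hierarchies, the right end for right-to-left ones.
    if leading_divider and value:
        if left_to_right and value[0] in dividers:
            value = value[1:]
        elif not left_to_right and value[-1] in dividers:
            value = value[:-1]

    # Split on any of the divider characters.
    parts, current = [], ""
    for ch in value:
        if ch in dividers:
            parts.append(current)
            current = ""
        else:
            current += ch
    parts.append(current)

    # Right-to-left hierarchies have their highest level at the right.
    return parts if left_to_right else parts[::-1]


print(hierarchy_levels("/one/sample/page.html"))
print(hierarchy_levels("some.hostname.com", dividers=".",
                       left_to_right=False, leading_divider=False))
print(hierarchy_levels("/dir/page.html?a=1&b=2", dividers="/?&"))
```

The third call shows the effect of using /?& as the dividers: the URL parameters become further levels below the page itself.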
Derived Fields
Derived fields are log fields whose values are computed from actual (non-derived) log fields. Like actual log fields, derived fields can then be used in database fields. Possible derived fields are:
- domain_description. This field is derived from the hostname field. It is a textual description of the top-level domain of the hostname. For instance, the host myhost.fr has a domain description of "France (.fr)," and myhost.edu has a domain description of "Educational (.edu)."
- country. This field is derived from the hostname field. It is the country of the IP/hostname (e.g. France). This is computed using the Geo-IP database.
- region. This field is derived from the hostname field. It is the region of the IP/hostname (e.g. Illinois). This is computed using the Geo-IP database.
- city. This field is derived from the hostname field. It is the city of the IP/hostname (e.g. Moscow). This is computed using the Geo-IP database.
- location. This field is derived from the hostname field. It is the country, region, and city (see above), divided by slashes. This is computed using the Geo-IP database.
- organization. This field is derived from the hostname field. It is the organization which owns or manages the hostname (IP address). This is computed using the GeoIP Organization database, if installed. GeoIP Organization is not included by default; it must be purchased from MaxMind, and installed in the LogAnalysisInfo directory (it must be the binary version, and named GeoIPOrg.dat).
- isp. This field is derived from the hostname field. It is the Internet Service Provider (ISP) which provides the Internet connection for the hostname (IP address). This is computed using the GeoIP ISP database, if installed. GeoIP ISP is not included by default; it must be purchased from MaxMind, and installed in the LogAnalysisInfo directory (it must be the binary version, and named GeoIPISP.dat).
- domain. This field is derived from the hostname field. It is the domain of the hostname (IP address), as computed using the GeoIP Domain database, if installed. GeoIP Domain is not included by default; it must be purchased from MaxMind, and installed in the LogAnalysisInfo directory (it must be the binary version, and named GeoIPDomain.dat).
- referrer_domain. This field is derived from the referrer field. It is a textual description of the top-level domain of the referrer. For instance, the referrer http://www.myhost.fr/index.html has a domain description of "France (.fr)," and http://myhost.edu/ has a domain description of "Educational (.edu)."
- size_range. This field is derived from the size field. It is the power-of-ten range into which the size of the transferred object falls. For instance, if an HTML page is transferred which is 6k in size, this field will be "1k - 10k." If a 17k page is transferred, this will be "10k - 100k."
- operating_system. This field is derived from the agent field. This is the operating system used by the browsing user, e.g. "Windows ME."
- web_browser. This field is derived from the agent field. This is the web browser type and version used by the visitor. For instance, if the visitor browsed your web site with Netscape 5.0, this will be "Netscape 5.0." This field is hierarchical, so the top level will show just "Netscape", and clicking it will show the version and minor version numbers.
- search_engine. This field is derived from the referrer field. It shows the search engine used by the visitor to find your site. For instance, if the user searched in Google for "noodle recipes," and found your site, the value of this field will be "Google." Most hits are not the direct result of clicking in a web search engine's list page, so for most hits, this field will be empty. Sawmill determines which search engine contributes which hit by comparing the referrer field with the values listed in the LogAnalysisInfo/SearchEngines file, so you can modify that file if you want to add support for new search engines.
- search_phrase. This field is derived from the referrer field. It shows the search phrase used in web search engines by the visitor who contributed to this hit. For instance, if the user searched in AltaVista for "noodle recipes," and found your site, the value of this field will be "noodle recipes." Most hits are not the direct result of clicking in a web search engine's list page, so for most hits, this field will be empty.
- log_filename. This field is unique in that it is not derived from any other field. Instead, it is derived from the name of the log file which contains the current log entry. For instance, all entries in the log file "/var/logs/httpd/access_log" will have "/var/logs/httpd/access_log" as the value of this field (pathname format may be different for your platform; see Pathnames).
- date_time. This is the only field which may be either a real (non-derived) log field or a derived log field. In some log formats, the date and time are specified together in a single field (see Date format and Time format); in those log formats, the date/time field is a single non-derived log field. In other log formats, the date and time fields are separate; in those formats, the date/time field is derived from the date and time fields. In a database, you should use the date/time field, rather than the separate date or time fields, to take full advantage of the date/time hierarchy.
- day_of_week. This field is derived from the date/time field, which may in turn be derived from the date and time fields. It is the day of week corresponding to the date/time. For instance, a date/time of "03/Mar/1999 09:34:56" would have a day of week of "Wednesday."
- hour_of_day. This field is derived from the date/time field, which may in turn be derived from the date and time fields. It is the hour of day corresponding to the date/time. For instance, a date/time of "03/Mar/1999 09:34:56" would have an hour of day of "9:00 AM - 10:00 AM."
- day_of_year. This field is derived from the date/time field, which may in turn be derived from the date and time fields. It is the number of the day of the year. For instance, January 1 is "1", January 20 is "20", and February 10 is "41".
- week_of_year. This field is derived from the date/time field, which may in turn be derived from the date and time fields. It is the number of the week of the year. For instance, for hits on the first seven days of the year (January 1 - 7, inclusive) this field will be "1". For hits on the second week of the year (January 8 - 14) this field will be "2", and so on.
- worm. This field is derived from the page field. It shows the name of the worm for each hit, or "(not a worm)" if the hit was not a worm hit. Worms are programs that attempt to infect other computers, usually through exploiting vulnerable versions of servers like web servers. Worms are detected using the LogAnalysisInfo/Worms file, so you can edit this file if you want to add detection of additional worms.
- file_type. This field is derived from the page field. It shows the extension of the filename, e.g. HTML, HTM, GIF, PDF, etc.
- spider. This field is derived from the agent field. It shows the name of the spider for each hit, or "(not a spider)" if the hit was not a spider hit. Spiders are programs that "walk" from page to page on the Internet, reading each page and doing something with it; for instance, search engines use spiders to build their databases of Internet pages. Spiders are detected using the LogAnalysisInfo/Spiders file, so you can edit this file if you want to add detection of additional spiders.
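The arithmetic behind a few of the derived fields above can be sketched in Python. This is an illustration under stated assumptions, not Sawmill's implementation: the function names derive_date_fields and size_range are made up for this sketch, the date/time format string matches only the "03/Mar/1999 09:34:56" style used in the examples (and assumes English month abbreviations), hour_of_day is returned as a plain number rather than a label like "9:00 AM - 10:00 AM", and size_range returns numeric bounds in bytes, with the treatment of exact powers of ten being a guess.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def derive_date_fields(date_time_text, fmt="%d/%b/%Y %H:%M:%S"):
    """Compute date-based derived values from one date/time string."""
    dt = datetime.strptime(date_time_text, fmt)
    day_of_year = dt.timetuple().tm_yday  # January 1 is day 1
    return {
        "day_of_week": DAY_NAMES[dt.weekday()],
        "hour_of_day": dt.hour,
        "day_of_year": day_of_year,
        # Week 1 covers January 1 - 7, week 2 covers January 8 - 14, and so on.
        "week_of_year": (day_of_year - 1) // 7 + 1,
    }


def size_range(size_bytes):
    """Return the (low, high) power-of-ten bounds that contain size_bytes.

    Assumption: a size exactly equal to a power of ten starts a new range.
    """
    if size_bytes < 1:
        return (0, 1)
    low = 1
    while low * 10 <= size_bytes:
        low *= 10
    return (low, low * 10)


print(derive_date_fields("03/Mar/1999 09:34:56"))
print(size_range(6000))
print(size_range(17000))
```

Running the date function on other dates from the examples above, such as February 10 or January 8, is a quick way to check the day-of-year and week-of-year numbering.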
Editing Log Fields
To edit the log fields, open the profile .cfg file, within LogAnalysisInfo/profiles, using a text editor, and search for "log = {"; then search forward from there for "fields = {". Each log field is a separate bracketed group under the fields group of the log group, and each log field lists the parameters described above. Edit the field, and save the file, and your changes will take effect on the next database rebuild.
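The search described above can also be done programmatically. The following Python sketch finds "log = {", then "fields = {", and lists the names of the field groups directly inside it by counting braces. It is only a rough aid for inspecting a profile: PROFILE_TEXT is a made-up fragment modeled on the "page" node shown earlier, the function name log_field_names is an assumption, and the brace counting assumes one node per line, "#" used only for comments, and no braces inside quoted values. It is not a parser for the real profile format.

```python
# Hypothetical profile fragment, modeled on the example node in this document.
PROFILE_TEXT = """
log = {
  fields = {
    date_time = {
      type = "date_time"
      index = "1"
    } # date_time
    page = {
      type = "page"
      index = "5"
      hierarchy_dividers = "/"
    } # page
  } # fields
} # log
"""


def log_field_names(profile_text):
    """List the names of the groups directly under log -> fields."""
    log_start = profile_text.index("log = {")
    fields_start = profile_text.index("fields = {", log_start)
    body = profile_text[fields_start + len("fields = {"):]

    depth, names = 1, []  # depth 1 means directly inside the fields group
    for line in body.splitlines():
        stripped = line.split("#", 1)[0].strip()  # drop trailing comments
        if depth == 1 and stripped.endswith("= {"):
            names.append(stripped[:-len("= {")].strip())
        depth += stripped.count("{") - stripped.count("}")
        if depth == 0:  # reached the closing brace of the fields group
            break
    return names


print(log_field_names(PROFILE_TEXT))
```

A listing like this can help confirm which field node to edit before opening the file in a text editor; the edit itself, and the database rebuild afterwards, are still done as described above.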