Querying the SQL Database Directly

Sawmill can use a SQL database as its back-end database, storing the parsed data in that database, and querying the database to generate its reports. You can query that database directly too, from an external program or script. This chapter provides an introduction to the structure of the Sawmill SQL database, to help you get started with your own queries.

The main table of the database is called main_table. This table contains one line per accepted database entry, and one column per database field. For instance, if you have 1 million lines of web log data, this table will have 1 million rows, with one column for each web log field (date/time, hostname, page, etc.).

Each value in the main_table table is a number. In the case of numerical fields (like hits, events, bandwidth, duration, etc.), the value will be the value of that field for the corresponding entry; for instance, a 1234-byte web log transfer would have 1234 in the bytes field.

The non-numerical fields of main_table are normalized, so for each field fieldname, there is a table called fieldnameitemnum (the field name, plus itemnum), which lists all the item numbers and item values for that field. The values in the main_table table correspond to the itemnums in the itemnums table. For instance, if "GIF" is itemnum 3 in the file_typeitemnum table, then a value of 3 in the file_type column of the main_table table indicates that the file type is "GIF".

By joining main_table to one or more itemnum tables, almost all standard queries can be performed. For instance, this query:

  select count(page_views) as page_views, i.hostname
    from main_table l
    left join hostnameitemnum i on l.hostname = i.itemnum
    group by i.hostname
    order by page_views desc limit 10

generates a "top ten" report for the hostname field, showing page views per hostname, sorted descending. You can replace hostname with the internal name of any other field to get that top-ten report. You can change "limit 10" to "limit N" to show N rows, or remove it to show all rows. You can change page_views to the internal name of any other numerical database field to show that instead.

Any query can be executed from the mysql command line like this:

  echo "query" | mysql -u username -h hostname -ppassword databasename > queryresult.tsv

This will run the query query, and generate the result of the query in the file queryresult.tsv.

The fieldnamesubitem tables describes the hierarchy of non-numerical fields by listing the direct descendents of each item. You might use this to get all months in a particular year, or all files and directories in a particular directory, or all cities in a particular state. Each row describes one item/subitem pair (using normalized numbers derived from the itemnum tables, as above). Item number 1 is the root (and the corresponding item is the empty string).

The fieldnamebottomlevelitem tables provide immediate mapping from an item to all its bottom-level items. For instance, you could use this to get all days in a particular year. Each row describes one item/bottomlevelitem pair (using normalized numbers derived from the itemnum tables, as above).

The xref tables provide fast access to common reports; they precompute the aggregated numerical values for field values. For instance, a cross-reference table for the file_type field would have one row for each file type, with the totals for that field, e.g., file_type=GIF, bytes=11235567. This table can be queried directly (much faster than querying main_table) to get "top ten" reports for any field, or combination of fields, which have xref tables. When there are "unique" numerical fields (e.g., visitors, for web logs), they can be computed from the tables with names like xref0visitors; this table lists the unique IPs (visitor IPs) for each itemnum combination which appears in the corresponding xref0 table.