Querying the SQL Database Directly
Sawmill can use a SQL database as its back-end database, storing the parsed data in that database, and querying the database to generate its reports. You can query that database directly too, from an external program or script. This chapter provides an introduction to the structure of the Sawmill SQL database, to help you get started with your own queries.
The main table of the database is called main_table
. This table contains one line per accepted database entry,
and one column per database field. For instance, if you have 1 million lines of web log data, this table will have 1 million
rows, with one column for each web log field (date/time, hostname, page, etc.).
Each value in the main_table
table is a number. In the
case of numerical fields (like hits, events, bandwidth, duration, etc.), the value will be the value of that field for the
corresponding entry; for instance, a 1234-byte web log transfer would have 1234 in the bytes field.
The non-numerical fields of main_table
are normalized, so for each field fieldname
, there is a table called fieldnameitemnum
(the field name,
plus itemnum
), which lists all the item numbers and item values for that field. The values in the main_table
table correspond to the itemnums in the itemnums table. For instance, if "GIF" is itemnum 3 in the file_typeitemnum
table, then a value of 3 in the file_type
column of the main_table
table indicates that the file type is
"GIF".
By joining main_table
to one or more itemnum tables, almost all standard queries can be performed. For instance,
this query:
select count(page_views) as page_views, i.hostname from main_table l left join hostnameitemnum i on l.hostname = i.itemnum group by i.hostname order by page_views desc limit 10
generates a "top ten" report for the hostname field, showing page views per hostname, sorted descending. You can replace hostname with the internal name of any other field to get that top-ten report. You can change "limit 10" to "limit N" to show N rows, or remove it to show all rows. You can change page_views to the internal name of any other numerical database field to show that instead.
Any query can be executed from the mysql command line like this:
echo "query" | mysql -u username -h hostname -ppassword databasename > queryresult.tsv
This will run the query query, and generate the result of the query in the file queryresult.tsv.
The fieldnamesubitem
tables describes the hierarchy of non-numerical fields by listing the direct descendents of each
item. You might use this to get all months in a particular year, or all files and directories in a particular directory,
or all cities in a particular state. Each row describes
one item/subitem pair (using normalized numbers derived from the itemnum tables, as above). Item number 1 is the root
(and the corresponding item is the empty string).
The fieldnamebottomlevelitem
tables provide immediate mapping from an item to all its bottom-level items.
For instance, you could use this to get all days in a particular year. Each row describes one item/bottomlevelitem
pair (using normalized numbers derived from the itemnum tables, as above).
The xref
tables provide fast access to common reports; they precompute the aggregated numerical values for
field values. For instance, a cross-reference table for the file_type field would have one row for each file type,
with the totals for that field, e.g., file_type=GIF, bytes=11235567. This table can be queried directly (much faster
than querying main_table
) to get "top ten" reports for any field, or combination of fields, which have
xref tables. When there are "unique" numerical fields (e.g., visitors, for web logs), they can be computed from the tables
with names like xref0visitors
; this table lists the unique IPs (visitor IPs) for each itemnum combination
which appears in the corresponding xref0
table.