{= include("docs.util"); start_docs_page(docs.technical_manual.page_titles.newsletters); =}
![]() |
Sawmill Newsletter August 15, 2008 |
Sawmill Enterprise supports the use of a MySQL server as the
back-end database (this is available only with Enterprise--Sawmill
Professional and Sawmill Lite cannot use MySQL). [Note: Sawmill has
been certified against the commercial version of MySQL
server, and it is recommended that Sawmill users purchase the full
MySQL Server, rather
than using the "community" database server.] Sawmill's own built-in
"internal" database is faster than MySQL, and is therefore generally a
better choice if reports are to be generated by Sawmill. But MySQL has
several major advantages over the internal database, including the
ability to run SQL queries. This newsletter discusses the use of SQL to
extract information from a MySQL database which was created by Sawmill.
The techniques described in this newsletter can be used to run
arbitrary SQL queries against log data in any of the 800+ log formats
Sawmill supports; other log formats can be supported through custom log
format plug-ins.
Creating A MySQL Profile And Database
To create a Sawmill profile which uses MySQL as its back-end database, create a profile just as you would for the internal database, but on the Database Options page of the Create Profile Wizard, select "Use MySQL database" (again, this is available only in Sawmill Enterprise). Then enter the hostname, username, and password of your MySQL database server. If you're using a socket to communicate with MySQL, and the socket is not in the default location, you can also enter its pathname on this page. Finally, if you want the name of the database (schema) to be different from the name of the profile, you can enter a different database name on this page:
Querying The Main Table And The Itemnum Tables
The main table of the database is called "logfile," and contains one
row for each event in the log data. In this example, we are analyzing a
small 5000-line Apache log file, so each line of log data is a separate
event; so the resulting logfile contains 5000 rows (all examples below
are captures from the "mysql" command-line program. If your mail client
does not support the HTML tags used to render this in a mono-spaced
font, these may format poorly; changing the font to a mono-spaced font
like Courier or Monaco will show them better):
mysql> select count(*) from logfile; +----------+ | count(*) | +----------+ | 5000 | +----------+ |
mysql> select * from logfile limit 1; +-----------+---------------------+-----------------+-------------+-------------+----------+------+-----------+------+-------------------+--------------+----------+--------------------+----------+--------------+------+--------+----------+----------------------+---------------+---------------+-------------+------------------+--------+---------------+--------------------+-----------------+------+------------+---------+-------+--------+--------------+------------------+----------+------+ | loadorder | date_time | bottomleveldate | day_of_week | hour_of_day | hit_type | page | file_type | worm | screen_dimensions | screen_depth | hostname | domain_description | location | organization | isp | domain | referrer | referrer_description | search_engine | search_phrase | web_browser | operating_system | spider | server_domain | authenticated_user | server_response | hits | page_views | spiders | worms | errors | broken_links | screen_info_hits | visitors | size | +-----------+---------------------+-----------------+-------------+-------------+----------+------+-----------+------+-------------------+--------------+----------+--------------------+----------+--------------+------+--------+----------+----------------------+---------------+---------------+-------------+------------------+--------+---------------+--------------------+-----------------+------+------------+---------+-------+--------+--------------+------------------+----------+------+ | 1 | 1998-04-07 16:53:06 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 734 | +-----------+---------------------+-----------------+-------------+-------------+----------+------+-----------+------+-------------------+--------------+----------+--------------------+----------+--------------+------+--------+----------+----------------------+---------------+---------------+-------------+------------------+--------+---------------+--------------------+-----------------+------+------------+---------+-------+--------+--------------+------------------+----------+------+ |
mysql> select * from hostnameitemnum where itemnum = 2; +---------+----------------+ | itemnum | hostname | +---------+----------------+ | 2 | 140.177.203.25 | +---------+----------------+ |
By joining one or more itemnum tables to the main table, you can get
results similar to those shown in Sawmill's own reports. For instance,
let's generate a "top 10 hostnames" report. We can do that by
selecting/summing logfile, grouping on hostname, joining in
hostnameitemnum to get real hostnames in the result (rather than
normalized hostname itemnums), ordering by hits descending, and
limiting to the top 10:
mysql> select i.hostname, sum(hits) as hits, sum(page_views) as page_views, sum(size) as size from logfile l left join hostnameitemnum i on l.hostname = i.itemnum group by hostname order by hits desc limit 10; +-----------------------------+------+------------+----------+ | hostname | hits | page_views | size | +-----------------------------+------+------------+----------+ | lipowitz.isdn.uiuc.edu | 692 | 236 | 6348398 | | 192.17.19.150 | 466 | 46 | 6363739 | | 192.17.19.148 | 317 | 136 | 24279573 | | pale.kai.com | 308 | 70 | 5289858 | | flowerfire.isdn.uiuc.edu | 242 | 52 | 2337963 | | spider.unh.edu | 171 | 19 | 433709 | | 195.101.37.244 | 87 | 21 | 575602 | | 206.148.222.50 | 79 | 25 | 876475 | | isdn-5.nii.enterconnect.net | 79 | 16 | 579010 | | gli2302.ctea.com | 75 | 9 | 436318 | +-----------------------------+------+------------+----------+ |
Any of Sawmill's standard table reports can be generated similarly using SQL.
Filtering By Itemnum
Now suppose we want to filter the result. Any type of filter is possible using a WHERE clause in the SQL query. In this case, we'll filter the report above to show only .com hostnames. By including "where i.hostname like '%.com'", the query now selects only those rows from logfile which contain hostname values ending with .com, when de-normalized. So, the resulting list contains all .com addresses:
mysql> select i.hostname, sum(hits) as hits, sum(page_views) as page_views, sum(size) as size from logfile l left join hostnameitemnum i on l.hostname = i.itemnum where i.hostname like '%.com' group by hostname order by hits desc limit 10; +--------------------------------+------+------------+---------+ | hostname | hits | page_views | size | +--------------------------------+------+------------+---------+ | pale.kai.com | 308 | 70 | 5289858 | | gli2302.ctea.com | 75 | 9 | 436318 | | wat.thedj.com | 57 | 22 | 767059 | | tosainu.trimark.com | 45 | 12 | 341875 | | gianduia.compecon.com | 39 | 6 | 206124 | | ip252.ts4.phx.inficad.com | 34 | 4 | 160689 | | h-205-217-240-156.netscape.com | 33 | 4 | 173806 | | clarendon.weblogic.com | 32 | 4 | 151206 | | gw.vixel.com | 30 | 4 | 115755 | | cx51617-a.dnpt1.occa.home.com | 30 | 4 | 150578 | +--------------------------------+------+------------+---------+ |
The filter doesn't have to work on only the primary field of the query (hostname); it can work on any fields in logfile, or the joined fields of any other table. MySQL supports many joins in a single query, so we can join in one table for the main column (hostname), and also join additional tables for filtering. For instance, here are the hostnames which accessed GIF images (the hits column shows how many GIF accesses each hostname had; the size column shows how many GIF bytes each hostname transferred):
mysql> select i.hostname, sum(hits) as hits, sum(page_views) as page_views, sum(size) as size from logfile l left join hostnameitemnum i on l.hostname = i.itemnum left join file_typeitemnum fi on l.file_type = fi.itemnum where fi.file_type = 'GIF' group by hostname order by hits desc limit 10; +-----------------------------+------+------------+--------+ | hostname | hits | page_views | size | +-----------------------------+------+------------+--------+ | lipowitz.isdn.uiuc.edu | 432 | 0 | 447331 | | 192.17.19.150 | 420 | 0 | 209925 | | pale.kai.com | 235 | 0 | 144966 | | 192.17.19.148 | 161 | 0 | 322118 | | spider.unh.edu | 152 | 0 | 129566 | | flowerfire.isdn.uiuc.edu | 151 | 0 | 433550 | | gli2302.ctea.com | 66 | 0 | 56660 | | 195.101.37.244 | 65 | 0 | 87507 | | isdn-5.nii.enterconnect.net | 63 | 0 | 24508 | | 206.148.222.50 | 52 | 0 | 74631 | +-----------------------------+------+------------+--------+ 10 rows in set (0.03 sec) |