Cross-Referencing and Simultaneous Filters


Sawmill lets you "zoom in" using complex filters, for instance to break down the events on any particular day by page (in a web log, to see which pages were hit on that day), or to break down the events on any page by day (to see which days the page was accessed). Sawmill can be configured to allow this sort of cross-referencing between any or all fields in the database. This zooming ability is always available, but without cross-reference tables it must scan the entire main table to compute results, which can be slow for large datasets. Cross-reference tables provide "roll-ups" of common queries, so they can be computed quickly without reference to the main log data table.

Cross-references are not an enabling feature, as they were in earlier versions of Sawmill -- all reports are available, even if no cross-reference tables are defined. Cross-reference tables are an optimizing feature that increases the speed of certain queries.

Another way of looking at this feature is in terms of filters; when two fields are cross-referenced against each other, Sawmill is able to apply filters quickly to both fields at the same time, without needing to access the main table.

If two fields are not cross-referenced against each other, Sawmill can apply filters to one field or the other quickly, but filtering both simultaneously will require a full table scan. If the page field is not cross-referenced against the date/time field, for instance, Sawmill can quickly show the number of hits on /myfile.html, or the number of hits in Jun/2004, but not the number of hits on /myfile.html which occurred during Jun/2004 (which will require a full table scan). This means not only that Sawmill can't quickly show a page with filters applied to both fields in the Filters section, but also that Sawmill cannot quickly show a "pages" report when there is a filter on the date/time field, or a "years/months/days" or "days" report when there is a filter on the page field, since the individual items in these views effectively use simultaneous filters to compute the number of hits.
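The difference between single-field and combined roll-ups can be sketched in a few lines of Python. This is only an illustration of the idea, not Sawmill's actual storage format or API: the table contents, the variable names (main_table, hits_by_page, hits_by_month, hits_by_page_month) and the hits() helper are all hypothetical.

```python
from collections import Counter

# Hypothetical main table: one row per log event, as (page, month).
main_table = [
    ("/myfile.html", "Jun/2004"),
    ("/myfile.html", "Jun/2004"),
    ("/myfile.html", "Jul/2004"),
    ("/index.html", "Jun/2004"),
    ("/index.html", "Jul/2004"),
]

# Single-field roll-ups: each answers a filter on one field only.
hits_by_page = Counter(page for page, _ in main_table)
hits_by_month = Counter(month for _, month in main_table)

# Combined roll-up: page cross-referenced against month.
hits_by_page_month = Counter(main_table)


def hits(page, month, xref=None):
    """Count hits matching both filters.

    With a combined roll-up this is a single lookup; without one,
    every row of the main table has to be examined.
    """
    if xref is not None:
        return xref[(page, month)]
    return sum(1 for row in main_table if row == (page, month))


print(hits_by_page["/myfile.html"])                             # one-field filter
print(hits_by_month["Jun/2004"])                                # one-field filter
print(hits("/myfile.html", "Jun/2004"))                         # full table scan
print(hits("/myfile.html", "Jun/2004", hits_by_page_month))     # roll-up lookup
```

Note that the two single-field roll-ups cannot be combined to answer the two-field question: knowing the totals for the page and for the month separately says nothing about how many events match both.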

On the other hand, cross-reference tables use space in the database. The more fields you cross-reference, the larger and slower your database gets. Restricting cross-referencing only to those fields that you really need cross-referenced is a good way to limit the size of your database, and speed browsing of the statistics.
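To get a feel for why each additional cross-referenced field costs space, the following Python sketch counts the rows a roll-up would need for different field combinations. It assumes a simplified model in which a roll-up stores one row per distinct combination of field values, and a worst case in which every combination occurs; the field sizes and the xref_rows() helper are hypothetical and do not reflect Sawmill's internal layout.

```python
from itertools import product

# Hypothetical field values: 50 pages, 30 days, 10 browsers.
pages = [f"/page{i}.html" for i in range(50)]
days = [f"day{i}" for i in range(30)]
browsers = [f"browser{i}" for i in range(10)]

# Worst case for illustration: every combination occurs at least once.
events = set(product(pages, days, browsers))


def xref_rows(events, field_indexes):
    """Rows needed by a roll-up over the given fields:
    one per distinct combination of those fields' values."""
    return len({tuple(event[i] for i in field_indexes) for event in events})


rows_page = xref_rows(events, [0])            # page only
rows_page_day = xref_rows(events, [0, 1])     # page x day
rows_all = xref_rows(events, [0, 1, 2])       # page x day x browser

print(rows_page, rows_page_day, rows_all)
```

In this model the row count grows multiplicatively with each field added to a group, which is why limiting groups to the combinations you actually query keeps the database smaller.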

Cross-references are set by default for each field, but no two non-aggregating fields are included in the same cross-reference group. This is a fairly minimal use of cross-references, but for faster database builds, you can delete those as well (at a cost in query speed, when the database main table needs to be queried because there is no cross-reference group available).

Generally, you should start out with few cross-references; a default analysis is a good starting point. If you need a type of information that the existing cross-references do not cover (for instance, if you want to know the browser versions that are accessing a particular page), and the report generates too slowly, try adding the necessary cross-references. See Memory, Disk, and Time Usage for more information on optimizing your memory, disk space, and processing time.

Again, cross-references are never necessary to generate a particular report -- they just make reports faster.