Hierarchies and Fields


Sawmill can break down log information by many fields. For instance, if it is processing a web server access log, it can break it down by page, date, host, and many others. Each field type is organized hierarchically, in a tree-like structure of items and subitems. For example, a section of a typical "page" hierarchy might look like this:

 
                          / (the root)
                       /                \
                 /dir1/                 /dir2/
                /      \                      \
  /dir1/file1.html   /dir1/file2.html      /dir2/dir3/
                                          /          \
                         /dir2/dir3/file3.html   /dir2/dir3/file4.html

This is hierarchical; some items are above others because they contain them. For instance, a directory is above all of its subdirectories and all of the files in it.

This hierarchical structure allows you to "zoom" into your log data one level at a time. Every report you see in a web log analysis corresponds to one of the items above. Initially, Sawmill shows the page corresponding to /, which is the top-level directory (the "root" of the hierarchical "tree"). This page will show you all subpages of /; you will see /dir1/ and /dir2/. Clicking on either of those items ("subitems" of the current page) will show you a new page corresponding to the subitem you clicked. For instance, clicking on /dir2/ will show you a new page corresponding to /dir2/; it will contain /dir2/dir3/, the single subitem of /dir2/. Clicking on /dir2/dir3/ will show a page corresponding to /dir2/dir3/, containing the subitems of /dir2/dir3/: /dir2/dir3/file3.html and /dir2/dir3/file4.html.

Sawmill shows the number of page views (and/or bytes transferred, and/or visitors) for each item it displays. For instance, next to /dir2/ on a statistics page, you might see 1500, indicating that there were 1500 page views on /dir2/ or its subitems. That is, the page views on /dir2/, /dir2/dir3/, /dir2/dir3/file4.html, and /dir2/dir3/file3.html sum to 1500. That could be caused by 1500 page views on /dir2/dir3/file4.html and no page views anywhere else, or by 1000 page views on /dir2/dir3/file3.html and 500 page views directly on /dir2/, or some other combination. To see exactly which pages were hit to create those 1500 page views, you can zoom in by clicking /dir2/.
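The way a number shown next to an item includes its subitems can be illustrated with a short Python sketch. This is only an illustration of the arithmetic described above, not Sawmill's actual implementation; the function names ancestors and roll_up are hypothetical.

```python
from collections import Counter

def ancestors(page):
    """Return the page itself plus every enclosing directory, up to the root "/"."""
    items = [page]
    # Strip one path component at a time:
    # /dir2/dir3/file3.html -> /dir2/dir3/ -> /dir2/ -> /
    while page != "/":
        page = page.rstrip("/").rsplit("/", 1)[0] + "/"
        items.append(page)
    return items

def roll_up(direct_views):
    """Add each page's direct views to the page and to all items that enclose it."""
    totals = Counter()
    for page, views in direct_views.items():
        for item in ancestors(page):
            totals[item] += views
    return totals

# One of the combinations from the text: 1000 views on file3.html
# and 500 views directly on /dir2/.
totals = roll_up({"/dir2/dir3/file3.html": 1000, "/dir2/": 500})
print(totals["/dir2/"])       # 1500
print(totals["/dir2/dir3/"])  # 1000
```

The total for /dir2/ is 1500 regardless of how the views are distributed among its subitems, which is why zooming in is needed to see the breakdown.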

There are many other hierarchies besides the page hierarchy described above. For instance, there is the date/time hierarchy, which might look like this:

 
                       (the root)
                    /             \
                2003               2004
               /    \              /   \
        Nov/2003  Dec/2003   Jan/2004  Feb/2004

The date/time hierarchy continues downward similarly, with days as subitems of months, hours of days, minutes of hours, and seconds of minutes. Other hierarchies include the URL hierarchy (similar to the page hierarchy, with http:// below the root, http://www.flowerfire.com/ below http://, etc.), the hostname hierarchy (.com below the root, flowerfire.com below .com, www.flowerfire.com below flowerfire.com, etc.), and the location hierarchy (countries/regions/cities).

Some terminology: the very top of a hierarchy is called the "root" (e.g. "(the root)" in the date/time hierarchy above, or "/" in the page hierarchy). An item below another item in the hierarchy is called a "subitem" of that item (e.g. 2004 is a subitem of the root, and /dir2/dir3/file4.html is a subitem of /dir2/dir3/). Any item at the very bottom of a hierarchy (an item with no subitems) is called a "leaf" or a "bottom level item" (e.g. Jan/2004 and /dir1/file1.html are leaves in the above hierarchies). An item which has subitems, but is not the root, is called an "interior node" (e.g. 2004 and /dir2/dir3/ are interior nodes in the above hierarchies).
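As a rough illustration of these terms, the following Python sketch stores the date/time hierarchy shown earlier as a mapping from each item to its subitems and labels an item as root, interior node, or leaf. The hierarchy dictionary and the classify function are hypothetical names used only for this example.

```python
# The date/time hierarchy from the diagram, as a mapping of item -> subitems.
hierarchy = {
    "(the root)": ["2003", "2004"],
    "2003": ["Nov/2003", "Dec/2003"],
    "2004": ["Jan/2004", "Feb/2004"],
}

def classify(item, root="(the root)"):
    """Return the term used in the text for an item's position in the hierarchy."""
    if item == root:
        return "root"
    if hierarchy.get(item):  # the item has at least one subitem
        return "interior node"
    return "leaf"

print(classify("(the root)"))  # root
print(classify("2004"))        # interior node
print(classify("Jan/2004"))    # leaf
```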

To save database size, processing time, and browsing time, it is often useful to "prune" the hierarchies. This is done using the Suppress Levels Below and Suppress Levels Above parameters of the database field options. Levels are numbered from 0 (the root) downward; for instance 2003 is at level 1 above, and /dir2/dir3/file4.html is at level 3. Sawmill omits all items from the database hierarchy whose level number is greater than the Suppress value. For instance, with a Suppress value of 1 for the date/time hierarchy above, Sawmill would omit all items at levels 2 and below, resulting in this simplified hierarchy in the database:

 
                    (the root)
                    /        \
                2003          2004

Using this hierarchy instead of the original saves space and time, but makes it impossible to get date/time information at the month level; you won't be able to click on 2003 to get month information. Sawmill also omits all items from the hierarchy whose level number is less than or equal to the Collapse value (except the root, which is always present). For instance, with a Collapse value of 1 (and Suppress of 2), Sawmill would omit all level 1 items, resulting in this hierarchy:

 
                       (the root)
                 /       |    |        \
        Nov/2003   Dec/2003  Jan/2004  Feb/2004

All four of the level 2 items are now direct subitems of the root, so the statistics page for this hierarchy will show all four months. This is useful not just because it saves time and space, but also because it combines information on a single page that otherwise would have taken several clicks to access.

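Here's an example of "Suppress Levels Above" and "Suppress Levels Below," based on the page field value /dir1/dir2/dir3/page.html. With above=0 and below=999999, this will be marked as a single hit on the following items:

  level 0: /
  level 1: /dir1/
  level 2: /dir1/dir2/
  level 3: /dir1/dir2/dir3/
  level 4: /dir1/dir2/dir3/page.html

With above=2, below=999999, all levels above 2 are omitted (the root level, 0, is always included):

  level 0: /
  level 2: /dir1/dir2/
  level 3: /dir1/dir2/dir3/
  level 4: /dir1/dir2/dir3/page.html

With above=0, below=2, all levels below 2 are omitted:

  level 0: /
  level 1: /dir1/
  level 2: /dir1/dir2/

With above=2, below=3, all levels above 2 are omitted (except 0), and all levels below 3 are omitted:

  level 0: /
  level 2: /dir1/dir2/
  level 3: /dir1/dir2/dir3/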

In the last example, zooming in on / will show you /dir1/dir2/ (you will never see /dir1/ in the statistics, because level 1 has been omitted); zooming in on that will show you /dir1/dir2/dir3/, and you will not be able to zoom any further, because level 4 has been omitted.
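The level arithmetic in this example can be sketched in a few lines of Python. The sketch assumes the behavior described for /dir1/dir2/dir3/page.html: the root is always kept, and a level n is kept when it lies between the above and below values, inclusive. The functions levels and prune are hypothetical and are not part of Sawmill.

```python
def levels(page):
    """Split a page value into its hierarchy items, indexed by level (0 is the root)."""
    parts = [p for p in page.split("/") if p]
    items = ["/"]
    path = "/"
    for i, part in enumerate(parts):
        is_last = i == len(parts) - 1
        # Directories keep a trailing slash; a final file name does not.
        path += part if is_last and not page.endswith("/") else part + "/"
        items.append(path)
    return items

def prune(page, above=0, below=999999):
    """Keep the root plus every level n with above <= n <= below (an assumption)."""
    return [
        (n, item)
        for n, item in enumerate(levels(page))
        if n == 0 or above <= n <= below
    ]

for n, item in prune("/dir1/dir2/dir3/page.html", above=2, below=3):
    print(n, item)
# 0 /
# 2 /dir1/dir2/
# 3 /dir1/dir2/dir3/
```

With the defaults (above=0, below=999999), prune returns all five levels, from / down to /dir1/dir2/dir3/page.html.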

On a side note, the "show only bottom level items" option in the Table Options menu provides a dynamic way of doing roughly the same thing as using a high value of Collapse. Using the Options menu to show only bottom level items will dynamically restructure the hierarchy to omit all interior nodes. In the case of the page hierarchy above, that would result in the following hierarchy:

 
                               / (the root)
                /               |        |                  \
  /dir1/file1.html  /dir1/file2.html  /dir2/dir3/file3.html  /dir2/dir3/file4.html

This hierarchy has all leaf items as direct subitems of the root, so the statistics page for this hierarchy will show all four pages. This is much faster than using Suppress because after Suppress has been modified, the entire database must be rebuilt to reflect the change.

The Database Structure Options provide a couple of other ways of pruning your hierarchies. The "Always include bottom-level items" option forces Sawmill to include the bottom-level items in the hierarchy, regardless of the setting of the Suppress value. It is useful to include the bottom-level items if you need them for some feature of Sawmill (for instance, visitor information requires that all the bottom-level hosts be present, and session information is most meaningful if all the bottom-level date/times are present), but you want to prune some of the interior of the hierarchy to save space.

Hierarchies can be either left-to-right (with the left part of the item "enclosing" the right part, as in /dir1/dir2/file.html, where /dir1/ encloses /dir1/dir2/, which encloses /dir1/dir2/file.html), or right-to-left (with the right part of the item enclosing the left part, as in www.flowerfire.com, where .com encloses flowerfire.com, which encloses www.flowerfire.com). Hierarchies use one or more special characters to divide levels (e.g. / in the page hierarchy or . in the host hierarchy). These options and more can be set in the log field options.
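The difference between the two directions can be shown with a small Python sketch that lists the items enclosing a value, outermost first. The function enclosing_items and its parameters are hypothetical and do not correspond to the actual log field options; the sketch also assumes that left-to-right values begin with the divider, as page values do.

```python
def enclosing_items(value, divider, left_to_right=True):
    """List the items that enclose a value, outermost first, ending with the value itself."""
    parts = [p for p in value.split(divider) if p]
    items = []
    for n in range(1, len(parts) + 1):
        if left_to_right:
            # The left part encloses the right part: /dir1/ encloses /dir1/dir2/.
            item = divider + divider.join(parts[:n])
            if n < len(parts) or value.endswith(divider):
                item += divider
        else:
            # The right part encloses the left part: .com encloses flowerfire.com.
            item = divider.join(parts[-n:])
            if n == 1:
                item = divider + item
        items.append(item)
    return items

print(enclosing_items("/dir1/dir2/file.html", "/"))
# ['/dir1/', '/dir1/dir2/', '/dir1/dir2/file.html']
print(enclosing_items("www.flowerfire.com", ".", left_to_right=False))
# ['.com', 'flowerfire.com', 'www.flowerfire.com']
```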

Some log fields are not hierarchical: for instance, the operation field of a web log (which can be GET, POST, or others), or integer fields like the size or response fields. These fields are specified in the log field options as non-hierarchical, or "flat", and all of their items appear directly below the root.