FAQ: Are the Statistics Accurate?


I've heard that statistics like visitors, "sessions," and "paths through the site" can't be computed accurately. Is that true? Are the statistics reported by Sawmill an accurate description of the actual traffic on my site?

Short Answer

Sawmill accurately reports the data as it appears in the log file. However, many factors skew the data in the log file. The statistics are still useful, and the skew can be minimized through server configuration.

Long Answer

Sawmill (and all other log analysis tools) reports statistics based on the contents of the log files. With many types of servers, the log files accurately describe the traffic on the server (i.e. each file or page viewed by a visitor is shown in the log data), but web log files are trickier, due to the effects of caches, proxies, and dynamic IP addresses.

Caches are locations outside of the web server where previously-viewed pages or files are stored, to be accessed quickly in the future. Most web browsers have caches, so if you view a page and then return in the future, your browser will display the page without contacting the web server, so you'll see the page but the server will not log your access. Other types of caches save data for entire organizations or networks. These caches make it difficult to track traffic, because many views of pages are not logged and cannot be reported by log analysis tools.

Caches interfere with all statistics, so unless you've defeated the cache in some way (see below), your web server statistics will not represent the actual viewings of the site. The logs are, however, the best information available in this case, and the statistics are far from useless. Caching means that none of the numbers you see are accurate representations of the number of pages actually viewed, bytes transferred, etc. However, you can be reasonably sure that if your traffic doubles, your web stats will double too. Put another way, web log analysis is a very good way of determining the relative performance of your web site, both compared to other web sites and compared to itself over time. This is usually the most important thing anyway: since nobody can really measure true "hits," when you compare your hits to someone else's hits, both are affected by the caching issues, so in general you can compare them successfully.

If you really need completely accurate statistics, there are ways of defeating caches. There are headers you can send which tell the cache not to cache your pages; these usually work, but are ignored by some caches. A better solution is to add a random tag to every page, so instead of loading /index.html, visitors load /index.html?XASFKHAFIAJHDFS. That will prevent the page from getting cached anywhere down the line, which will give you completely accurate page counts (and paths through the site). For instance, if someone goes back to a page earlier in their path, it will have a different tag the second time, so it will be reloaded from the server and relogged, and your path statistics will be accurate. However, by disabling caching, you're also defeating the point of caching, which is performance optimization, so your web site will be slower if you do this. Many choose to do it anyway, at least for brief intervals, in order to get "true" statistics.
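As a rough illustration of the two approaches above, here is a minimal Python sketch. It is not part of Sawmill and not a recommended production setup. The function name add_cache_buster, the parameter name "nc", and the NO_CACHE_HEADERS dictionary are hypothetical names chosen for this example. The sketch uses a named query parameter rather than the bare tag shown above, which is an assumption made so that an existing tag can be replaced cleanly. The header values are the standard HTTP ones commonly used to discourage caching.

```python
import secrets
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Response headers that ask browsers and intermediate caches not to store a page.
# As noted above, some caches may ignore them.
NO_CACHE_HEADERS = {
    "Cache-Control": "no-store, no-cache, must-revalidate",
    "Pragma": "no-cache",
    "Expires": "0",
}


def add_cache_buster(url: str, param: str = "nc") -> str:
    """Return url with a fresh random query parameter, so each link is unique.

    "nc" is a hypothetical parameter name; any name the site does not
    already use would do.
    """
    parts = urlsplit(url)
    # Keep existing query parameters, but drop any previous cache-buster tag.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, secrets.token_hex(8)))  # 16 random hex characters
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # Two links to the same page get different tags, so neither can be
    # served from a cache entry created for the other.
    print(add_cache_buster("/index.html"))
    print(add_cache_buster("/index.html"))
```

Every link on the site would have to be generated this way for the page counts and paths to be complete, which is also why this approach costs performance.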

The other half of the problem is dynamic IP addresses and proxies. These affect the "visitor" counts in those cases where visitors are computed based on unique hosts. Normally, Sawmill assumes that each unique originating hostname or IP is a unique visitor, but this is not generally true. A single visitor can show up as multiple IP addresses if they are routed through several proxy servers, or if they disconnect and dial back in and are assigned a new IP address. Multiple visitors can also show up as a single IP address if they all use the same proxy server. Because of these factors, the visitor numbers (and the session numbers, which depend on them) are not particularly accurate unless visitor cookies are used (see below). Again, however, the visitor count is a reasonable number to use as the "best available approximation" of the visitors, and these numbers tend to go up when your traffic goes up, so they can be used as effective comparative numbers.
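The skew described above can be seen in a small Python sketch. The log entries are made up for illustration, using documentation-range IP addresses. The "person" column does not exist in a real log file and is included only to show how the unique-host count can differ from the true visitor count. The function names are hypothetical.

```python
# Hypothetical log entries: (client IP as logged, actual person behind the hit).
HITS = [
    ("203.0.113.5", "alice"),   # Alice on her first connection
    ("203.0.113.9", "alice"),   # Alice after reconnecting with a new dynamic IP
    ("198.51.100.7", "bob"),    # Bob, Carol, and Dave all share one proxy
    ("198.51.100.7", "carol"),
    ("198.51.100.7", "dave"),
]


def count_unique_hosts(entries):
    """Visitor count as a log analyzer sees it: one visitor per distinct IP."""
    return len({ip for ip, _person in entries})


def count_actual_people(entries):
    """True visitor count, which the log file alone cannot provide."""
    return len({person for _ip, person in entries})


if __name__ == "__main__":
    print("unique hosts:", count_unique_hosts(HITS))     # 3
    print("actual people:", count_actual_people(HITS))   # 4
```

Here the dynamic IP inflates the count by one and the shared proxy deflates it by two, so the two effects pull in opposite directions and only partly cancel.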

As with caching, the unique hosts issue can be solved through web server configuration. Many people use visitor cookies (a browser cookie assigned to each unique visitor, and unique to them forever) to track visitors and sessions accurately. Sawmill can be configured to use these visitor cookies as the visitor ID, by extracting the cookie using a log filter and putting it in the "visitor id" field. This isn't as foolproof as the cache-fooling method above, because some people have cookies disabled, but most have them enabled, so visitor cookies usually provide a very good approximation of the true visitors. If you get really tricky, you can configure Sawmill and/or your server to use the cookie when it's available, and the IP address when it's not (or even the true originating IP address, if the proxy passes it). Better yet, you can use the concatenation of the IP address and the user-agent field to get even closer to a unique visitor ID in cases where cookies are not available. So you can get pretty close to accurate visitor information if you really want to.
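The fallback order described above (cookie first, then originating IP, combined with the user agent) can be sketched in Python as follows. This is not Sawmill's log filter syntax, and the function and parameter names are hypothetical. It also assumes the proxy passes the originating address in an X-Forwarded-For style header, with the original client listed first; not every proxy does this.

```python
from typing import Optional


def visitor_id(
    cookie: Optional[str],
    forwarded_for: Optional[str],
    client_ip: str,
    user_agent: str,
) -> str:
    """Pick the best available visitor identifier for one log entry."""
    # 1. A visitor cookie, when present, is the most reliable identifier.
    if cookie:
        return "cookie:" + cookie

    # 2. Otherwise prefer the originating IP if the proxy passed one along
    #    (assumed here to be the first entry of a comma-separated list).
    ip = client_ip
    if forwarded_for:
        first = forwarded_for.split(",")[0].strip()
        if first:
            ip = first

    # 3. Combine the IP with the user agent, so that different browsers
    #    behind the same proxy are less likely to be merged.
    return "ip+ua:" + ip + "|" + user_agent


if __name__ == "__main__":
    print(visitor_id("abc123", None, "198.51.100.7", "Mozilla/5.0"))
    print(visitor_id(None, "203.0.113.5, 198.51.100.7", "198.51.100.7", "Mozilla/5.0"))
    print(visitor_id(None, None, "198.51.100.7", "Opera/9.80"))
```

Prefixing the result with "cookie:" or "ip+ua:" keeps the two kinds of identifier from colliding, at the cost of counting a visitor twice if they are seen both with and without a cookie.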

To summarize, with a default setup (caching allowed, no visitor cookies), Sawmill will report hits and page views based on the log data, which will not precisely represent the actual traffic to the site, and neither will the reports of any other log analysis tool. Sawmill goes further into the speculative realm than some tools by reporting visitors, sessions, and paths through the site. With some effort, your server can be configured to make these numbers fairly accurate. Even if you don't do that, however, you can still use these numbers as valuable comparative statistics, to track the growth of your site over time, or to compare one of your sites to another.