


Sawmill Newsletter

  January 15, 2007



Welcome to the Sawmill Newsletter!


You're receiving this newsletter because you checked the box to join our mailing list when you downloaded Sawmill. If you wish to be removed from this list, please send an email, with the subject line of "UNSUBSCRIBE", to newsletter@sawmill.net .


News

This issue of the Sawmill Newsletter discusses methods of filtering spider traffic out of web log reports.

We are currently shipping Sawmill 7.2.8. You can get it from http://sawmill.net/download.html .



Tips & Techniques:
Ignoring Spider Traffic

Spiders (also called robots) are computer programs on the Internet which access web sites as though they were web browsers, automatically reading the contents of each page they encounter, extracting all links from that page, and then following those links to find new pages. Each spider has a specific purpose, which determines what it does with each page: some spiders collect page contents for search engines; other spiders search the web to collect images or sound clips, etc.

Depending on your purpose for analyzing web logs, you may wish to include the spider hits in your reports, or exclude them. You would include them:

  * If you want to show technical metrics of the server, like total hits or total bandwidth transferred, to determine server load; or
  * If you want to see which spiders are hitting the site, how often, and which pages they are hitting, to determine search engine coverage.

You would exclude them:

  * If you want to show only human traffic, to get a better idea of the number of humans viewing the site.

By default, Sawmill includes spider traffic in all reports. But if you don't want it, you can exclude it using either a log filter, or a report filter. You would use a log filter if you never want to see spider traffic in reports; you would use a report filter if you sometimes want to see spider traffic, and sometimes not.


Rejecting Spider Traffic Using a Log Filter

Log filters affect the log data as it is processed by Sawmill. In this section, we will show how to create a filter which rejects each spider hit as it is processed, so it is never added to the Sawmill database and therefore never appears in any report.

  1. Go to the Admin page of the Sawmill web interface (the front page; click Admin in the upper right if you're not already there).

  2. Click View Config in the profiles list, next to the profile for which you want to reject spiders.

  3. Click the Log Filters link, in the Log Data group of the left menu.

  4. Click "New Log Filter" in the upper right of the Log Filters list.

  5. Enter "Reject Spiders" in the Name field.

  6. Click the Filter tab.

  7. Click New Condition; choose Spider as the Log field and "is NOT equal" as the Operator, enter "(not a spider)" as the Value, and click OK:

        [Screenshot: New Condition window]


  8. Click New Action; choose "Reject log entry" as the Action, and click OK:

       [Screenshot: Action window]


  9. Click Save and Close to save the completed Log Filter:


        [Screenshot: the completed Reject Spiders filter]


  10. Now rebuild the database and view the reports; the spider traffic will be gone.

If you wanted to show only spiders (ignoring human visitors), you could use "is equal" in step 7, above.
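
To make the effect of this filter concrete, here is a minimal Python sketch (an illustration only, not Sawmill's internal code; the "spider" field name and sample values are hypothetical) of what "reject the entry unless Spider is (not a spider)" means during log processing:

  # Each parsed log entry is represented here as a dict with a "spider"
  # field, which is "(not a spider)" for ordinary visitors.
  def accept_entry(entry):
      # The "Reject Spiders" filter keeps an entry only if it is NOT a spider.
      return entry["spider"] == "(not a spider)"

  entries = [
      {"spider": "(not a spider)"},   # a human visitor
      {"spider": "Googlebot"},        # a spider hit
  ]

  kept = [e for e in entries if accept_entry(e)]
  print(kept)   # only the human visitor remains; the spider hit is never
                # added to the database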


Rejecting Spider Traffic Using a Report Filter

Report filters affect the reports, by excluding some events in the database from affecting the reports. They slow down the reports somewhat, so if you're sure you'll never want to see spider traffic, a Log Filter is a better option (see above). But if you want to be able to turn spider traffic on and off without rebuilding the database, adding a Report Filter is the best choice.

Here's how:

  1. Click the Filters icon at the top of the report page.

  2. Click "Add New Filter Item" in the Spider section of the Filters page.

  3. Enter "(not a spider)" in the field:

        [Screenshot: Report Filter window]


  4. Click OK.

  5. Click Save And Close.

The report will immediately display again, this time without spider traffic.

Note that by clicking the "is not" button in the Spider section of the Filters page, above, you can show only spider traffic, instead of showing only non-spider traffic.
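
The difference from a log filter can be sketched the same way: with a report filter, every hit stays in the database, and the Spider condition is applied only when a report is generated, so it can be switched on and off at will. Here is a minimal Python illustration (not Sawmill's implementation; the field names and sample data are made up):

  # All hits, spider and human alike, remain stored in the database.
  database = [
      {"page": "/index.html", "spider": "(not a spider)"},
      {"page": "/index.html", "spider": "Googlebot"},
      {"page": "/about.html", "spider": "(not a spider)"},
  ]

  def page_views(rows, spider_value=None):
      # Count hits per page, optionally keeping only rows whose Spider
      # field matches spider_value (the report filter).
      counts = {}
      for row in rows:
          if spider_value is not None and row["spider"] != spider_value:
              continue
          counts[row["page"]] = counts.get(row["page"], 0) + 1
      return counts

  print(page_views(database))                                 # spiders included
  print(page_views(database, spider_value="(not a spider)"))  # spiders excluded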


Defining Your Own Spiders

The file spiders.cfg, in the LogAnalysisInfo folder, contains definitions of all spiders known to Sawmill. During log processing, Sawmill compares the User-Agent field of each hit to the substring values of the records in this file; if the User-Agent field contains the substring of a spider listed in spiders.cfg, the "label" value of that record is used as the name of the spider in the reports. You can add your own records to spiders.cfg, if you know of spiders which are not defined there. For instance, adding this to spiders.cfg:


  somespider = {
    label = "Some Spider"
    substring = "SomeSpider"
  }


(on the line after the "spiders = {" line) will cause any hit where the User-Agent contains "SomeSpider" to be counted as a hit from a spider called "Some Spider"; "Some Spider" will appear in the Spiders report.
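
The matching rule itself is simple substring matching. Here is a small Python sketch of the idea (illustration only; it assumes a first-match lookup, which may not be exactly how Sawmill orders its checks):

  # Two records in the style of spiders.cfg, keyed by record name.
  spiders = {
      "googlebot":  {"label": "Googlebot",   "substring": "Googlebot"},
      "somespider": {"label": "Some Spider", "substring": "SomeSpider"},
  }

  def classify(user_agent):
      # If the User-Agent contains a record's substring, use that record's
      # label as the spider name; otherwise the hit is not a spider.
      for record in spiders.values():
          if record["substring"] in user_agent:
              return record["label"]
      return "(not a spider)"

  print(classify("Mozilla/5.0 (compatible; SomeSpider/1.0)"))  # Some Spider
  print(classify("Mozilla/5.0 (Windows NT 5.1; rv:1.8)"))      # (not a spider)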

For better performance, many of the records in spiders.cfg are commented out, with a leading # character. Removing this character will uncomment those spiders, to allow Sawmill to recognize them (but will also slow log processing).


Advanced Techniques

The method above will identify well-behaved spiders, but it will not work for spiders which do not announce themselves as spiders in their User-Agent header. It is difficult to identify these spiders, but there are several advanced methods which come close. One option is to look for hits on the file /robots.txt (a file requested by most spiders), and count all subsequent hits from those IP addresses as spider hits. If the spider doesn't even hit /robots.txt, another possible approach is to look for IPs which never request CSS or JS files; this suggests they are fetching HTML without rendering it, a strong indication that they are spiders. These topics are discussed in the Using Log Filters chapter of the Sawmill documentation.
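
As a rough illustration of both heuristics, here is a Python sketch over a toy list of parsed hits (an illustration of the idea only, not a ready-made Sawmill log filter):

  # Toy data: each hit is an IP address and the page it requested.
  hits = [
      {"ip": "10.0.0.1", "page": "/robots.txt"},
      {"ip": "10.0.0.1", "page": "/index.html"},
      {"ip": "10.0.0.2", "page": "/index.html"},
      {"ip": "10.0.0.2", "page": "/style.css"},
      {"ip": "10.0.0.3", "page": "/index.html"},
  ]

  # Heuristic 1: any IP that requests /robots.txt is treated as a spider.
  robots_ips = {h["ip"] for h in hits if h["page"] == "/robots.txt"}

  # Heuristic 2: IPs that never fetch CSS or JS files are probably not
  # rendering pages, which suggests an automated client.
  rendering_ips = {h["ip"] for h in hits if h["page"].endswith((".css", ".js"))}
  all_ips = {h["ip"] for h in hits}
  non_rendering_ips = all_ips - rendering_ips

  print(robots_ips)                      # contains 10.0.0.1
  print(non_rendering_ips)               # contains 10.0.0.1 and 10.0.0.3
  print(robots_ips | non_rendering_ips)  # IPs flagged by either heuristic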




[Article revision v1.4]