You're receiving this newsletter because, when you downloaded Sawmill,
you checked the box to join our mailing list. If you wish to be removed
from this list, please send an email with the subject line
"UNSUBSCRIBE" to newsletter@sawmill.net.
News
This issue of the Sawmill Newsletter discusses a method of filtering
spider traffic out of web log reports.
We are currently shipping Sawmill 7.2.8. You can get it from
http://sawmill.net/download.html .
Tips & Techniques: Ignoring Spider Traffic
Spiders (also called robots) are computer programs on the Internet
which access
web sites as though they were web browsers, automatically reading the
contents of each page they encounter, extracting all links from that
page, and then following those links to find new pages. Each spider has
a specific purpose, which determines what it does with each page: some
spiders collect page contents for search engines; other spiders search
the web to collect images or sound clips; and so on.
Depending on your purpose for analyzing web logs, you may wish to
include the spider hits in your reports, or exclude them. You would
include them:
* If you want to show technical metrics of the server, like total
hits or total bandwidth transferred, to determine server load; or
* If you want to determine which spiders are hitting the site, how
often, and which pages they are hitting, to determine search engine
coverage.
You would exclude them:
* If you want to show only human traffic, to get a better idea of
the number of humans viewing the site.
By default, Sawmill includes spider traffic in all reports. But if you
don't want it, you can exclude it using either a log filter, or a
report filter. You would use a log filter if you never want to
see spider traffic in reports; you would use a report filter if you
sometimes want to see spider traffic, and sometimes not.
Rejecting Spider Traffic Using a Log Filter
Log filters affect the log data as it is processed by Sawmill. In this
section, we will show how to create a filter which rejects each spider
hit as it is processed, so it is never added to the Sawmill database
and therefore never appears in any report.
1. Go to the Admin page of the Sawmill web interface (the front page;
click Admin in the upper right if you're not already there).
2. Click View Config in the profiles list, next to the profile for
which you want to
reject spiders.
3. Click the Log Filters link, in the Log Data group of the left
menu.
4. Click "New Log Filter" in the upper right of the Log Filters
list.
5. Enter "Reject Spiders" in the Name field.
6. Click the Filter tab.
7. Click New Condition; choose Spider as the Log field and "is NOT
equal" as the Operator, enter "(not a spider)" as the Value, and click
OK.
8. Click New Action; choose "Reject log entry" as the Action, and
click OK.
9. Click Save and Close to save the completed Log Filter.
10. Now rebuild the database and view the reports; the spiders will be
gone.
If you wanted to show only spiders (ignoring human visitors),
you could use "is equal" in step 7, above.
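For reference, the same filter can also be written directly as a log
filter expression, rather than built through the New Condition and New
Action dialogs. The exact expression syntax is covered in the Using Log
Filters chapter of the Sawmill documentation; the sketch below assumes
the field is named "spider" and that the "ne"/"eq" string comparison
operators apply, so treat it as an illustration rather than a
copy-and-paste recipe:

    # Reject any hit whose Spider field shows it came from a spider.
    # (The field name and "ne" operator here are assumptions; see the
    # Using Log Filters chapter for the exact syntax.)
    if (spider ne "(not a spider)") then "reject";

To keep only spider hits instead, the comparison would use "eq" rather
than "ne", mirroring the "is equal" choice described above.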
Rejecting Spider Traffic Using a Report Filter
Report filters affect the reports by excluding some of the events in
the database from the report calculations. They slow down the reports
somewhat, so if you're sure you'll never want to see spider traffic, a
Log Filter is a better option (see above). But if you want to be able
to turn spider traffic on and off without rebuilding the database,
adding a Report Filter is the best choice.
Here's how:
1. Click the Filters icon at the top of the report page.
2. Click "Add New Filter Item" in the Spider section of the Filters
page.
3. Enter "(not a spider)" in the field.
4. Click OK.
5. Click Save And Close.
The report will immediately display again, this time without spider
traffic.
Note that by clicking the "is not" button in the Spider section of the
Filters page, above, you can
show only spider traffic, instead of showing only non-spider
traffic.
Defining Your Own Spiders
The file spiders.cfg, in the LogAnalysisInfo folder, contains
definitions of all spiders known to Sawmill. During log processing,
Sawmill compares the User-Agent field of each hit to the substring
values of the records in this file; if the User-Agent field contains
the substring of a spider listed in spiders.cfg, the "label" value of
that record is used as the name of the spider in the reports. You can
add your own records to spiders.cfg, if you know of spiders which are
not defined there. For instance, adding a record like the one sketched
below to spiders.cfg (on the line after the "spiders = {" line) will
cause any hit whose User-Agent contains "SomeSpider" to be counted as a
hit from a spider called "Some Spider"; "Some Spider" will then appear
in the Spiders report.
For better performance, many of the records in spiders.cfg are
commented out, with a leading # character. Removing this character will
uncomment those spiders, to allow Sawmill to recognize them (but will
also slow log processing).
Advanced Techniques
The method above will identify well-behaved spiders, but it
will not work for spiders which do not announce themselves as spiders
using their User-Agent header. It is difficult to identify these
spiders, but there are several advanced methods which get close. One
option is to look for hits on the file /robots.txt (a file hit by most
spiders), and count all future hits from those IP addresses as spider
hits. If a spider doesn't hit /robots.txt either, another possible
approach is to look for IPs which never request CSS or JS files; such
clients are fetching HTML without rendering it, a strong indication
that they are spiders. These topics are discussed in the Using Log
Filters chapter of the Sawmill documentation.
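Within Sawmill, these techniques are implemented with log filters, as
described in that chapter. Purely to illustrate the heuristics outside
Sawmill, here is a minimal standalone sketch in Python; it assumes an
Apache common/combined format log in a hypothetical file named
access.log, and flags IPs that requested /robots.txt or that never
requested any CSS or JS file:

    import re

    # Matches the client IP and the requested path in a common/combined
    # format line, e.g.: 1.2.3.4 - - [date] "GET /page.html HTTP/1.1" ...
    LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "\S+ (\S+)')

    def likely_spider_ips(log_path):
        robots_ips = set()   # IPs that requested /robots.txt
        asset_ips = set()    # IPs that requested CSS or JS (likely browsers)
        all_ips = set()
        with open(log_path) as f:
            for line in f:
                m = LOG_LINE.match(line)
                if not m:
                    continue
                ip, path = m.group(1), m.group(2)
                all_ips.add(ip)
                bare_path = path.split("?")[0]
                if bare_path == "/robots.txt":
                    robots_ips.add(ip)
                elif bare_path.endswith((".css", ".js")):
                    asset_ips.add(ip)
        # Heuristic: an IP is a likely spider if it fetched /robots.txt,
        # or if it never fetched any page assets.
        return robots_ips | (all_ips - asset_ips)

    if __name__ == "__main__":
        for ip in sorted(likely_spider_ips("access.log")):
            print(ip)

A real deployment would refine this (for example, ignoring IPs that
requested only images), but it shows the shape of both heuristics.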