Sawmill Discussion Forum

Subject: "Keep spiders out of Page Views" Archived thread - Read only
 
StupidScript
Member since Oct-31-03
10 posts
Nov-03-03, 03:07 PM (PDT)
"Keep spiders out of Page Views"
 
   (See previous note in "Worm or spider inflating page views")

I have tried using a Log Filter to detect a spider's agent with a regexp {e.g.: agent regexp ^.*(crawler|robot|ia_archiver|etc).*$} and instructing Sawmill to "Reject" matching entries. I also tried the same thing with "Count as hit" (not page view) instead of "Reject". Neither kept spider entries out of the top 50 or so entries in "Page Views".
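
(Side note for anyone trying this: you can sanity-check the pattern against the raw log with egrep before blaming the filters. Something like this, assuming a plain-text access log:

# count the lines whose agent string matches the spider pattern
egrep -ic 'crawler|robot|ia_archiver' transfer.log
# eyeball a few matches to confirm they really are spiders
egrep -i 'crawler|robot|ia_archiver' transfer.log | head

If the count is non-zero, the regexp itself matches fine, and the problem is in how the filter is applied.)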

Apparently worm entries are derived independently of the Log Filters. It almost looks like the logs are put through the filters, and then a second pass is made that enters the worm/spider log entries into the db, using the same process as regular log entries, but without the benefit of the filters.

The goal is to leave worm/spider statistics where they are so we can look at them, but to keep any log entry generated by a worm/spider from being included in Session data, so they don't skew "real" visitor data.

Thoughts? Thanks!


James


StupidScript
Member since Oct-31-03
10 posts
Nov-04-03, 03:20 PM (PDT)
1. "RE: Keep spiders out of Page Views"
In response to message #0
 
For those of you who are wondering: YES, I did also try the log filters to reject when "spider"!=(not a spider) and "worm"!=(not a worm). I placed them at the top of the filter list so they were acted on before anything else happened. The result was that no spider or worm reports were available (I _do_ want them), and still the hostnames of the spiders and worms dominated the Session/Page View/Paths results.

Still hoping for illumination about how to keep them from being treated as visitors while still being able to view their stats independently. I am happy to write custom stuff and even edit the source and recompile the binary (if I can get the source).

To sum up:
YES want to see spider and worm stats
NO want to include them with "regular" visitor stats

Thanks for any help.
(BTW: I am already filtering out, at the server, the types of worm attacks that Sawmill looks for, and storing those log entries in a separate "hacker.log", so worms don't show up in my "transfer.log" anyway. It's the spiders that I wish to separate from the pack.)


James


i21
Member since Mar-21-02
1731 posts
Nov-05-03, 02:03 AM (PDT)
2. "RE: Keep spiders out of Page Views"
In response to message #1
 
LAST EDITED ON Nov-05-03 AT 02:03 AM (PST)
 
Off the top of my head, you *could* create filters that set to "(empty)" all the fields you don't want Sawmill to track whenever the spider field indicates a spider?

It sounds a little messy though.

You could probably script this outside of Sawmill, so that the spider/worm entries end up in their own log before Sawmill ever sees it?
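
Something like this awk one-liner might do it (untested, and the agent names/file names are just placeholders):

awk '/Googlebot|Slurp|ia_archiver/ { print > "spider.log"; next } { print > "clean.log" }' transfer.log

Then you'd feed clean.log to your main profile and spider.log to a separate one.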

--
Graham


StupidScript
Member since Oct-31-03
10 posts
Nov-05-03, 09:45 AM (PDT)
3. "RE: Keep spiders out of Page Views"
In response to message #2
 
I guess that's the way to go, combined with your response in "Worms and spiders inflating statistics".

First, I'll edit my current logs using a script to transfer the existing spider hits into a separate log file.

Then, in the same way I'm keeping hackers/worms in a secondary log file using Apache's CustomLog directive, I can put future spider hits into a secondary log file.

Both processes will compare the entries against a "spider" list, like Sawmill does.

I guess I can use some freebie log analysis program to look at the worm/spider data, or purchase the Sawmill 5-pack license to accommodate the three configurations.

I was kind of hoping that there was a method already in Sawmill that I could use. Maybe in a future release, there will be...? Maybe something like "if it can be identified as a spider or a worm, don't include it anywhere but spider or worm results"?

When I've got it, I'll post in this thread the script for parsing spiders out of an Apache Combined Log into a separate file, along with my "spiders" list.

Thank you for your help.


James


i21
Member since Mar-21-02
1731 posts
Nov-05-03, 10:08 AM (PDT)
4. "RE: Keep spiders out of Page Views"
In response to message #3
 
No problem.

It's a good suggestion; post it to the "Feature Request" forum and we will see what we can do.
--
Graham


StupidScript
Member since Oct-31-03
10 posts
Nov-05-03, 02:58 PM (PDT)
5. "RE: Keep spiders out of Page Views"
In response to message #4
 
Okay...here's my solution for splitting the spiders out of my transfer.log so they stop confusing the visitor statistics:

1) Modify Apache's httpd.conf to keep spiders from being logged in the same file as "normal" visitors. That gives you clean files so you won't have to do items 2a and 2b over and over again. Here are the httpd.conf directives I used:

# combined log format (4 lines: LogFormat,SetEnvIf,SetEnvIf,CustomLog)
LogFormat "%a %{SID}e %l %u %t \"%v%{Request_URI}e\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" jblog
SetEnvIf Request_URI \.ida|\.exe|\.dll|OPTIONS|CONNECT|\.cfm|\/race\/|formmail|FORMMAIL|Formmail|MSOffice|\/ctruls\/|\"|<|>|XXXXX$ donotlog
SetEnvIf User-Agent AlkalineBOT|appie|<snip>|YahooSeeker|Yandex$ donotlog
CustomLog logs/transfer.log jblog env=!donotlog

# hacker log format (3 lines: LogFormat,SetEnvIf,CustomLog)
LogFormat "%a %{SID}e %l %u %t \"%r\" %>s %b \"%{Referer}i\"" hacklog
SetEnvIf Request_URI \.ida|\.exe|\.dll|OPTIONS|CONNECT|\.cfm|\/race\/|formmail|FORMMAIL|Formmail|MSOffice|\/ctruls\/|\"|<|>|XXXXX$ dolog
CustomLog logs/hacker.log hacklog env=dolog

# spider log format (3 lines: LogFormat,SetEnvIf,CustomLog)
LogFormat "%a %{SID}e %l %u %t \"%r\" %>s %b \"%{Referer}i\"" spiderlog
SetEnvIf User-Agent AlkalineBOT|appie|<snip>|YahooSeeker|Yandex$ spidlog
CustomLog logs/spider.log spiderlog env=spidlog
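
(A couple of notes on the above: SetEnvIf patterns are case-sensitive regular expressions, which is why formmail/FORMMAIL/Formmail all appear; SetEnvIfNoCase is an option if you'd rather not enumerate the variants. And after editing httpd.conf, check the syntax and reload before trusting the new logs:

apachectl configtest
apachectl graceful

Adjust for your own Apache layout, of course.)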

2) Clean spider entries out of existing/historical logs. I chose a two-part process because this is a one-time thing for me. Apologies to non-'Nix admins...you'll need a different process:

a) Separate spider entries into their own log file:
egrep -i -f spider_list.txt -h transfer.log > transfer.S.log

b) Separate non-spider entries to a fresh log file:
egrep -i -v -f spider_list.txt -h transfer.log > transfer.C.log

You can get rid of the original, and use transfer.C.log (renamed) as your primary access log file. Remember to do step 1 first, though, otherwise you will need to do 2a and 2b again.
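
If you have a stack of rotated logs to clean, a short loop handles them all. A rough sketch (my own naming convention; adjust to taste):

#!/bin/sh
# split every rotated transfer log into spider (.S) and clean (.C) halves
for f in transfer.log*; do
  egrep -i -f spider_list.txt "$f" > "$f.S"
  egrep -i -v -f spider_list.txt "$f" > "$f.C"
done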

Here is a <snipped> version of "spider_list.txt", the file used by egrep to define the regular expression matching pattern. Note that this is for an Apache Combined Log Format and should be modified before applying it to any other log format:

^.*\..*\..*\..* .* .* .* \<.*\> \".*\" .* .* \".*\" \".*(AlkalineBOT|appie|<snip>|YahooSeeker|Yandex).*$

It's all one line, and has entries for all of the spiders in Sawmill's "Spiders" config file plus any others you need (see my list at the end of this message), separated by pipes ("or" operators in this expression).
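
(Before committing to the split, you can spot-check that the pattern matches what you expect:

egrep -ic -f spider_list.txt transfer.log
egrep -i -f spider_list.txt transfer.log | head -5

The count should look plausible for your traffic, and the sample lines should all be recognizable spiders.)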

Once I had clean log files, I simply rebuilt the Sawmill db (didn't change any filters), and voila..."real" visitors almost only. "Almost" because a couple of oddball spiders weren't on my list and squeaked through. Small matter.

Enjoy!

(Graham: I did post the suggestion to Feature Requests. Thanks!)

===== COMPLETE SPIDERS LIST I AM USING =====
AlkalineBOT appie Arachnophilia arks ATN_Worldwide Atomz AURESYS BackRub Big Brother Bjaaland BlackWidow BlitzBOT bumblebee Calif Checkbot CMC combine ComputingSite Robi conceptbot CoolBot cosmos crawler Cusco CyberSpyder Deweb Die Blinde Kuh Digger Digimarc CGIReader Digimarc WebReader DIIbot DNAbot DragonBot Duppies DWCP EbiNess EchO elfinbot Emacs-w3 ESIRover esther Evliya Celebi FelixIDE fido Freecrawl FunnelWeb gazz gcreep gestaltIconoclast GetterroboPlus GetURL Golem Googlebot grabber griffon Gromit grub-client Gulliver havIndex Hazel's Ferret Web hopper Hatena Antenna htdig HTMLgobble ia_archiver IAGENT iajaBot IBM_Planetwide IncyWincy Informant Infoseek Sidewinder INGRID inspectorwww Iron33 JavaBee JBot Jeeves JoBo Jobot JoeBot jumpstation Katipo KDD-Explorer KIT-Fireball LabelGrab larbin legs Linkidator LinkScan Server LinkWalker Lockon logo.gif LWP Lycos Magpie marvin MediaFox MerzScope moget Monster Motor mouse.house msnbot MuscatFerret MwdSearch NEC-MeshExplorer Nederland.zoek NetCarta CyberPilot Pro NetMechanic NetScoop newscan-online NHSEWalker Nomad-V2.x NorthStar Occam Openfind data gatherer PackRat PageBoy ParaSite Patric Peregrinator-Mathematics PGP-KA PiltdownMan Pioneer PlumtreeWebAccessor Poppi PortalJuice.com Powermarks psbot Raven RHCS Road Runner Robbie Robofox robot Robozilla root Roverbot RuLeS Scooter search Senrigan SG-Scout Shagseeker Shai'Hulud SimBot Site Valet SiteTech-Rover Slurp Snooper Solbot Spanner spider StackRambler suke suntek Tarantula TechBOT Teleport Pro Telesoft Templeton teoma test-url The Homepage Finder TITAN TitIn urlck Valkyrie Victoria Voyager VWbot_K w3index W3M2 w3mir WallPaper Web Downloader WebBandit WebCatcher WebCopy WebDAV-MiniRedir WebFetcher weblayers WebLinker WebMoose WebQuest WebReaper webs@ WebTrends webvac webwalk WebWalker WebWatch Wget whatUseek_winona wired-digital-newsbot WOLP WWWC WWWWanderer XGET YahooSeeker Yandex
===== END SPIDERS LIST =====


James



 
 