Okay...here's my solution for splitting the spiders out of my transfer.log so they don't confuse the visitor stats:
1) Modify Apache's httpd.conf so spiders aren't logged in the same file as "normal" visitors. That gives you clean files going forward, so you won't have to repeat steps 2a and 2b. Here are the httpd.conf directives I used:
# combined log format (4 lines: LogFormat,SetEnvIf,SetEnvIf,CustomLog)
LogFormat "%a %{SID}e %l %u %t \"%v%{Request_URI}e\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" jblog
SetEnvIf Request_URI \.ida|\.exe|\.dll|OPTIONS|CONNECT|\.cfm|\/race\/|formmail|FORMMAIL|Formmail|MSOffice|\/ctruls\/|\"|<|>|XXXXX$ donotlog
SetEnvIf User-Agent AlkalineBOT|appie|<snip>|YahooSeeker|Yandex$ donotlog
CustomLog logs/transfer.log jblog env=!donotlog
# hacker log format (3 lines: LogFormat,SetEnvIf,CustomLog)
LogFormat "%a %{SID}e %l %u %t \"%r\" %>s %b \"%{Referer}i\"" hacklog
SetEnvIf Request_URI \.ida|\.exe|\.dll|OPTIONS|CONNECT|\.cfm|\/race\/|formmail|FORMMAIL|Formmail|MSOffice|\/ctruls\/|\"|<|>|XXXXX$ dolog
CustomLog logs/hacker.log hacklog env=dolog
# spider log format (3 lines: LogFormat,SetEnvIf,CustomLog)
LogFormat "%a %{SID}e %l %u %t \"%r\" %>s %b \"%{Referer}i\"" spiderlog
SetEnvIf User-Agent AlkalineBOT|appie|<snip>|YahooSeeker|Yandex$ spidlog
CustomLog logs/spider.log spiderlog env=spidlog
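If you want to sanity-check your patterns before restarting Apache, here is a rough shell sketch of how the three SetEnvIf rules above route a request. The two pattern variables are shortened, hypothetical stand-ins for the real lists, and classify_request is a made-up helper name. grep -E only approximates Apache's regex engine, so treat this as a quick check and not a faithful emulation.

```shell
#!/bin/sh
# Hypothetical, shortened stand-ins for the (snipped) httpd.conf patterns.
URI_PATTERN='\.ida|\.exe|formmail'
UA_PATTERN='Googlebot|YahooSeeker|Yandex$'

# classify_request URI USER_AGENT
# Prints the log file(s) the request would land in under the rules above:
#   URI match        -> hacker.log  (dolog, and donotlog keeps it out of transfer.log)
#   User-Agent match -> spider.log  (spidlog, and donotlog likewise)
#   neither          -> transfer.log
classify_request() {
    uri=$1
    ua=$2
    dest=""
    # Case-sensitive on purpose: SetEnvIf is case-sensitive too.
    if printf '%s\n' "$uri" | grep -E -q -e "$URI_PATTERN"; then
        dest="hacker.log"
    fi
    if printf '%s\n' "$ua" | grep -E -q -e "$UA_PATTERN"; then
        dest="${dest:+$dest }spider.log"
    fi
    echo "${dest:-transfer.log}"
}
```

For example, classify_request /index.html 'Googlebot/2.1' prints spider.log. One thing the sketch makes visible: the trailing $ binds only to the last alternative, so in these patterns "Yandex" matches only at the very end of the User-Agent string while the other names match anywhere in it.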
2) Clean spider entries out of existing/historical logs. I chose a two-part process because this is a one-time thing for me. Apologies to non-'Nix admins...you'll need a different process:
a) Separate spider entries into their own log file:
egrep -i -f spider_list.txt -hh transfer.log > transfer.S.log
b) Separate non-spider entries to a fresh log file:
egrep -i -v -f spider_list.txt -hh transfer.log > transfer.C.log
You can get rid of the original and use transfer.C.log (renamed) as your primary access log file. Remember to do step 1 first, though; otherwise you will need to repeat 2a and 2b.
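Before deleting the original, it's worth confirming that the two output files add up to it. A minimal sketch, assuming a log file name ending in .log; split_log is a hypothetical helper name, and grep -E is used as the equivalent of egrep:

```shell
#!/bin/sh
# split_log PATTERN_FILE LOG_FILE
# Writes matching lines to <name>.S.log and the rest to <name>.C.log,
# then checks that the two line counts add up to the original.
split_log() {
    patterns=$1
    log=$2
    base=${log%.log}   # strip the trailing ".log"

    # -i ignores case, -f reads the pattern(s) from a file, -v inverts the match.
    grep -E -i -f "$patterns" "$log" > "$base.S.log"
    grep -E -i -v -f "$patterns" "$log" > "$base.C.log"

    # tr strips the padding some wc implementations add.
    total=$(wc -l < "$log" | tr -d ' ')
    spiders=$(wc -l < "$base.S.log" | tr -d ' ')
    clean=$(wc -l < "$base.C.log" | tr -d ' ')

    if [ $((spiders + clean)) -eq "$total" ]; then
        echo "ok: $spiders spider + $clean clean = $total lines"
    else
        echo "mismatch: $spiders spider + $clean clean != $total lines" >&2
        return 1
    fi
}
```

Called as split_log spider_list.txt transfer.log, it produces the same transfer.S.log and transfer.C.log as steps 2a and 2b, plus the count check.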
Here is a <snipped> version of "spider_list.txt", the file used by egrep to define the regular expression matching pattern. Note that this is for an Apache Combined Log Format and should be modified before applying it to any other log format:
^.*\..*\..*\..* .* .* .* \<.*\> \".*\" .* .* \".*\" \".*(AlkalineBOT|appie|<snip>|YahooSeeker|Yandex).*$
It's all one line, and has entries for all of the spiders in Sawmill's "Spiders" config file plus any others you need (see my list at the end of this message), separated by pipes ("or" operators in this expression).
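Typing the pipe-separated group by hand gets tedious, and some spider names contain regex metacharacters (the "." in PortalJuice.com, for instance, would otherwise match any character). Here is one possible way to generate just the parenthesized group from a file with one spider name per line. Both the names file and the build_alternation helper are my own assumptions, not anything Sawmill provides; you would still paste the result into the rest of the pattern yourself.

```shell
#!/bin/sh
# build_alternation NAMES_FILE
# Reads one spider name per line and prints "(name1|name2|...)" with
# extended-regex metacharacters backslash-escaped.
build_alternation() {
    # 1st sed expression: drop empty lines.
    # 2nd sed expression: put a backslash before ] [ \ . | $ ( ) { } ? + * ^
    # paste -s joins all lines into one, separated by "|".
    printf '(%s)\n' "$(sed -e '/^$/d' -e 's/[][\.|$(){}?+*^]/\\&/g' "$1" | paste -s -d '|' -)"
}
```

Running build_alternation spider_names.txt (a hypothetical file name) prints the group on one line, ready to drop in before the trailing .*$ of the pattern.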
Once I had clean log files, I simply rebuilt the Sawmill db (didn't change any filters), and voila..."real" visitors almost only. "Almost" because a couple of oddball spiders weren't on my list and squeaked through. Small matter.
Enjoy!
(Graham: I did post the suggestion to Feature Requests. Thanks!)
===== COMPLETE SPIDERS LIST I AM USING =====
AlkalineBOT appie Arachnophilia arks ATN_Worldwide Atomz AURESYS BackRub Big Brother Bjaaland BlackWidow BlitzBOT bumblebee Calif Checkbot CMC combine ComputingSite Robi conceptbot CoolBot cosmos crawler Cusco CyberSpyder Deweb Die Blinde Kuh Digger Digimarc CGIReader Digimarc WebReader DIIbot DNAbot DragonBot Duppies DWCP EbiNess EchO elfinbot Emacs-w3 ESIRover esther Evliya Celebi FelixIDE fido Freecrawl FunnelWeb gazz gcreep gestaltIconoclast GetterroboPlus GetURL Golem Googlebot grabber griffon Gromit grub-client Gulliver havIndex Hazel's Ferret Web hopper Hatena Antenna htdig HTMLgobble ia_archiver IAGENT iajaBot IBM_Planetwide IncyWincy Informant Infoseek Sidewinder INGRID inspectorwww Iron33 JavaBee JBot Jeeves JoBo Jobot JoeBot jumpstation Katipo KDD-Explorer KIT-Fireball LabelGrab larbin legs Linkidator LinkScan Server LinkWalker Lockon logo.gif LWP Lycos Magpie marvin MediaFox MerzScope moget Monster Motor mouse.house msnbot MuscatFerret MwdSearch NEC-MeshExplorer Nederland.zoek NetCarta CyberPilot Pro NetMechanic NetScoop newscan-online NHSEWalker Nomad-V2.x NorthStar Occam Openfind data gatherer PackRat PageBoy ParaSite Patric Peregrinator-Mathematics PGP-KA PiltdownMan Pioneer PlumtreeWebAccessor Poppi PortalJuice.com Powermarks psbot Raven RHCS Road Runner Robbie Robofox robot Robozilla root Roverbot RuLeS Scooter search Senrigan SG-Scout Shagseeker Shai'Hulud SimBot Site Valet SiteTech-Rover Slurp Snooper Solbot Spanner spider StackRambler suke suntek Tarantula TechBOT Teleport Pro Telesoft Templeton teoma test-url The Homepage Finder TITAN TitIn urlck Valkyrie Victoria Voyager VWbot_K w3index W3M2 w3mir WallPaper Web Downloader WebBandit WebCatcher WebCopy WebDAV-MiniRedir WebFetcher weblayers WebLinker WebMoose WebQuest WebReaper webs@ WebTrends webvac webwalk WebWalker WebWatch Wget whatUseek_winona wired-digital-newsbot WOLP WWWC WWWWanderer XGET YahooSeeker Yandex
===== END SPIDERS LIST =====
James