Your code is opening, reading, and closing a file for every line, and that will definitely be slow. A better approach, which still lets you use an external text file, is to use a CFG file, for example one called regexps.cfg, like this:
regexps = {
1 = "<regexp1>"
2 = "<regexp2>"
3 = "<regexp3>"
}
Then, the log filter can use node operations, instead of file operations, to refer to node "regexps" directly:
node regexp;
foreach regexp 'regexps' (
if (matches_regular_expression(url, @regexp)) then (
...
);
);
By referring to 'regexps' above, you are referring to the node 'regexps', which is the file 'regexps.cfg' in LogAnalysisInfo; Sawmill reads that in automatically, but only once, and from then on it is very fast to access.
However, I *think* the regular expressions will still have to be recompiled each time, since they aren't constant strings in the filter (which would allow them to be compiled once, and then cached). Therefore, this is faster:
if (matches_regular_expression(url, '<regexp1>') or
matches_regular_expression(url, '<regexp2>') or
matches_regular_expression(url, '<regexp3>'))
then 'reject'
In a sense, the fast approach (having the regular expressions in the profile) is almost as easy to maintain as the regular-expressions-in-a-file approaches, because the profile *is* a text file (in LogAnalysisInfo/profiles). If you ignore all the other text in that file, you've just got a list of regular expressions again, in a text file, which you can manage with a script or manually with a text editor. So, for the best performance without *too* much loss in maintainability, put the expressions in the profile; for slower performance but easier maintenance, put them in a CFG file.
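As a rough illustration of the "manage with a script" idea, here is a minimal Python sketch (Python, not Sawmill's filter language) that turns a list of patterns into the hard-coded expression shown above. The function name build_reject_filter and its field parameter are hypothetical. It pastes each pattern between single quotes verbatim; since no particular string-escaping rules are assumed, it simply refuses any pattern that contains a single quote.

```python
def build_reject_filter(patterns, field="url"):
    """Build log filter text that rejects a line when any pattern matches.

    Hypothetical helper: patterns are inserted verbatim between single
    quotes, so patterns containing a single quote are refused rather than
    escaped (no escaping rules are assumed here).
    """
    # Strip surrounding whitespace/newlines and drop blank lines.
    cleaned = [p.strip() for p in patterns if p.strip()]
    if not cleaned:
        raise ValueError("no patterns given")
    for p in cleaned:
        if "'" in p:
            raise ValueError(f"pattern contains a single quote: {p!r}")

    # One matches_regular_expression() call per pattern, joined with "or".
    calls = [f"matches_regular_expression({field}, '{p}')" for p in cleaned]
    return "if (" + " or\n    ".join(calls) + ")\nthen 'reject'"


if __name__ == "__main__":
    # Placeholder patterns, as in the example above.
    print(build_reject_filter(["<regexp1>", "<regexp2>", "<regexp3>"]))
```

The generated text would still have to be placed into the log filter in the profile, by hand or by your existing update scripts; how that splice is done is left open here.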
Another option, somewhere between the two, is to put the *code* of the log filter in a CFV file, and then use compile() and exec() to compile it in filter_initialization and execute it from the filter. This should allow the regular expressions to be compiled (which makes them fast), while limiting the text you have to maintain to just the log filter, rather than the whole profile. I can give you more details if you like. Heck, here are some details, off the cuff: put the log filter text in "myfilter.cfv" in LogAnalysisInfo, and then have this in the profile:
log.filter_initialization = `node myfiltercode = compile('myfilter');`
and then this in the log filter itself:
execute(myfiltercode);
That's all untested, but hopefully correct.
Greg
>Hi Greg,
>
>>What about including this list in the log filter itself,
>>like this:
>>
>> if (matches_regular_expression(url, '<regexp1>') or
>> matches_regular_expression(url, '<regexp2>') or
>> matches_regular_expression(url, '<regexp3>'))
>> then 'reject'
>
>This would be fine except when there are a large number
>of regexps and they are subject to change/modification
>regularly. Some are extremely long as well.
>
>>
>>It is also possible to have a list of regular expressions in
>>a CFG file, and have the log filter iterate through them
>>with a "for" loop; but the performance will be better if you
>>hard-code them as above, because Sawmill will automatically
>>compile them in advance, and use the compiled versions for
>>each line.
>
>I was able to mock up a log filter to open up a blacklist
>containing ad URLs and iterate through it comparing each
>log URL with each ad URL, rejecting based on a match. (I've
>pasted the code below. Feel free to critique it.) However,
>the performance was *terrible* as you probably can imagine.
>It would probably be a lot faster if I were to run Sawmill
>over MySQL and somehow offload the filtering to the DB
>engine.
>
>As I mentioned in my first post, the motive here is to be
>able to use existing blacklist files as the source for this
>filtering. This is low maintenance because it only requires
>pointing the path to the existing blacklist directories.
>These lists are already kept up to date by various update
>scripts and manual editing. Filtering on these lists will
>produce much more accurate stats on content filter/proxy
>servers as the vast majority of ad sites are hit without the
>user intending to hit them.
>
>Thanks for the help.
>
>Chris
>
>
>
>string fh;
>string adv_url;
>
>fh = (open_file('LogAnalysisInfo/BL/adv/domains', 'r'));
>
>#echo('Blacklist opened...');
>
>while (!(end_of_file(fh))) (
>
> adv_url = read_line_from_file(fh);
># echo('Testing url:' . adv_url);
> if (contains(url, adv_url)) then "reject";
>
>);
>
>#echo('Blacklist closed...');
>
>close_file(fh);
>
>