Sawmill Discussion Forum

Subject: "Regex Filter Expansion"
Feature Requests, Topic #334
fbcit
Member since Jan-21-09
8 posts
Jan-21-09, 07:07 AM (PDT)
"Regex Filter Expansion"
 
It would be nice to be able to build a regex filter that points to a file containing a list of regexes to filter on. This capability is available in Squid ACLs and DansGuardian lists. Based on my own programming experience, it should be trivial to implement.

The motivation behind this request is this: such a feature would let one filter out ad traffic using standard, readily available ad blacklist files, rather than having to construct unique regex filters for each site or group of sites.

Thanks,
Chris


dgilmoreadmin
Member since Nov-18-04
2925 posts
Jan-21-09, 05:41 PM (PDT)
1. "RE: Regex Filter Expansion"
In response to message #0
 
LAST EDITED ON Jan-21-09 AT 05:41 PM (PDT)
 
Hi-

I've sent this request over to development, and it will be added to the feature request list.

I will check to see if there's something that can be done with the existing infrastructure of Sawmill 8.

David
Sawmill Product Support Team
support@flowerfire.com


ferraradmin
Member since Sep-5-01
3437 posts
Jan-26-09, 04:26 PM (PDT)
2. "RE: Regex Filter Expansion"
In response to message #0
 
What about including this list in the log filter itself, like this:

if (matches_regular_expression(url, '<regexp1>') or
    matches_regular_expression(url, '<regexp2>') or
    matches_regular_expression(url, '<regexp3>'))
then 'reject'

It is also possible to have a list of regular expressions in a CFG file, and have the log filter iterate through them with a "for" loop; but the performance will be better if you hard-code them as above, because Sawmill will automatically compile them in advance, and use the compiled versions for each line.
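
To illustrate the "compile in advance" point outside of Sawmill, here is a minimal Python sketch of the same idea: the patterns are compiled once, up front, and the compiled objects are reused for every log line. This is only an analogy, not Sawmill's implementation; the pattern list and the should_reject name are made up for the example, and substring-style matching (search) is an assumption.

```python
import re

# Hypothetical patterns standing in for <regexp1>, <regexp2>, <regexp3>.
AD_PATTERNS = [r"ads\.example\.com", r"/banners?/", r"adserver"]

# Compile once, before any log line is processed.
COMPILED = [re.compile(p) for p in AD_PATTERNS]


def should_reject(url: str) -> bool:
    """Return True if any precompiled pattern matches somewhere in the URL."""
    return any(rx.search(url) is not None for rx in COMPILED)


if __name__ == "__main__":
    for url in ["http://ads.example.com/x.gif", "http://example.org/index.html"]:
        print(url, "reject" if should_reject(url) else "keep")
```

The per-line cost is then only the matching itself; nothing is parsed or compiled inside the loop.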


fbcit
Member since Jan-21-09
8 posts
Jan-26-09, 05:18 PM (PDT)
3. "RE: Regex Filter Expansion"
In response to message #2
 
   Hi Greg,

>What about including this list in the log filter itself,
>like this:
>
> if (matches_regular_expression(url, '<regexp1>') or
> matches_regular_expression(url, '<regexp2>') or
> matches_regular_expression(url, '<regexp3>'))
> then 'reject'

This would be fine except when there are a large number of regexps and they change regularly. Some are also extremely long.

>
>It is also possible to have a list of regular expressions in
>a CFG file, and have the log filter iterate through them
>with a "for" loop; but the performance will be better if you
>hard-code them as above, because Sawmill will automatically
>compile them in advance, and use the compiled versions for
>each line.

I was able to mock up a log filter that opens a blacklist of ad URLs and iterates through it, comparing each log URL with each ad URL and rejecting on a match. (I've pasted the code below. Feel free to critique it.) However, the performance was *terrible*, as you can probably imagine. It would probably be a lot faster if I ran Sawmill over MySQL and somehow offloaded the filtering to the DB engine.

As I mentioned in my first post, the motive here is to be able to use existing blacklist files as the source for this filtering. This is low maintenance because it only requires pointing the path to the existing blacklist directories. These lists are already kept up to date by various update scripts and manual editing. Filtering on these lists will produce much more accurate stats on content filter/proxy servers, as the vast majority of ad sites are hit without the user intending to hit them.

Thanks for the help.

Chris

string fh;
string adv_url;

fh = (open_file('LogAnalysisInfo/BL/adv/domains', 'r'));

#echo('Blacklist opened...');

while (!(end_of_file(fh))) (

  adv_url = read_line_from_file(fh);
  # echo('Testing url:' . adv_url);
  if (contains(url, adv_url)) then "reject";

);

#echo('Blacklist closed...');

close_file(fh);


ferraradmin
Member since Sep-5-01
3437 posts
Feb-12-09, 02:25 PM (PDT)
4. "RE: Regex Filter Expansion"
In response to message #3
 
Your code is opening, reading, and closing a file for every line, and that definitely will be slow. A better approach, which still lets you use an external text file, is to use a CFG file, for example called regexps.cfg, like this:

regexps = {
  1 = "<regexp1>"
  2 = "<regexp2>"
  3 = "<regexp3>"
}

Then, the log filter can use node operations, instead of file operations, to refer to node "regexps" directly:

node regexp;
foreach regexp 'regexps' (
  if (matches_regular_expression(file, @regexp)) then (
    ...
  );
);

By referring to 'regexps' above, you are referring to the node 'regexps', which is the file 'regexps.cfg' in LogAnalysisInfo; Sawmill reads that in automatically, but only once, and from then on it is very fast to access.
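
Since the existing blacklists are plain lists of domains, a small script could regenerate regexps.cfg from them whenever they change. Here is a rough Python sketch, using the node layout shown above. The function name, the skipping of blank and "#" lines, and the assumption that backslashes need no extra escaping inside the quoted CFG values are all mine and untested against Sawmill.

```python
import re


def blacklist_to_cfg(domains, node_name="regexps"):
    """Render an iterable of domain lines as a numbered CFG node of regexes."""
    lines = [f"{node_name} = {{"]
    index = 0
    for raw in domains:
        domain = raw.strip()
        if not domain or domain.startswith("#"):
            continue  # assumption: blank lines and '#' comments are not entries
        index += 1
        # re.escape turns the literal domain into a regex that matches it exactly.
        lines.append(f'  {index} = "{re.escape(domain)}"')
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = ["ads.example.com\n", "\n", "tracker.example.net\n"]
    print(blacklist_to_cfg(sample), end="")
```

The output would be written to regexps.cfg in LogAnalysisInfo, for example from the same update script that refreshes the blacklist.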

However, I *think* the regular expressions will still have to be recompiled each time, since they aren't constant strings in the filter (which would allow them to be compiled once, and then cached). Therefore, this is faster:

if (matches_regular_expression(url, '<regexp1>') or
    matches_regular_expression(url, '<regexp2>') or
    matches_regular_expression(url, '<regexp3>'))
then 'reject'

In a sense, the fast approach (having the regular expressions in the profile) is almost as easily maintainable as the regular-expressions-in-a-file approaches, because the profile *is* a text file (in LogAnalysisInfo/profiles), so if you ignore all the other text in the file, you've just got a list of regular expressions again, in a text file, which you can manage with a script, or manually with a text editor. So, for best performance, and not *too* much loss in maintainability, put the expressions in the profile; for slower performance, but easy maintainability, put them in a CFG file.
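
As a sketch of the "manage with a script" idea, the following Python function builds the hard-coded filter expression from a list of patterns, in the same form as the example above. The function name and the rule of refusing patterns that contain a single quote are assumptions for illustration; how the generated text gets spliced into the profile is left out.

```python
def build_reject_filter(patterns, field="url"):
    """Build one "if (... or ...) then 'reject'" expression from regex strings."""
    if not patterns:
        raise ValueError("at least one pattern is required")
    for p in patterns:
        if "'" in p:
            # Assumption: quoting rules are unknown, so refuse rather than guess.
            raise ValueError(f"pattern contains a single quote: {p!r}")
    calls = [f"matches_regular_expression({field}, '{p}')" for p in patterns]
    condition = " or\n    ".join(calls)
    return f"if ({condition})\nthen 'reject'"


if __name__ == "__main__":
    print(build_reject_filter(["<regexp1>", "<regexp2>", "<regexp3>"]))
```

Run on the three placeholder patterns, this prints the same four-line expression shown above, so the profile's filter text can be regenerated whenever the list changes.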

Another option, somewhere between the two, is to put the *code* of the log filter in a CFV file, and then use compile() and exec() to compile it in filter_initialization and execute it from the filter. This should allow the regular expressions to be compiled (which makes them fast), while limiting the text file you have to maintain to just the text of the log filter, rather than the whole profile file. I can give you more details if you like. Heck, here are some details, off the cuff: put the log filter text in "myfilter.cfv" in LogAnalysisInfo, and then have this in the profile:

log.filter_initialization = `node myfiltercode = compile('myfilter');`

and then this in the log filter itself:

execute(myfiltercode);

That's all untested, but hopefully correct.


Greg

