Regular Expressions
Regular expressions are a powerful method for defining a class of strings (strings are sequences of characters; for instance, a filename is a string, and so is a log entry). Sawmill uses regular expressions in many places, including:
-
You can specify the log files to process using a regular expression.
-
You can specify the log file format using the Log data format regular expression option.
-
You can filter log entries based on a regular expression using log filters.
You can also use wildcard expressions in these cases, using * to match any string of characters, or ? to match any single character (for instance, *.gif to match anything ending with .gif, or Jan/??/2000 to match any date in January, 2000). Wildcard expressions are easier to use than regular expressions, but are not nearly as powerful.
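As a rough illustration of the two wildcard characters, the sketch below uses Python's standard fnmatch module. This is an assumption made for demonstration only: it is not Sawmill's own matcher, and the sample file names and dates are made up.

```python
from fnmatch import fnmatchcase

# * matches any run of characters; ? matches exactly one character.
gif_pattern = "*.gif"
january_pattern = "Jan/??/2000"

# Hypothetical sample values, not taken from any real log.
names = ["logo.gif", "logo.gif.bak", "index.html"]
dates = ["Jan/05/2000", "Jan/5/2000", "Feb/05/2000"]

gif_matches = [n for n in names if fnmatchcase(n, gif_pattern)]
january_matches = [d for d in dates if fnmatchcase(d, january_pattern)]

print(gif_matches)      # ['logo.gif']
print(january_matches)  # ['Jan/05/2000']
```

Note that Jan/5/2000 is rejected because each ? must consume exactly one character, so a one-digit day does not fit the pattern.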
Regular expressions can be extremely complex, and it is beyond the scope of this manual to describe them in full detail.
In brief, a regular expression is a pattern: essentially the string to match, plus special characters that match classes of strings, plus operators to combine them. Here are the simplest rules:
-
A letter or digit matches itself (most other characters do as well).
-
The . character (a period) matches any character.
-
The * character matches zero or more repetitions of the expression before it.
-
The + character matches one or more repetitions of the expression before it.
-
The ^ character matches the beginning of the string, or the beginning of a line in a multi-line string.
-
The $ character matches the end of the string, or the end of a line in a multi-line string.
-
A square-bracketed series of characters matches any of those characters. Adding a ^ after the opening bracket matches any character except those in the brackets.
-
Two regular expressions in a row match any string whose first part matches the first expression and whose remaining part matches the second expression. The two parts need not be the same length.
-
The \ character followed by any other character matches that character. For example, \* matches the * character.
-
A regular expression matches if it matches any part of the string; i.e., unless you explicitly include ^ and/or $, the regular expression will match if it matches something in the middle of the string. For example, "access\.log" matches not only access.log but also old_access.log and access.logs.
-
A parenthesized regular expression matches the same thing as it does without parentheses, but is treated as a single expression (for instance, by a trailing *, which then repeats the whole group). Parentheses can be used to group consecutive expressions into a single expression. Each field should be parenthesized when using the Log data format regular expression option; that's how Sawmill knows where each field is.
-
An expression of the form (A|B) matches either expression A or expression B. There can also be more than two expressions in the list.
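The rules above can be tried out in any regular expression engine. The following is a minimal sketch using Python's re module, which is an assumption made for illustration; Sawmill's own dialect may differ in details. The helper name matches and the sample log line are hypothetical.

```python
import re

def matches(pattern, value):
    """Return True if the pattern matches anywhere in value (unanchored search)."""
    return re.search(pattern, value) is not None

# . matches any character; * and + repeat the expression before them.
assert matches(r"a.c", "abc")
assert matches(r"ab*c", "ac")        # zero repetitions of b
assert not matches(r"ab+c", "ac")    # + needs at least one b

# Without ^ and $, a match in the middle of the string is enough.
assert matches(r"access\.log", "old_access.logs")
assert not matches(r"^access\.log$", "old_access.logs")

# Brackets match one character from a set; a leading ^ negates the set.
assert matches(r"^[0-9]+$", "2004")
assert not matches(r"^[^0-9]+$", "jan2004")

# Parentheses group; | chooses between alternatives.
assert matches(r"^(ab)+$", "ababab")
assert matches(r"^(jan|feb|mar)$", "feb")

# Parenthesized groups also capture text, which is how fields can be picked out.
# The line format here is made up for the example.
line = "10.0.0.1 GET /index.html"
fields = re.search(r"^([^ ]+) ([^ ]+) ([^ ]+)$", line).groups()
print(fields)  # ('10.0.0.1', 'GET', '/index.html')
```

Each assert passes silently when the rule behaves as described, so the only output is the tuple of captured fields.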
The list goes on, but is too large to include here in complete form. See the Yahoo link above. Some examples:
-
a matches any value containing the letter a.
-
ac matches any value containing the letter a followed by the letter c.
-
word matches any value containing the sequence "word".
-
worda* matches any value containing the sequence "word" followed by zero or more a's.
-
(word)*a matches any value containing zero or more consecutive repetitions of "word" followed by an a. Because zero repetitions are allowed, this matches any value containing an a.
-
\.log$ matches any value ending with .log (good for matching all files in a directory ending with .log).
-
^ex.*\.log$ matches any value starting with ex and ending with .log.
-
^access_log.*1 matches any value starting with "access_log", and containing a 1 somewhere after the leading "access_log" (note that the 1 does not have to be at the end of the string for this to match; if you want to require that the 1 be at the end, add a $ to the end of the expression).
-
^access_log_jan....2004$ matches any value starting with "access_log_jan", followed by four characters (any four characters), followed by "2004", followed immediately by the end of the value.
As you can see, regular expressions are extremely powerful; a pattern can be devised to match almost any conceivable need.
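One way to check patterns like the examples above before using them is to run them against a few sample values. The sketch below does this with Python's re module (an assumption; it is not Sawmill's engine), and the sample values are invented for the test.

```python
import re

# Each entry: pattern -> (values that should match, values that should not).
examples = {
    r"worda*": (["word", "swordaaa"], ["wor"]),
    r"(word)*a": (["worda", "banana"], ["word"]),
    r"\.log$": (["access.log"], ["access.logs", "access_log"]),
    r"^ex.*\.log$": (["ex010203.log"], ["old_ex.log", "ex.log.gz"]),
    r"^access_log.*1": (["access_log_01_b"], ["access_log_2", "1_access_log"]),
    r"^access_log_jan....2004$": (["access_log_jan_15_2004"],
                                  ["access_log_jan_5_2004"]),
}

def check(pattern, value):
    """Unanchored search, mirroring the 'matches any part' rule."""
    return re.search(pattern, value) is not None

results = {}
for pattern, (should_match, should_not_match) in examples.items():
    ok = (all(check(pattern, v) for v in should_match)
          and not any(check(pattern, v) for v in should_not_match))
    results[pattern] = ok
    print(pattern, "ok" if ok else "MISMATCH")
```

Keeping a small table of expected matches and non-matches like this makes it easier to notice when a pattern is looser than intended, for example a missing $ at the end.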
NOTE ABOUT FILTERS
Both regular expression pattern filters and DOS-style pattern filters are necessary in some cases, but they should be avoided when possible because pattern filters can be considerably slower than the simpler filter types like "ends with" or "contains". If you can create a filter without patterns, do so; your log processing will be faster.
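To illustrate when a pattern can be replaced by a simpler test, the sketch below compares the \.log$ pattern with a plain "ends with" check, and the unanchored access\.log pattern with a plain "contains" check. It is written in Python purely as an analogy; it says nothing about how Sawmill implements its filters, and the file names are hypothetical.

```python
import re

# Hypothetical file names for the comparison.
filenames = ["access.log", "access.logs", "error.log", "notes.txt"]

# Pattern-based test: the regular expression \.log$.
log_suffix = re.compile(r"\.log$")
by_pattern = [f for f in filenames if log_suffix.search(f)]

# Equivalent "ends with" test: a plain suffix comparison, no pattern engine.
by_suffix = [f for f in filenames if f.endswith(".log")]

# Equivalent "contains" test for the unanchored pattern access\.log.
by_contains = [f for f in filenames if "access.log" in f]

print(by_pattern == by_suffix)  # True
print(by_contains)              # ['access.log', 'access.logs']
```

When the simple test selects exactly the same values as the pattern, as it does here, the pattern adds nothing and the simpler filter type is the better choice.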