FAQ: Combining Referring Domains


How can I combine referrers, so hits from http://search.yahoo.com, http://dir.yahoo.com, and http://google.yahoo.com are combined into a single entry?

Short Answer

Create a log filter converting all the hostnames to the same hostname.

Long Answer

You can do this by converting all of the hostnames to a single hostname so that, for instance, they all appear as http://yahoo.com referrers. To do this, convert every occurrence of /search.yahoo.com/, /dir.yahoo.com/, or /www.yahoo.com/ in the referrer field into /yahoo.com/. The easiest way is to make three log filters, one per hostname, in the Log Filters section of the Config part of your profile:

Then rebuild the database; the resulting statistics will combine all three referrers in a single /yahoo.com/ referrer.
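The effect of the three replacements can be sketched in Python. This is only an illustration of the logic, not Sawmill filter syntax; the function name combine_referrer and the sample URLs are hypothetical.

```python
# Illustration only: in Sawmill this is done with log filters, not Python.
# The hostnames come from the text above; combine_referrer is a hypothetical helper.
HOSTNAMES_TO_MERGE = ("/search.yahoo.com/", "/dir.yahoo.com/", "/www.yahoo.com/")
MERGED_HOSTNAME = "/yahoo.com/"


def combine_referrer(referrer: str) -> str:
    """Replace each listed hostname with the merged one, one replacement per hostname."""
    for hostname in HOSTNAMES_TO_MERGE:
        referrer = referrer.replace(hostname, MERGED_HOSTNAME)
    return referrer


if __name__ == "__main__":
    samples = (
        "http://search.yahoo.com/search?p=sawmill",
        "http://dir.yahoo.com/Computers/",
        "http://www.example.com/",
    )
    for url in samples:
        print(url, "->", combine_referrer(url))
```

Because the match strings include the surrounding slashes, a referrer with no trailing slash after the hostname (for example http://search.yahoo.com) is left unchanged in this sketch.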

A more sophisticated filter is necessary if you need to preserve some parts of the URL while converting others. In that case, you can use a regular expression filter:
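As an illustration of the logic only, here is a Python sketch of such a match-and-rewrite step. The pattern is reconstructed from the element-by-element description in this answer, so treat it as an assumption rather than the exact filter text; collapse_mail_referrer is a hypothetical name, and Python's match.group(1) plays the role of $1.

```python
import re

# Assumed pattern, rebuilt from the description in this FAQ:
# ^ anchors the start, [0-9]* matches the integer N, \. matches a literal period,
# and (.*)$ captures everything after /ym/ up to the end of the referrer.
YAHOO_MAIL_PATTERN = re.compile(r"^http://us\.f[0-9]*\.mail\.yahoo\.com/ym/(.*)$")


def collapse_mail_referrer(referrer: str) -> str:
    """Collapse us.fN.mail.yahoo.com referrers while keeping the part after /ym/."""
    match = YAHOO_MAIL_PATTERN.match(referrer)
    if match is None:
        return referrer  # not a Yahoo mail referrer; leave it alone
    # The * in the result is a literal character, not a wildcard.
    return "http://us.f*.mail.yahoo.com/" + match.group(1)


if __name__ == "__main__":
    print(collapse_mail_referrer("http://us.f12.mail.yahoo.com/ym/ShowLetter?MsgId=1"))
```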

This works by matching any referrer starting with http://us.fN.mail.yahoo.com/ym/ (where N is any integer) and, while matching, extracting everything after /ym/ into the variable $1. The leading ^ ensures that the referrer starts with http://, the trailing $ ensures that the parenthesized .* section contains all of the remainder after /ym/, [0-9]* matches any integer, and \. matches a single period (see Regular Expressions for more information about regular expressions). If the referrer matches, the filter sets the referrer field to http://us.f*.mail.yahoo.com/$1, where $1 is the value extracted from the original URL. This allows you to collapse all http://us.fN.mail.yahoo.com/ URLs into a single one without losing the extra data beyond /ym/. If you don't care about the data beyond /ym/, you can use a somewhat simpler (or at least easier-to-understand) filter:

This one uses a wildcard comparison (if matches wildcard expression) rather than a regular expression, which allows the use of * in the expression in its more generally used meaning of "match anything". Note also that in the first line, * appears twice and each time matches anything, but in the second line it appears only once, and is a literal *, not a "match-anything" character.
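The wildcard variant can be sketched the same way in Python, using the standard fnmatch module as a stand-in for Sawmill's wildcard comparison. The pattern and the replacement value are assumptions based on the description above, and collapse_mail_referrer_simple is a hypothetical name.

```python
from fnmatch import fnmatchcase

# Assumed wildcard pattern: both * characters here mean "match anything".
WILDCARD_PATTERN = "http://us.f*.mail.yahoo.com/ym/*"
# Assumed replacement: the single * here is a literal character in the stored referrer.
COLLAPSED_REFERRER = "http://us.f*.mail.yahoo.com/"


def collapse_mail_referrer_simple(referrer: str) -> str:
    """Collapse matching referrers to one fixed value, discarding the data after /ym/."""
    if fnmatchcase(referrer, WILDCARD_PATTERN):
        return COLLAPSED_REFERRER
    return referrer


if __name__ == "__main__":
    print(collapse_mail_referrer_simple("http://us.f12.mail.yahoo.com/ym/ShowLetter?MsgId=1"))
```

The wildcard is looser than the regular expression: f* accepts any characters after the f, not only digits, which is usually acceptable here but worth knowing.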