Mike Shor: Web Log Customization Files
I use Analog for my log analysis. Over the years, I have developed a rather comprehensive list of search engines (over 10,000). I also maintain several other types of configuration files. These lists are updated roughly quarterly.
Last Update: 3 May 2007
Note: the ref-spam list has grown a bit beyond its original use, and now also covers quite a few other kinds of baddies (made-for-Google sites, phishing, etc.).
Currently, the files below contain:
- 6,940 SEARCHENGINE entries
- 8,344 REFEXCLUDE entries
- 461 TYPEALIAS entries
- 668 ROBOTINCLUDE entries
- SearchEngines.txt is the latest listing of search engines. While this was built as a search engine list for Analog, it is quite easy to modify for other log analysis programs.
- RefSpam.txt is a list of domains that are known to spam referrer lists. Again configured for Analog (REFEXCLUDE commands), it can serve as the basis of other blacklists or lists to prevent referrer spam. An alternate list, RefExcSpam.txt, uses REFREPEXCLUDE, REFSITEEXCLUDE, etc., commands to remove spam from reports but still count the "hits" (for whatever reason).
- TypeAlias.txt is a list of file type aliases, providing a description of many common file types.
- RobotInclude.txt is a list of known search engine robots and other robots.
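For anyone adapting these lists to another log analysis program, here is a minimal Python sketch of one way to do it. It assumes each entry has the form `SEARCHENGINE pattern arg1,arg2` or `REFEXCLUDE pattern`, with `*` and `?` as wildcards and `#` starting a comment. The sample lines and all function names (`parse_config`, `search_terms`, `is_spam`) are made up for illustration and are not copied from the actual files.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Illustrative lines in the style of SEARCHENGINE and REFEXCLUDE commands.
# They are hypothetical and NOT taken from SearchEngines.txt or RefSpam.txt.
SAMPLE = """\
# a comment line
SEARCHENGINE http://*google.*/* q,as_q
SEARCHENGINE http://*altavista.*/* q
REFEXCLUDE http://*.spam-example.invalid/*
"""


def wildcard_to_regex(pattern):
    """Compile a wildcard pattern ('*' = any run, '?' = any one character)."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts) + r"\Z", re.IGNORECASE)


def parse_config(text):
    """Return (engines, excludes) parsed from configuration text.

    engines  -- list of (compiled pattern, [query parameter names])
    excludes -- list of compiled patterns
    """
    engines, excludes = [], []
    for raw in text.splitlines():
        # Assumption: everything after '#' on a line is a comment.
        fields = raw.split("#", 1)[0].split()
        if len(fields) == 3 and fields[0].upper() == "SEARCHENGINE":
            engines.append((wildcard_to_regex(fields[1]), fields[2].split(",")))
        elif len(fields) == 2 and fields[0].upper() == "REFEXCLUDE":
            excludes.append(wildcard_to_regex(fields[1]))
    return engines, excludes


def search_terms(referrer, engines):
    """Return the search phrase in a referrer URL, or None if no engine matches."""
    query = parse_qs(urlsplit(referrer).query)
    for regex, params in engines:
        if regex.match(referrer):
            for name in params:
                if name in query:
                    return query[name][0]
    return None


def is_spam(referrer, excludes):
    """Return True if the referrer URL matches any excluded pattern."""
    return any(regex.match(referrer) for regex in excludes)


if __name__ == "__main__":
    engines, excludes = parse_config(SAMPLE)
    print(search_terms("http://www.google.com/search?q=log+analysis", engines))
    print(is_spam("http://www.spam-example.invalid/page", excludes))
```

The same `parse_config` call could be pointed at the contents of the downloaded files instead of `SAMPLE`; entries that do not fit the assumed layout are simply skipped rather than reported.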
To use these files, save them in your Analog directory and add the following lines to your analog.cfg file:
CONFIGFILE SearchEngines.txt
CONFIGFILE RobotInclude.txt
CONFIGFILE TypeAlias.txt
CONFIGFILE RefSpam.txt
Please send any corrections or additions to: analog AT mikeshor.com