SYS-CON Magazines
Tokenization: The Building Blocks of Spam
Heuristic components of a statistical spam filter
Oct. 3, 2005 03:00 AM
The word "FREE" also shows up in both the subject line and the message body but, in this case, both occurrences are strong indicators of spam. The filter still benefits here because the tokens "FREE" and "Subject*FREE" can now take up two slots in my decision matrix, further condemning the spam.

Header tokens are extremely useful for identifying both spam and legitimate mail. Other types of header tokens are frequently found to be useful as well, and the set of delimiters used in the headers is usually slightly different from the set used in the message body. For example, if I want to catch all of the IP addresses in the Received: headers, I would treat a period as a constituent character (part of the token) instead of a separator. If I wanted to tokenize the message-id, I'd also include the @ sign as a delimiter, since it separates some pieces of the message-id.

Another advantage of including the header name as part of the token is that it helps to create a virtual "whitelist" of users you trust. If I exchange a lot of correspondence with bobsmith@somedomain.com, tokens like "From*bobsmith" and "From*somedomain.com" will start to appear in the dataset, usually with very innocent values. This works equally well for identifying the hostnames of trusted mail servers in the Received: header.
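The header-prefixed tokens described above can be sketched roughly as follows. This is only a minimal illustration, not the code of any particular filter: the function name `tokenize_header`, the delimiter sets, and the sample header values are all assumptions chosen to match the examples in the text.

```python
import re

# Illustrative delimiter sets (assumptions; real filters tune these).
# The default set treats the period as a separator, as in the message body.
DEFAULT_DELIMS = " \t\r\n.,;:!?()<>[]@"

# For these headers the period is a constituent character, so IP addresses
# and hostnames survive as single tokens. The @ sign remains a delimiter.
HEADER_DELIMS = {
    "Received": " \t\r\n,;:()<>[]@",
    "From": " \t\r\n,;:()<>[]@",
    "Message-ID": " \t\r\n,;:()<>[]@",
}

def tokenize_header(name, value):
    """Split a header value and prefix every token with 'Name*'."""
    delims = HEADER_DELIMS.get(name, DEFAULT_DELIMS)
    pattern = "[" + re.escape(delims) + "]+"
    return [f"{name}*{tok}" for tok in re.split(pattern, value) if tok]

print(tokenize_header("From", "bobsmith@somedomain.com"))
# ['From*bobsmith', 'From*somedomain.com']
print(tokenize_header("Received", "from mail.example.com (192.0.2.10)"))
# ['Received*from', 'Received*mail.example.com', 'Received*192.0.2.10']
```

Because the prefix is part of the token, "Subject*FREE" and a body token "FREE" are counted separately in the dataset, which is what lets both occupy slots in the decision matrix.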
URL Optimizations
URLs are frequently tokenized differently than the rest of a message. The only delimiters usually used when tokenizing a URL are the slash, question mark, equal sign, period, and colon, although some filter authors perform the same basic type of token separation as they do in the rest of the message body. URL-specific delimiters are used because an individual token is more often recognized by its position in the URL's path than by any specific context inside the URL. Regardless of how they are tokenized, URLs can yield a lot of useful information when analyzed. They can be categorized as places you want to go and places you don't want to go. A spam containing places you don't want to go is just as informative as a legitimate message containing places you do.
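As a rough sketch of the URL-specific splitting described above, the snippet below breaks a URL on only the slash, question mark, equal sign, period, and colon, and prefixes each piece with "Url*". The function name `tokenize_url` and the sample URL are hypothetical and used purely for illustration.

```python
import re

# The five URL delimiters named in the text: / ? = . :
URL_DELIMS = re.compile(r"[/?=.:]+")

def tokenize_url(url):
    """Split a URL on URL-specific delimiters and prefix each piece."""
    return ["Url*" + part for part in URL_DELIMS.split(url) if part]

# Hypothetical spam URL, for illustration only.
print(tokenize_url("http://www.getitrightnowwholesale.com/img/offer?id=42"))
# ['Url*http', 'Url*www', 'Url*getitrightnowwholesale', 'Url*com',
#  'Url*img', 'Url*offer', 'Url*id', 'Url*42']
```

Characters such as hyphens and underscores are deliberately left inside tokens here, so a long hostname or path component stays intact as one token.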
Url*getitrightnowwholesale S: 00026 I: 00000 P: 0.9999
Ironically, legitimate-looking URLs seem to be rare in spam, while the wild and obnoxious names always pop up, with the exception of "java," of course, which appeared as spammy only because this user doesn't use Java (not because Java programmers were spamming). The appearance of certain naming conventions, such as the extensive use of "img," makes the task of identifying malicious URLs pretty easy. If we wanted to, we could probably determine the disposition of the message based on the URL information alone. Conversely, URLs containing well-known Web addresses are likely to appear as innocent tokens or as hapaxes. Not a single URL token containing the following words has ever appeared in my corpus as spammy:
HTML Tokenization