SYS-CON Magazines
Tokenization: The Building Blocks of Spam
Heuristic components of a statistical spam filter
Oct. 3, 2005 03:00 AM
Redundancy
Some authors treat punctuation as part of a token only if it appears at the end of the token. If an exclamation point appears elsewhere, it is treated as a delimiter in most cases. For the punctuation marks that are permitted, we should consider building some method of de-duplication into our tokenizer, so that only the first occurrence of the punctuation is used. We essentially treat "free!!!", "free!!!!!!!!!!", and "free!" as the same token by truncating the extra chaff.
I've found that using the exclamation point as a constituent character slightly improves accuracy, which is the opposite of the effect question marks appeared to have. This is probably because more spams adopt an obnoxiously loud, used-car-salesman tone than actually pose questions. Perhaps one day spammers will become more philosophical, and then question marks will become just as useful as exclamation points.
Some filters permit a certain window size before the token is truncated. For example, tokens may be allowed to keep up to three exclamation points, giving the filter three different meanings for "free!", "free!!", and the extremely guilty and shameless "free!!!". One advantage of doing this, other than measuring the three levels of unbridled fervor, is that a really obnoxious message that uses all three tokens fills up more slots in the decision matrix.
It's important to truncate extraneous characters at some level, because otherwise spammers could easily exploit the lack of truncation to hide very spammy tokens. For example, a spammer wanting to hide the word "porn" could send "porn!!!" in the first spam and "porn!?!?!" the next time, so that in both cases the filter would see a new token. Truncating reduces both of these tokens to "porn!", or even to "porn" if exclamation points are ignored altogether.
Tokens should generally be limited to a single acceptable punctuation mark at the end, or at most to an N-sized window of homogeneous punctuation.
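
The truncation rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tokenizer of any particular filter: the function name `normalize_token`, the `window` and `allowed` parameters, and the choice of "!" and "?" as the only trailing marks considered are all hypothetical.

```python
import re

# Assumption: only "!" and "?" are considered as trailing punctuation.
_TRAILING_PUNCT = re.compile(r"[!?]+$")

def normalize_token(token, window=1, allowed="!"):
    """Collapse a trailing punctuation run to at most `window` allowed marks.

    `allowed` lists the marks kept as constituent characters; any other
    trailing mark (here "?") is dropped entirely.
    """
    match = _TRAILING_PUNCT.search(token)
    if match is None:
        return token  # no trailing punctuation, nothing to truncate
    stem = token[:match.start()]
    # Keep only the allowed marks, then cap the run at the window size.
    kept = "".join(ch for ch in match.group() if ch in allowed)[:window]
    return stem + kept

print(normalize_token("free!!!!!!!!!!"))        # free!
print(normalize_token("porn!?!?!"))             # porn!
print(normalize_token("free!!!!!", window=3))   # free!!!
print(normalize_token("porn!?!?!", allowed="")) # porn
```

With `window=1`, every variant collapses to one token; with `window=3`, "free!", "free!!", and "free!!!" stay distinct while longer runs are capped at three marks.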
Other Delimiters
Exceptions
The tokenizing algorithm should be generic enough that it can easily break down any kind of natural language or new message style, but not so plain vanilla that the features it generates are likely to appear in all e-mail. It would be relatively easy to tokenize a message into individual characters, but that wouldn't be very useful, since the token "v" could occur in "viagra" or in "violin". All-numeric tokens are generally not very useful on their own, but combining them with the proper punctuation (such as a dollar sign or exclamation mark) can draw a significant distinction between "19" and "$19" or between "95" and "95!". Provide enough information to set the token apart from the rest, but not so much that it is likely to show up only a handful of times.
To some degree, this anal-retentive exercise is overrated. Any reasonable level of tokenization will most likely yield accuracy above 99 percent, but a mistake could cost a few misclassifications on occasion. In my tests, using the question mark as a constituent character resulted in approximately three additional errors per 5,000. Experimentation and thorough testing are among the best ways to decide which tokenization approach works best for the filter.
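
As a rough illustration of constituent characters, the sketch below keeps a leading dollar sign and a single trailing exclamation point attached to a token and treats everything else as a delimiter. The pattern and the `tokenize` name are assumptions made for this example; a real filter would also need rules for headers, URLs, and other message parts.

```python
import re

# Assumption: "$" may lead a token and one "!" may end it; all other
# non-alphanumeric characters act as delimiters.
TOKEN = re.compile(r"\$?[A-Za-z0-9]+!?")

def tokenize(text):
    """Split text into lowercase tokens, keeping "$" and "!" as constituents."""
    return [t.lower() for t in TOKEN.findall(text)]

print(tokenize("Only $19 for 19 days, save 95!"))
# ['only', '$19', 'for', '19', 'days', 'save', '95!']
```

Here "$19" and "19" end up as separate features, as do "95!" and a bare "95", so the filter can learn a different spam probability for each.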
Token Reassembly
C/A/L/L/ N-O-W - I/T/S F_R_E_E
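
One plausible way to reassemble a sample like the one above is to find runs of single letters joined by a separator character and strip the separators. The sketch below is a guess at the general idea, not a method taken from the article: the separator set (`/`, `_`, `.`, `-`), the three-letter minimum, and the `reassemble` name are all assumptions, and the rule will also merge legitimate abbreviations such as "U.S.A.".

```python
import re

# Assumption: a run is three or more single letters joined by one of
# "/", "_", "." or "-", optionally followed by a dangling separator.
_SPLIT_WORD = re.compile(
    r"(?<![A-Za-z0-9])[A-Za-z](?:[/_.\-][A-Za-z]){2,}(?![A-Za-z0-9])[/_.\-]?"
)

def reassemble(text):
    """Rejoin letters that were split apart by separator characters."""
    def strip_separators(match):
        return re.sub(r"[/_.\-]", "", match.group())
    return _SPLIT_WORD.sub(strip_separators, text)

print(reassemble("C/A/L/L/ N-O-W - I/T/S F_R_E_E"))
# CALL NOW - ITS FREE
```

The lookarounds keep the pattern from firing inside ordinary hyphenated words, since each run must begin and end on a letter that is not adjacent to another letter or digit.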