SYS-CON Magazines
Tokenization: The Building Blocks of Spam
Heuristic components of a statistical spam filter
Oct. 3, 2005 03:00 AM
A few filters, such as CRM114, perform this type of word skipping. They tokenize something like "manh+<!rescind>+ood" and also help the filter "see" the original token by skipping the embedded word: "manh+ood." Since tokenization is an imperfect process, approaches like this generally give the filter more machine-readable tokens to work with, without requiring much extra work. The more permutations of machine-readable tokens are created, however, the larger and more spread out the dataset becomes, possibly affecting accuracy. The amount of data generated by SBPH generally leads filter authors to pass it over in favor of simpler functions such as HTML comment filtering.
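The two approaches contrasted above can be sketched in a few lines of Python. This is only an illustration, not CRM114's actual implementation: the function names, the "+" joiner, and the regular expressions are assumptions made for the example. `strip_html_comments` stands in for simple HTML comment filtering, while `skip_tokens` emits both the full chained token and a variant that skips each interior piece.

```python
import re

# Assumption: anything of the form <!...> is treated as comment-like markup.
COMMENT_RE = re.compile(r"<!--.*?-->|<![^>]*>", re.DOTALL)
MARKUP_SPLIT_RE = re.compile(r"(<![^>]*>)")


def strip_html_comments(text):
    """Simple HTML comment filtering: drop the markup, rejoin the word."""
    return COMMENT_RE.sub("", text)


def split_on_markup(text):
    """Split text into pieces, keeping the embedded markup as its own piece."""
    return [piece for piece in MARKUP_SPLIT_RE.split(text) if piece]


def skip_tokens(pieces, joiner="+"):
    """Return the full chained token plus variants that skip one interior piece."""
    tokens = [joiner.join(pieces)]
    for i in range(1, len(pieces) - 1):
        tokens.append(joiner.join(pieces[:i] + pieces[i + 1:]))
    return tokens


if __name__ == "__main__":
    raw = "manh<!rescind>ood"
    print(strip_html_comments(raw))
    print(skip_tokens(split_on_markup(raw)))
```

Even in this toy version, every interior piece adds one more token to the output, which hints at why skipping schemes enlarge the dataset while comment filtering produces a single clean token.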
Internationalization
Some filters implement internationalization (i18n), which lets them support additional languages. To make matters more complicated, however, some languages don't use white space, making it very difficult to identify words at all. This commonly calls for more advanced solutions such as variable-length n-grams.
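As a rough sketch of the idea, variable-length n-grams can be built by sliding windows of several sizes over the raw characters, so no word boundaries are needed. The article does not say how any particular filter builds its n-grams; the function name `char_ngrams`, the character-level windows, and the default lengths below are assumptions chosen for illustration.

```python
def char_ngrams(text, min_n=1, max_n=3):
    """Return every character n-gram of length min_n..max_n, shortest first."""
    tokens = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the text, one character at a time.
        for start in range(len(text) - n + 1):
            tokens.append(text[start:start + n])
    return tokens


if __name__ == "__main__":
    # A Japanese phrase written without spaces between words.
    for token in char_ngrams("迷惑メール", min_n=2, max_n=3):
        print(token)
```

The trade-off is the same one that applies to word skipping: allowing more window lengths yields more tokens per message and therefore a larger dataset.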
Final Thoughts
This article is an excerpt from Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification.