Tokenization: The Building Blocks of Spam
Heuristic components of a statistical spam filter

A few filters, such as CRM114, perform this type of word skipping. They tokenize something like "manh+<!rescind>+ood" and also help the filter "see" the original token by skipping over the inserted piece: "manh+ood." Because tokenization is an imperfect process, approaches like this give the filter more machine-readable tokens to work with, without necessarily requiring much effort. The more permutations of machine-readable tokens are created, however, the larger and more spread out the dataset becomes, which can hurt accuracy. The volume of data that SBPH generates leads many filter authors to pass it over in favor of simpler functions such as HTML comment filtering.
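As a rough illustration of the simpler alternative mentioned above, HTML comment filtering can be sketched in a few lines of Python. This is not how CRM114 or any other particular filter implements it; the function names, the regular expression, and the set of characters treated as part of a token are all assumptions made for the example.

```python
import re
from typing import List

# Assumption: treat both well-formed comments (<!-- ... -->) and bare
# declarations such as <!rescind> as filler to be removed.
_COMMENT_RE = re.compile(r"<!--.*?-->|<![^>]*>", re.DOTALL)


def strip_html_comments(text: str) -> str:
    """Remove comment-like markup so that split words are rejoined."""
    return _COMMENT_RE.sub("", text)


def tokenize(text: str) -> List[str]:
    """Strip comments, then split into simple word tokens.

    The token character set here is a hypothetical choice, not a standard.
    """
    return re.findall(r"[A-Za-z0-9$'-]+", strip_html_comments(text))
```

With these definitions, `strip_html_comments("manh<!rescind>ood")` yields `"manhood"`, so the filter sees one familiar token and the dataset does not grow with extra permutations.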

Internationalization
The tokenization methods discussed thus far have covered only standard character sets. The issue of foreign languages will eventually require a solution. Most spam filters simply replace wide characters with a placeholder, such as the letter "z" or an asterisk. This lets the filter catch just about any message written in a wide character set. Some users, however, expect to receive e-mail written in such a language, and for them this approach won't work well at all: the filter can classify only on header data, because the rest of the body will look (to the filter) like this:

ZZZZZ,

ZZ ZZZZ ZZZ ZZZZZZZ ZZZ ZZZ Z ZZZZZZ Z ZZZZZZ ZZZZ Z ZZZ ZZZZ
ZZZZZZZ ZZ ZZZZZZ ZZ ZZZ ZZZZZZZ

ZZ,
ZZZZZZZZ
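The placeholder substitution that produces a body like the one above can be sketched as follows. This is a minimal sketch under a stated assumption: "wide character" is approximated as any code point outside 7-bit ASCII, and `mask_wide_chars` is a hypothetical name rather than a function from any real filter.

```python
def mask_wide_chars(text: str, placeholder: str = "Z") -> str:
    """Replace every non-ASCII character with a placeholder.

    Assumption: any code point above 127 counts as a "wide" character.
    ASCII punctuation and white space pass through unchanged.
    """
    return "".join(placeholder if ord(ch) > 127 else ch for ch in text)
```

For example, `mask_wide_chars("Привет, мир")` gives `"ZZZZZZ, ZZZ"`. Every message in that language collapses into the same few runs of placeholders, which is why the body carries almost no information for a user who legitimately receives such mail.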

Some filters implement internationalization (i18n) support, which lets them handle some additional languages. To make matters more complicated, however, some languages don't use white space, which makes it very difficult to identify words at all. This commonly calls for more advanced solutions such as variable-length n-grams.
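One way to picture variable-length n-grams is as overlapping character windows of several sizes, which need no word boundaries at all. The sketch below is illustrative only: the function name, the default lengths, and the decision to drop white space are assumptions, not the behavior of any specific filter.

```python
from typing import List


def variable_length_ngrams(text: str, min_n: int = 1, max_n: int = 3) -> List[str]:
    """Return all character n-grams of length min_n through max_n.

    White space is discarded first, so the function behaves the same on
    languages that do not separate words with spaces.
    """
    if min_n < 1 or max_n < min_n:
        raise ValueError("require 1 <= min_n <= max_n")
    chars = [ch for ch in text if not ch.isspace()]
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the text, one character at a time.
        for i in range(len(chars) - n + 1):
            grams.append("".join(chars[i:i + n]))
    return grams
```

Calling `variable_length_ngrams("abc", 1, 2)` returns `["a", "b", "c", "ab", "bc"]`. The same trade-off noted earlier applies here: wider ranges produce more tokens per message and a larger, more spread-out dataset.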

Final Thoughts
We've run the gamut of approaches to tokenizing in this article. Tokenizing strives to define content by defining how it is constructed and, more important, what its root components are. This is a noble quest but, as in other areas of machine learning, it is a function that may eventually be better left to the computer. As new types of neural decision-making algorithms surface, the analysis of unformatted text may become one of the next forms of AI. Until this happens, tokenizing remains one of the few heuristic components of a statistical spam filter. It should therefore be respected and kept somewhat simple, so as not to require any maintenance in the years to come.

This article is an excerpt from Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification.
Printed with permission from No Starch Press. Copyright 2005.

About Jonathan A. Zdziarski
Jonathan A. Zdziarski has been fighting spam for eight years, and has spent a significant portion of the past two years working on the next-generation spam filter DSPAM, with up to 99.985% accuracy. Zdziarski lectures widely on the topic of spam.
