SYS-CON Magazines
Tokenization: The Building Blocks of Spam
Heuristic components of a statistical spam filter
Oct. 3, 2005 03:00 AM
It is probably better to use an exclusionary list rather than an inclusionary one. With either approach, you're likely to miss a few tags, or to fail to name tags you never thought could be used in spam (for example, the object tag has recently become popular). With an exclusionary list, the worst that happens is that the tag sits and collects dust in the dataset with some neutral value, or fills up a decision matrix slot in error. If you fail to add a tag to an inclusive list, though, you're bound to ignore an important data point and may not even realize it.
Some of the HTML tags commonly used by spammers (which a filter should definitely be looking at) include the following:
Some filters like to mark the tokens generated from HTML tags with an "HTML" identifier, while others go so far as to mark the particular tag the text belonged to (for example, "BODY:BGCOLOR=#FFFFFF").
Regardless of which tags the filter decides to keep and which it discards, it's very important to handle HTML comments correctly. Spammers use many tricks to obfuscate their text so that it's human readable but not very machine readable. For example, the following may look like a complete mess in its machine-readable format:
Received: from 64.202.131.2 (h0007e9075130.ne.client2.attbi.com
But when the user clicks the message to read it, the HTML comments won't be visible and the user will see this:
Yes you heard about these weird little pills that are supposed to make you bigger and of course you think they're bogus snake potion. Well, let's look at the facts: GRX2 has been sold over 1.9 Million times within the last 18 months... With awesome results for hundreds of thousands of men all over the planet! They all enjoy a seriously enhanced version of their manhood and why shouldn't you?
A simple way to ensure that the message is tokenized correctly is to remove the HTML comments and reassemble the message.
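
The comment-removal step might look something like the following Python sketch. It is only an illustration: the function names, the whitespace tokenizer, and the obfuscated sample string are assumptions made for this example and are not taken from any particular filter.

```python
import re

# Matches an HTML comment, including one that spans several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_html_comments(text):
    """Delete HTML comments so that words split by them are rejoined."""
    return HTML_COMMENT.sub("", text)

def tokenize(text):
    """Naive whitespace tokenizer, standing in for a real filter's tokenizer."""
    return strip_html_comments(text).split()

# Hypothetical obfuscated fragment: comments are wedged inside words.
obfuscated = (
    "these weird little pi<!-- 8f3k -->lls that are "
    "sup<!--\nzz -->posed to make you bigger"
)

print(tokenize(obfuscated))
```

Because the comments are deleted rather than replaced with a space, the two halves of each split word are rejoined before tokenization, so the filter sees the same words the reader does.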
Word Pairs
Sparse Binary Polynomial Hashing
"Another project I heard about . . . was Bill Yerazunis' CRM114. This is the counterexample to the design principle I just mentioned. It's a straight text classifier, but such a stunningly effective one that it manages to filter spam almost perfectly without even knowing that's what it's doing."
SBPH tokenizes entire phrases, up to five tokens across, and allows for word skipping in between. It led the way in terms of accuracy for a long period of time, but it also created an enormous amount of data, which is one of the reasons it presently functions only in a train-on-error environment. SBPH provides the benefit of using the simplest, most colloquial tokens but giving special notice to more complex tokens as well, which are usually much stronger indicators of spam when they appear.
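
To make the idea of phrases "up to five tokens across" with word skipping more concrete, here is a simplified Python sketch. It is not CRM114's code: the `<skip>` placeholder, the function name, and the choice to emit readable strings rather than hashes are all assumptions made for illustration.

```python
from itertools import product

WINDOW = 5        # phrases span up to five tokens, as described above
SKIP = "<skip>"   # hypothetical placeholder marking a skipped word

def sbph_features(tokens, window=WINDOW):
    """Return phrase features: each token combined with every keep/skip
    pattern of the (up to) window - 1 tokens that precede it."""
    features = []
    for i, current in enumerate(tokens):
        previous = tokens[max(0, i - window + 1):i]
        for mask in product((False, True), repeat=len(previous)):
            kept = [tok if keep else SKIP for tok, keep in zip(previous, mask)]
            # Drop leading placeholders so the shortest feature is the bare word.
            while kept and kept[0] == SKIP:
                kept.pop(0)
            features.append(" ".join(kept + [current]))
    return features

for feature in sbph_features(["a", "b", "c"]):
    print(feature)
```

Under these assumptions, every token with four predecessors produces 16 features (2^4 keep/skip patterns), which gives a rough sense of why the approach generates so much data.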