Tokenization: The Building Blocks of Spam
Heuristic components of a statistical spam filter

If the tokenizer we're using considers underscores, dashes, and slashes to be token delimiters, then instead of ending up with four one-word tokens, we'll end up with 14 single-character tokens. Many filter authors believe it's healthy to allow these individual characters to tokenize, while others believe that the resulting information is too generalized to be a good indicator of anything, at least without the risk of false positives.
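
To make the effect of the delimiter set concrete, here is a minimal sketch in Python. The `tokenize` helper and the sample line are hypothetical and chosen only for illustration; they are not taken from any particular filter.

```python
import re

def tokenize(text, delimiters=" "):
    """Split text on any character in `delimiters`, dropping empty tokens."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]

# Hypothetical obfuscated line, used only for illustration.
line = "F_R_E_E V-I-A-G-R-A"

whitespace_only = tokenize(line)          # only spaces delimit
aggressive = tokenize(line, " _-/")       # underscores, dashes, slashes also delimit

print(whitespace_only)
print(aggressive)
```

With spaces as the only delimiter, the obfuscated words survive as two odd-looking tokens; with the wider delimiter set, the same line breaks into single characters, which is the situation the two camps of filter authors disagree about.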

Filter authors who share the latter philosophy can use token reassembly to join the original tokens back together. Token reassembly isn't a perfect science, but it provides more useful tokens to work with. The tokens "VIA" and "GRA" are much more useful than individual characters and are far more indicative of spam. Token reassembly concatenates single-character tokens that are adjacent to one another, looking for larger amounts of white space amid the slicing and dicing to make an educated guess about which characters belong together.

Since statistical filtering involves machine learning and not human learning, tokens like these are very useful to the computer, even though they may not make much sense to us. The token "VIA" doesn't mean much on its own, which is exactly why it makes a great indicator of spam - you'd rarely see "VIA" in a legitimate message unless you were talking about motherboards. The token "GRA" is even rarer in legitimate mail. The fact that these tokens aren't necessarily comprehensible to a human is what makes them easy to identify in spam. My dataset considers some of these fractional words to be extreme indicators of spam:

Agra S: 00030 I: 00000 P: 0.9999
Eacute S: 00021 I: 00000 P: 0.9999
Prematur S: 00020 I: 00000 P: 0.9999
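
The reassembly heuristic described above can be sketched roughly as follows. This is one possible reading of it, not the method of any specific filter: the delimiter set, the `reassemble` name, and the rule that more than one whitespace character counts as a word boundary are all assumptions.

```python
import re

# Assumed delimiter set: whitespace, underscore, dash, slash.
DELIMS = r"[\s_\-/]"

def reassemble(text, max_gap_spaces=1):
    """Join adjacent single-character tokens into larger tokens.

    A run of single characters is broken when the gap before the next
    token holds more than `max_gap_spaces` whitespace characters (our
    guess at a word boundary) or when a multi-character token appears.
    """
    # A capturing group makes re.split keep the delimiter runs, so the
    # pieces alternate: token, gap, token, gap, ..., token.
    pieces = re.split("(" + DELIMS + "+)", text)
    tokens, run = [], ""
    for i in range(0, len(pieces), 2):
        tok = pieces[i]
        if not tok:
            continue  # empty piece from leading/trailing delimiters
        gap = pieces[i - 1] if i > 0 else ""
        wide_gap = sum(ch.isspace() for ch in gap) > max_gap_spaces
        if len(tok) == 1 and run and not wide_gap:
            run += tok  # extend the current run of single characters
        else:
            if run:
                tokens.append(run)
                run = ""
            if len(tok) == 1:
                run = tok  # start a new run
            else:
                tokens.append(tok)
    if run:
        tokens.append(run)
    return tokens

print(reassemble("V I A   G R A"))
print(reassemble("cheap V-I-A-G-R-A now"))
```

A side effect of this simple rule is that genuine one-letter words such as "a" or "I" get glued to neighboring single characters, which is part of why reassembly is an educated guess rather than an exact science.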

Degeneration
Another solution Graham introduced into tokenization is called degeneration. Degeneration allows a token that hasn't been seen before to be reduced in complexity (location, case, and punctuation) until it matches a simpler token that has. For example, consider the use of the word "FREE!!!" in the subject. If it has never been seen before in the subject, degeneration has us reduce the phrase, trying each of the following forms in order, until it matches something we have seen before.

Subject*Free!!!
Subject*free!!!
Subject*FREE!
Subject*Free!
Subject*free!
Subject*FREE
Subject*Free
Subject*free
FREE!!!
Free!!!
free!!!
FREE!
Free!
free!
FREE
Free
free

Degeneration has a lot of room for customization, including the order in which the tokens decrease in complexity. At the very least, degeneration of punctuation is a wise move. If the word "free!" doesn't exist in the dataset yet, it makes good sense to use the value from a similar token.
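
As a rough illustration, the sketch below generates degenerate forms following the ordering of the listing above (header context first, then punctuation, then case) and returns the first form found in the dataset. The function names, the restriction to trailing exclamation marks, and the fallback value of 0.4 for completely unknown tokens are assumptions made for the example.

```python
def degenerate(token):
    """Yield progressively simpler forms of `token`, most specific first."""
    # "Subject*FREE!!!" -> prefix "Subject", sep "*", word "FREE!!!"
    prefix, sep, word = token.rpartition("*")
    core = word.rstrip("!")
    bangs = len(word) - len(core)
    # Punctuation levels: all trailing "!", then a single "!", then none.
    punct_levels = []
    for p in ("!" * bangs, "!" if bangs else "", ""):
        if p not in punct_levels:
            punct_levels.append(p)
    # Try the header-qualified forms before the bare forms.
    contexts = [prefix + sep, ""] if sep else [""]
    seen = {token}
    for ctx in contexts:
        for p in punct_levels:
            for cased in (core, core.capitalize(), core.lower()):
                cand = ctx + cased + p
                if cand not in seen:
                    seen.add(cand)
                    yield cand

def lookup(token, dataset, default=0.4):
    """Return (matched_form, value) for the first known form of `token`.

    `default` is an assumed neutral value for tokens with no known form.
    """
    for cand in [token, *degenerate(token)]:
        if cand in dataset:
            return cand, dataset[cand]
    return None, default

forms = list(degenerate("Subject*FREE!!!"))
print(forms)
print(lookup("Subject*FREE!!!", {"free!": 0.99}))
```

Changing the nesting of the three loops changes which simplification is tried first, which is one of the customization points mentioned above.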

Header Optimizations
Most filter authors agree that a token in the subject header is very different from the same token in the message body, and that a token appearing in two different headers is distinct enough in each to warrant tracking separately. Header tokens are therefore usually processed differently from body tokens so that the origin of each token is preserved. Let's look at an example of an e-mail with a lot of useful header information.

From: bazz@xum2.xumx.com
To: bazz@xum2.xumx.com
Reply-To: mort239o@xum2.xumx.com
Subject: ADV: FREE Mortgage Rate Quote - Save THOUSANDS! kplxl
X-Keywords: Save thousands by refinancing now. Apply from the privacy of your home and receive a FREE no-obligation loan quote.
http://211.78.96.11/acct/morquote/

Rates are Down. YOU Win!
Self-Employed or Poor Credit is OK!
Get CASH out or money for Home Improvements, Debt Consolidation and more. Interest rates are at the lowest point in years-right now! This is the perfect time for you to get a FREE quote and find out how much you can save!

In the spam shown here, several different tokens stand out. First, if my e-mail address happened to be bazz@xum2.xumx.com, I wouldn't expect to be seeing it in the From: header, but it would be very normal in the To: header. Seeing my own e-mail address in the From: header would be a clear indicator of spam, since most people don't usually send e-mail to themselves unless they've had too much to drink.

Second, the word "Save" appears in both the subject line and the message body. I would expect to see it in the message body more frequently in legitimate mail - for example, "Save your files in the blue folder" or "Save me from this dreaded cubicle." Seeing the word "Save" in the subject header is much more suspicious, though, and it makes sense for me to have a different entry in the dataset for each of them.
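
One simple way to keep separate dataset entries is to prefix each header token with the name of its header, in the same `Subject*free` style used in the degeneration listing. The sketch below uses Python's standard `email` module; the set of tracked headers, the punctuation-stripping rule, and the shortened sample message are assumptions for illustration.

```python
from email import message_from_string

# Assumed set of headers worth tracking separately.
TRACKED_HEADERS = ("From", "To", "Reply-To", "Subject")

def words(text):
    """Split on whitespace and strip simple leading/trailing punctuation."""
    stripped = (w.strip(".,:;-") for w in text.split())
    return [w for w in stripped if w]

def tokenize_message(raw):
    """Return header tokens as 'Header*token' and body tokens unprefixed."""
    msg = message_from_string(raw)
    tokens = []
    for name in TRACKED_HEADERS:
        for value in msg.get_all(name, []):
            tokens += [f"{name}*{w}" for w in words(value)]
    # Assumes a simple, non-multipart message whose payload is a string.
    tokens += words(msg.get_payload())
    return tokens

# Shortened version of the example message, for illustration only.
raw = (
    "From: bazz@xum2.xumx.com\n"
    "To: bazz@xum2.xumx.com\n"
    "Subject: ADV: FREE Mortgage Rate Quote - Save THOUSANDS!\n"
    "\n"
    "Save thousands by refinancing now.\n"
)

tokens = tokenize_message(raw)
print(tokens)
```

Here "Subject*Save" and "Save" become distinct tokens, and the recipient's own address shows up as both a "From*" token and a "To*" token, so each can accumulate its own spam and legitimate counts.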

About Jonathan A. Zdziarski
Jonathan A. Zdziarski has been fighting spam for eight years, and has spent a significant portion of the past two years working on the next generation spam filter DSPAM, with up to 99.985% accuracy. Zdziarski lectures widely on the topic of spam.
