Tokenization: The Building Blocks of Spam
Heuristic components of a statistical spam filter

Redundancy
Some types of punctuation are very useful; for example, the exclamation point makes a remarkable difference between "free" and "free!" and so you want to use some punctuation marks as constituent characters. One of the problems a filter author might run into when allowing these types of characters, however, is redundancy. Most would agree that there's no real difference between "free!" and "free!!!!" in a message, as both are equally condemning characteristics of spam. On the other hand, messages in which symbols are used to b!r!e!a!k up a word may behave a bit differently.

Some authors will view punctuation as part of a token only if it appears at the end of the token. If an exclamation point appears elsewhere, it will be treated as a delimiter in most cases. For those punctuation marks that are permitted, we should consider working some method of de-duplication into our tokenizer, where only the first occurrence of the punctuation is used. We essentially look at "free!!!", "free!!!!!!!!!!", and "free!" as the same token by truncating the extra chaff. I've found that using the exclamation point as a constituent character slightly improves accuracy, which is the opposite of the effect that question marks appeared to have. This is probably because more spams use an obnoxiously loud used-car-salesman type of pretense rather than actually posing questions. Perhaps one day, spammers will become more philosophical, and then question marks will become just as useful as exclamation points.

Some filters permit a certain window size before the token is truncated; for example, tokens may be allowed to have up to three exclamation points before being truncated, giving the filter three different meanings for "free!", "free!!", and the extremely guilty and shameless "free!!!" One of the advantages to doing this, other than measuring the three levels of unbridled fervor, is that it allows a really obnoxious message that uses all three tokens to fill up more slots in the decision matrix.
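A minimal sketch of this kind of truncation might look like the following. The function name truncate_trailing and its mark and window parameters are hypothetical and chosen only for illustration; a window of 1 gives the plain de-duplication described earlier, while a window of 3 keeps "free!", "free!!", and "free!!!" as separate tokens.

```python
def truncate_trailing(token, mark="!", window=1):
    """Collapse a trailing run of `mark` to at most `window` repetitions.

    Hypothetical helper for illustration; not taken from any real filter.
    """
    stripped = token.rstrip(mark)       # the token without its trailing marks
    run = len(token) - len(stripped)    # how many trailing marks there were
    return stripped + mark * min(run, window)


# De-duplication: every variant collapses to the same token.
print(truncate_trailing("free!!!!!!!!!!"))        # free!
# A window of three preserves up to three levels of fervor.
print(truncate_trailing("free!!", window=3))      # free!!
print(truncate_trailing("free!!!!!!", window=3))  # free!!!
```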

It's important to truncate extraneous characters at some level, because otherwise spammers could easily use them to hide very spammy tokens; for example, a spammer wanting to hide the word "porn" could send "porn!!!" in the first spam and "porn!?!?!" the next time, so that in both cases the token would be considered new. Truncating will reduce both of these tokens to "porn!" or even "porn" if exclamation points are ignored altogether. Tokens should generally be limited to only one acceptable punctuation mark at the end, or to an N-sized window of homogeneous punctuation marks at the most.
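One way to handle mixed trailing punctuation is sketched below, under some assumptions: only "!" and "?" are treated as trailing punctuation, only the exclamation point is an acceptable constituent character, and the names normalize_tail, keep, and window are made up for this example. The tail is cut back to its leading homogeneous run of the accepted mark, and that run is capped at the window size.

```python
import re

# Assumption: only "!" and "?" are considered as trailing punctuation here.
TRAILING_PUNCT = re.compile(r"[!?]+\Z")


def normalize_tail(token, keep="!", window=1):
    """Reduce a mixed punctuation tail to at most `window` copies of `keep`."""
    match = TRAILING_PUNCT.search(token)
    if match is None:
        return token                      # no trailing punctuation at all
    word = token[:match.start()]
    tail = match.group()
    # Length of the homogeneous run of `keep` at the start of the tail.
    run = len(tail) - len(tail.lstrip(keep))
    return word + keep * min(run, window)


print(normalize_tail("porn!!!"))            # porn!
print(normalize_tail("porn!?!?!"))          # porn!
print(normalize_tail("porn!!!", window=0))  # porn
```

Setting the window to 0 corresponds to ignoring exclamation points altogether.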

Other Delimiters
Other delimiters used by many applications include the following:

  • brackets [ ]
  • braces { }
  • parentheses ( )
  • mathematical operators + - / * = < >
  • special characters | & ~ `
  • the at (@) sign
  • underscores and other rare characters
These delimiters frequently prevent the duplication of several different permutations of tokens, such as "when" and "(when". Other characters, such as the new line character, are also treated as delimiters. The nice thing about the way text is delimited is that it's going to result in unique tokens, even if the tokenization isn't perfect. This can be good or bad, but most of the time it's good. Even a token that isn't in human-readable format may be machine-readable and may occur with enough frequency to be a good identifier. In fact, Bayesian antivirus filtering uses an entirely different set of delimiters, because antivirus analysis involves the cataloging and analysis of several different binary sequences.
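As a rough illustration, the delimiters listed above can be collected into a single regular expression and used to split text. This is only a sketch: the names DELIMITERS and tokenize are hypothetical, whitespace (including the new line character) is added as a delimiter, and the "other rare characters" from the list are not covered.

```python
import re

# Brackets, braces, parentheses, math operators, | & ~ `, the at sign,
# underscores, and whitespace (which covers the new line character).
DELIMITERS = re.compile(r"[\s\[\]{}()+\-/*=<>|&~`@_]+")


def tokenize(text):
    """Split text on the delimiter set, discarding empty pieces."""
    return [tok for tok in DELIMITERS.split(text) if tok]


# "(when" and "when" now yield the same token.
print(tokenize("(when you call) when"))  # ['when', 'you', 'call', 'when']
```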

Exceptions
Some exceptions to the basic delimiters we've mentioned involve one-off instances where we actually want to preserve certain complete tokens. For example, IP addresses make for good spam markers, as do certain HTML character entities like &copy; and &nbsp;. If you're reading this book, there is most likely no shortage of spam in your inbox (or quarantine). Often the best way to discover new approaches to tokenization is to take a look at some of the text spammers are using in their samples. It's very important that the tokenizing approaches being used aren't biased against present-day spam.
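One possible way to carve out such exceptions is to try the special patterns before falling back to ordinary delimiter-based splitting. The sketch below assumes that the period, comma, colon, and semicolon are also delimiters (otherwise an IP address would survive on its own), and the patterns and the name tokenize_with_exceptions are illustrative rather than taken from any particular filter.

```python
import re

TOKEN = re.compile(
    r"\d{1,3}(?:\.\d{1,3}){3}"               # dotted-quad IP, e.g. 192.0.2.1
    r"|&#?\w+;"                              # HTML entity, e.g. &copy; or &#169;
    r"|[^\s\[\]{}()+\-/*=<>|&~`@_.,;:]+"     # otherwise: a run of non-delimiters
)


def tokenize_with_exceptions(text):
    """Return tokens, keeping IP addresses and HTML entities whole."""
    return TOKEN.findall(text)


print(tokenize_with_exceptions("Visit 192.0.2.1 &copy; 2004"))
# ['Visit', '192.0.2.1', '&copy;', '2004']
```

Because the alternatives are tried in order, the exceptions win wherever they match, and everything else is split as usual.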

The tokenizing algorithm should be generic in such a way that it can easily break down any kind of natural language or new type of message style, but it shouldn't be so plain vanilla that the features it generates are likely to appear as common in all e-mail. It would be relatively easy to tokenize a message into individual characters, but that wouldn't be very useful, since the token "v" could occur in "viagra" or "violin". All-numeric tokens are generally not very useful on their own, but when combined with the proper punctuation (such as a dollar sign or exclamation mark) can make a significant distinction between "19" and "$19" or between "95" and "95!". Provide enough information to allow the token to be set apart from the rest, but not so much that it is likely to show up only a handful of times.
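If bare numbers do turn out to add little in testing, one simple policy is to discard them while keeping numbers that carry punctuation. Dropping them outright is an assumption made for this sketch, not a rule stated above, and the function name drop_bare_numbers is hypothetical.

```python
def drop_bare_numbers(tokens):
    """Discard tokens made only of ASCII digits; keep e.g. "$19" and "95!"."""
    return [t for t in tokens if not (t.isascii() and t.isdigit())]


print(drop_bare_numbers(["19", "$19", "95", "95!", "viagra"]))
# ['$19', '95!', 'viagra']
```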

To some degree, this anal-retentive exercise is overrated. Any reasonable level of tokenization will most likely yield levels of accuracy above 99 percent, but making a mistake could cost a few misclassifications on occasion. I've found that using the question mark as a constituent character in my tests resulted in approximately three additional errors per 5,000. Experimentation and thorough testing are among the best ways to decide on the tokenization approach that works best for the filter.

Token Reassembly
Occasionally, tokens will turn out to be a little too small due to attempts by spammers to obfuscate them. When this happens, reassembling individual letters into a token can help improve accuracy. Let's look at an example of obfuscated text:

C/A/L/L/ N-O-W - I/T/S F_R_E_E
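A minimal sketch of one way to reassemble text like this appears below. It is an assumption-laden illustration rather than the method any particular filter uses: it treats every non-alphanumeric character as a separator, and joins a whitespace-delimited chunk only when every piece between separators is a single character. The names SEPARATORS and reassemble are hypothetical.

```python
import re

# Assumption: anything that is not an ASCII letter or digit is a separator.
SEPARATORS = re.compile(r"[^A-Za-z0-9]+")


def reassemble(text):
    """Join chunks made of single characters split by separators."""
    words = []
    for chunk in text.split():
        pieces = [p for p in SEPARATORS.split(chunk) if p]
        if len(pieces) > 1 and all(len(p) == 1 for p in pieces):
            words.append("".join(pieces))  # e.g. "F_R_E_E" becomes "FREE"
        elif pieces:
            words.append(chunk)            # leave ordinary chunks untouched
        # chunks with no letters or digits (a lone "-") are dropped
    return words


print(reassemble("C/A/L/L/ N-O-W - I/T/S F_R_E_E"))
# ['CALL', 'NOW', 'ITS', 'FREE']
```

A rule this simple will also join legitimate short strings such as "a-b", and it does not handle letters separated by spaces, so it would need testing against real mail before being trusted.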

About Jonathan A. Zdziarski
Jonathan A. Zdziarski has been fighting spam for eight years, and has spent a significant portion of the past two years working on the next-generation spam filter DSPAM, with up to 99.985% accuracy. Zdziarski lectures widely on the topic of spam.
