Tokenization: The Building Blocks of Spam
Heuristic components of a statistical spam filter

It is probably better to use an exclusionary list than an inclusionary one. With either kind of list, you're likely to miss a few tags or to leave out tags you never thought could be used in spam (for example, the object tag has recently become popular). If a tag is missing from an exclusionary list, at worst it will sit and collect dust in the dataset with some neutral value or will fill up a decision matrix slot in error. If you fail to add a tag to an inclusionary list, though, you're bound to ignore an important data point and may not even realize it.

Some of the HTML tags commonly used by spammers (which a filter should definitely be looking at) include the following:


APPLET BGSOUND FRAME IFRAME
ILAYER IMG INPUT LAYER
LINK SCRIPT A AREA
BASE DIV LINK SPAN
OBJECT FONT BODY META
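
The exclusionary approach can be sketched roughly as follows. This is only an illustration, not code from any particular filter: the `EXCLUDED_TAGS` set, the `TagTokenizer` class, and the `TAG:ATTR=value` token format are assumptions made for the example, built on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical exclusion list: tags assumed to carry little signal.
# Anything NOT named here is tokenized, including tags nobody anticipated.
EXCLUDED_TAGS = {"p", "br", "hr", "tr", "td"}


class TagTokenizer(HTMLParser):
    """Collects one token per attribute of every non-excluded tag."""

    def __init__(self, excluded=EXCLUDED_TAGS):
        super().__init__()
        self.excluded = excluded
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag in self.excluded:
            return
        if not attrs:
            self.tokens.append(tag.upper())
        for name, value in attrs:
            # Attributes without a value arrive as None.
            self.tokens.append(f"{tag.upper()}:{name.upper()}={value or ''}")


def tag_tokens(html):
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

With this layout, a tag the author never thought of (such as OBJECT) still produces tokens, because only the tags named in the exclusion set are skipped.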
Some filters like to mark the tokens generated from HTML tags with an "HTML" identifier, while others go so far as to mark the particular tag the text belonged to (for example, "BODY:BGCOLOR=#FFFFFF"). Regardless of which tags the filter decides to keep and which get discarded, it's very important to handle HTML comments correctly. Spammers use many tricks to obfuscate their text so that it remains human readable but is not very machine readable. For example, the following looks like a complete mess in its raw form:

Received: from 64.202.131.2 (h0007e9075130.ne.client2.attbi.com
    [24.218.222.43])
Message-ID: <cp6-mh-rn-w$4pa2o965rl84@jn4y0hq1bcy>
From: "patsy stamm" <arthropathology71255@earthlink.net>
Reply-To: "patsy stamm" <arthropathology71255@earthlink.net>
Subject: Giving this to you
Date: Fri, 08 Aug 03 07:29:02 GMT
X-Mailer: MIME-tools 5.503 (Entity 5.501)
MIME-Version: 1.0
Content-Type: multipart/alternative;
    boundary="AD0E55.76_15.C"
X-Priority: 3
X-MSMail-Priority: Normal

--AD0E55.76_15.C
Content-Type: text/html;
Content-Transfer-Encoding: quoted-printable

Yes you he<!lansing>ard about th<!crossbill>ese weird <!cottony>little pil<!domesday>ls

that are suppo<!=anabel>sed to make you bigger and of cou<!chord>rse you think they're b<!soften>ogus snake potion. Well, let's look at the facts:
<strong>G<!eigenspace>RX2

has be<!waldron>en sold over 1.9 Mill<!audacity>ion times within the last 18 months</strong>...
With awe<!tapestry>some results for hun<!wield>dreds of thous<!locale>ands of men all over the planet! They all enjoy a seriously enhanced version of their manh<!rescind>ood and <b>why shou<!seoul>ldn't you</b>?

But when the user clicks the message to read it, the HTML comments won't be visible and the user will see this:

Yes you heard about these weird little pills that are supposed to make you bigger and of course you think they're bogus snake potion. Well, let's look at the facts: GRX2 has been sold over 1.9 Million times within the last 18 months... With awesome results for hundreds of thousands of men all over the planet! They all enjoy a seriously enhanced version of their manhood and why shouldn't you?

A simple way to ensure that the message is tokenized correctly is to remove the HTML comments and reassemble the message.
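
A minimal sketch of that cleanup step, assuming the message body has already been decoded to a string. The function name `strip_html_comments` and the two regular expressions are illustrative choices, not taken from any particular filter. Note that the sample above uses the short `<!word>` form rather than the standard `<!-- ... -->` form, so the sketch handles both.

```python
import re

# Standard comments may contain ">" characters, so match up to "-->".
STANDARD_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Short pseudo-comments such as <!lansing>, which mail clients do not display.
# This pattern also removes declarations such as <!DOCTYPE ...>, which is
# assumed to be acceptable when the goal is only tokenization.
SHORT_COMMENT = re.compile(r"<![^>]*>")


def strip_html_comments(html):
    """Remove both comment forms so that split words are rejoined."""
    without_standard = STANDARD_COMMENT.sub("", html)
    return SHORT_COMMENT.sub("", without_standard)
```

Applied to the fragment `he<!lansing>ard about th<!crossbill>ese`, this returns `heard about these`, which can then be tokenized normally.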

Word Pairs
Using word pairs, or n-grams, has recently become very popular among authors of statistical filters, and it adds several benefits to standard single-token filtering. Pairing words together creates more specialized tokens. For example, the word "play" could be considered a very neutral word, as it could be used to describe a lot of different things. But pairing it with the word adjacent to it gives us a token that will inevitably stick out more when it occurs, such as "play lotto." This approach also helps improve the processing of HTML components by identifying the different types of generators used to create the HTML messages. Each generator, whether it's a legitimate mail client or a spam tool, has its own unique signature, which joining tokens together can help to highlight. Tokenizers that implement these types of approaches are referred to as concept-based tokenizers, because they identify concepts in addition to content.
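
As a rough illustration of word pairing, the sketch below joins each word with its right-hand neighbor. The function name `word_pairs`, the word-splitting pattern, and the decision to lowercase are assumptions made for the example; real filters differ on all three.

```python
import re

# Assumed word pattern: letters, digits, underscores, apostrophes, "$" and "-".
WORD = re.compile(r"[\w'$-]+")


def word_pairs(text):
    """Return each pair of adjacent words as a single token."""
    words = WORD.findall(text.lower())
    return [f"{first} {second}" for first, second in zip(words, words[1:])]
```

For the text `Play lotto today`, this yields the tokens `play lotto` and `lotto today`. A filter would typically record these alongside the single-word tokens rather than in place of them.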

Sparse Binary Polynomial Hashing
Bill Yerazunis originally introduced the concept known as SBPH, or sparse binary polynomial hashing. SBPH is an approach to tokenization using word pairs and phrases. If it weren't so effective at what it does, it would probably be a terrible idea, but Yerazunis has repeatedly astonished the spam-filtering community with the leaps in accuracy made by SBPH tokenization. Graham refers to SBPH with the same mixed feelings regarding its ingenuity and need for medication.

Another project I heard about . . . was Bill Yerazunis' CRM114. This is the counterexample to the design principle I just mentioned. It's a straight text classifier, but such a stunningly effective one that it manages to filter spam almost perfectly without even knowing that's what it's doing.

SBPH tokenizes entire phrases, up to five tokens across, and allows for word skipping in between. It led the way in terms of accuracy for a long time, but it also creates an enormous amount of data, which is one of the reasons it presently functions only in a train-on-error environment. SBPH provides the benefit of using the simplest, most colloquial tokens while giving special notice to more complex tokens as well, which are usually much stronger indicators of spam when they appear.
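
The windowing idea can be sketched as follows. This is an illustration of phrases with skipped words, not CRM114's actual code: the function name `sbph_features`, the `<skip>` placeholder, and the choice to anchor every phrase on the newest word in the window are assumptions, and the hashing step that gives the technique its name is left out.

```python
SKIP = "<skip>"  # hypothetical placeholder for a skipped word


def sbph_features(tokens, window=5):
    """Generate phrases of up to `window` tokens, allowing skipped words.

    Every phrase ends with the current token; each earlier token in the
    window is either kept or replaced by the SKIP placeholder.
    """
    features = []
    for i, anchor in enumerate(tokens):
        previous = tokens[max(0, i - window + 1):i]
        # Each bit of `mask` decides whether one earlier token is kept.
        for mask in range(1 << len(previous)):
            parts = [
                word if (mask >> j) & 1 else SKIP
                for j, word in enumerate(previous)
            ]
            # Leading placeholders carry no information, so drop them.
            while parts and parts[0] == SKIP:
                parts.pop(0)
            features.append(" ".join(parts + [anchor]))
    return features
```

Under these assumptions, a six-word message produces 47 features instead of 6 single-word tokens, which gives a sense of how quickly the data grows.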

About Jonathan A. Zdziarski
Jonathan A. Zdziarski has been fighting spam for eight years and has spent a significant portion of the past two years working on the next-generation spam filter DSPAM, with up to 99.985% accuracy. Zdziarski lectures widely on the topic of spam.
