Tokenization: The Building Blocks of Spam
Heuristic components of a statistical spam filter

The word "FREE" also shows up in both the subject line and message body but, in this case, they're both very guilty indicators of spam. The filter still benefits here because the tokens "FREE" and "Subject*FREE" now have the ability to take up two slots in my decision matrix, further condemning the spam. Header tokens are extremely useful for identifying both spam and legitimate mail.

Other types of header tokens are frequently found to be useful, and the set of delimiters used in the headers is usually slightly different from those used in the message body. For example, if I want to catch all of the IP addresses in the Received: headers, I would treat a period as a constituent character (part of the token) instead of a separator. If I wanted to tokenize the message-id, I'd also include the @ sign as a delimiter, as it is used to separate some pieces of the message-id.

Another advantage of including the header as part of the token is that it helps to create a virtual "whitelist" of users you trust. If I exchange a lot of correspondence with bobsmith@somedomain.com, tokens like "From*bobsmith" and "From*somedomain.com" will start to appear in the dataset, usually with very innocent values. This works equally well for identifying the hostnames of trusted mail servers in the Received: header.
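
A minimal sketch of the header tokenization described above, in Python. The delimiter sets, the function name tokenize_header, and the sample header values are assumptions made for illustration. The article states only that header delimiters differ slightly from the body set, that a period is kept as a constituent in Received: headers, and that @ separates pieces of the message-id.

```python
# Hypothetical delimiter sets; the exact characters are assumptions.
DEFAULT_DELIMS = " \t.,;:<>()[]@"
HEADER_DELIMS = {
    # Period is NOT a delimiter here, so IP addresses and hostnames stay whole.
    "received": " \t,;:<>()[]",
    # '@' splits the address; the period is kept so the domain stays whole.
    "from": " \t,;:<>()@",
    # '@' and '.' both split the message-id into its pieces.
    "message-id": " \t.<>@",
}


def tokenize_header(name, value):
    """Split a header value and prefix every token with the header name."""
    delims = HEADER_DELIMS.get(name.lower(), DEFAULT_DELIMS)
    # Turn every delimiter into a space, then split on whitespace.
    table = str.maketrans({d: " " for d in delims})
    return [f"{name}*{part}" for part in value.translate(table).split()]


if __name__ == "__main__":
    print(tokenize_header("Subject", "FREE offer"))
    print(tokenize_header("From", "Bob Smith <bobsmith@somedomain.com>"))
    print(tokenize_header("Received", "from mail.example.com [192.0.2.10]"))
```

Because the header name is part of each token, "FREE" in the body and "Subject*FREE" in the subject line are counted as separate entries in the dataset.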

URL Optimizations
Everyday innocent-sounding words like "order" and "cgi" often appear in the body of messages I receive from legitimate mailing lists. Seeing them appear in a URL, however, is much more suspicious. URLs are the spammers' preferred means of contact. It's much easier to run a scam using a Web site as your point of contact than it is to pay for the overhead of a phone system or mail processing department. Spammers also like their privacy, since the rest of the free world hates them, and they prefer that even customers not know how to contact them or the companies they spam for. Whether it's a link to click to visit a site or the URL of an image inside the message, URLs provide a lot of useful information specific to their own kind. Even nonsensical numbers will frequently stand out in URLs. This makes for very good data for identifying not only spam but also some legitimate mailing lists that use URLs in their unsubscribe tag lines. Users who subscribe to mailing lists that frequently include embedded advertisements (such as Yahoo Groups) will notice specific characteristics of the URLs used in these advertisements that help the filter distinguish between advertising and real spam.

URLs are frequently tokenized differently than the rest of a message. The only delimiters usually used when tokenizing a URL are the slash, question mark, equal sign, period, and colon, although some filter authors perform the same basic type of token separation as they do in the rest of the message body. Tokenizing using URL-specific delimiters is done because the individual tokens are more frequently found based on their path in the URL, rather than on a specific context inside the URL. Regardless of how they are tokenized, URLs, when analyzed, can yield a lot of useful information. They can be categorized as places you want to go and places you don't want to go. A spam containing places you don't want to go is just as informative as a legitimate message containing places you do.
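
As a rough illustration, URL-specific tokenization with the five delimiters named above could look like the following Python sketch. The function name tokenize_url and the sample URL are hypothetical, and whether a real filter also emits the scheme ("http") as a token is an assumption here.

```python
# The five URL delimiters listed in the text: slash, question mark,
# equal sign, period, and colon.
URL_DELIMS = "/?=.:"


def tokenize_url(url):
    """Split a URL on the URL-specific delimiters and prefix each piece."""
    # Turn every delimiter into a space, then split on whitespace.
    table = str.maketrans({d: " " for d in URL_DELIMS})
    return ["Url*" + part for part in url.translate(table).split()]


if __name__ == "__main__":
    # Hypothetical URL, built from tokens of the kind shown in this article.
    for token in tokenize_url("http://www.thesedealzwontlast.biz/img/order.cgi?id=4821"):
        print(token)
```

The "Url*" prefix keeps a word such as "order" inside a URL separate from the same word in the message body, so each can earn its own value.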

Url*getitrightnowwholesale S: 00026 I: 00000 P: 0.9999
Url*thesedealzwontlast S: 00026 I: 00000 P: 0.9999
Url*biz S: 00008 I: 00000 P: 0.9998
Url*us S: 00000 I: 00050 P: 0.0001
Url*java S: 00018 I: 00000 P: 0.9999
Url*www S: 00000 I: 00030 P: 0.0001
Url*com S: 00000 I: 00033 P: 0.0001
Url*img S: 00066 I: 00000 P: 0.9999

Ironically, legitimate URLs seem to be rare among spammers, while the wild and obnoxious names always pop up, with the exception of "java," of course, which appeared as spammy only because this user doesn't use Java (not because Java programmers were spamming). The appearance of certain naming conventions, such as the extensive use of "img," makes the task of identifying malicious URLs pretty easy. If we wanted to, we could probably determine the disposition of the message based on the URL information alone.

Conversely, URLs containing well-known Web addresses are likely to appear as innocent tokens or as hapaxes. Not a single URL token containing the following words has ever appeared in my corpus as spammy:

  • Url*microsoft
  • Url*quicken
  • Url*whitehouse
  • Url*intuit
  • Url*sco
  • Url*_amazon
  • Url*linux
  • Url*fbi

HTML Tokenization
One area that has plagued many filter authors is the decision about which HTML to include and which parts of the message to ignore. For example, should we ignore JavaScript? What about font tags? Most filters pay attention to all HTML tags except those on an exclusionary list, namely, a specific set of tokens that are common to all types of e-mail. This approach works quite well, but there's still room for improvement. Ignoring data is always something to be concerned about, and you shouldn't do it unless you have good reason. The justification for ignoring some HTML data is that many people normally converse only with senders who do not use HTML. This could cause any type of message with embedded HTML to be rejected as spam, which could be bad for the recipient if their boss suddenly started using an HTML-enabled mail client. The tags most filters ignore include:

    • td
    • !doctype
    • blockquote
    • table
    • tr
    • div
    • p
    • body
    • Short tags, with fewer than N characters of content
    • Tags whose content contains no spaces
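
The exclusion rules in this list can be sketched in Python as follows. This is an illustration under stated assumptions, not the method of any particular filter: "content" is taken to mean the text between the angle brackets, the value chosen for N (MIN_TAG_LENGTH) is an arbitrary placeholder, and the function name interesting_tags is hypothetical.

```python
import re

# Tag names from the exclusion list above.
IGNORED_TAGS = {"td", "!doctype", "blockquote", "table", "tr", "div", "p", "body"}

# The article leaves N unspecified; 4 is an arbitrary placeholder.
MIN_TAG_LENGTH = 4

# Naive tag matcher: everything between '<' and '>' (assumes no '>' in attributes).
TAG_RE = re.compile(r"<([^<>]+)>")


def interesting_tags(html):
    """Return the tag contents that survive the exclusion rules."""
    kept = []
    for content in TAG_RE.findall(html):
        content = content.strip()
        if not content:
            continue
        # Tag name without a leading slash, so closing tags match too.
        name = content.split()[0].lstrip("/").lower()
        if name in IGNORED_TAGS:
            continue  # on the exclusion list
        if len(content) < MIN_TAG_LENGTH:
            continue  # short tag, fewer than N characters of content
        if " " not in content:
            continue  # content contains no spaces
        kept.append(content)
    return kept


if __name__ == "__main__":
    sample = '<body><table><tr><td><font color="red" size="7">FREE</font></td></tr></table></body>'
    print(interesting_tags(sample))
```

Tags that survive these rules would then be tokenized like any other part of the message.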

About Jonathan A. Zdziarski
Jonathan A. Zdziarski has been fighting spam for eight years, and has spent a significant portion of the past two years working on the next-generation spam filter DSPAM, with up to 99.985% accuracy. Zdziarski lectures widely on the topic of spam.
