SYS-CON Magazines
Tokenization: The Building Blocks of Spam
Heuristic components of a statistical spam filter
Oct. 3, 2005 03:00 AM
If the tokenizer we're using considers underscores, dashes, and slashes to be token delimiters, then instead of ending up with four one-word tokens, we'll end up with 14 single-character tokens. Some filter authors believe it's healthy to let these individual characters stand as tokens, while others believe that the resulting information is too generalized to be a good indicator of anything, at least not without the risk of false positives.
Filter authors who share the latter philosophy can use token reassembly to join the original tokens back together. Token reassembly isn't a perfect science, but it provides more useful tokens to work with. The tokens "VIA" and "GRA" are much more useful than individual characters and are definitely more indicative of spam. Token reassembly concatenates single-character tokens that are adjacent to one another, looking for larger amounts of whitespace amid the slicing and dicing to make an educated guess about which characters belong to the same word.
Since statistical filtering involves machine learning and not human learning, tokens like this are very useful to the computer, even though they may not make much sense to us. For example, the token "VIA" really doesn't mean much, which is exactly why it makes a great indicator of spam - you'd rarely see the word "VIA" in a legitimate message unless you were talking about motherboards. The token "GRA" is even rarer in legitimate mail. The fact that these tokens aren't necessarily comprehensible to a human makes it easier to identify them in spam. My dataset considers some of these fractional words to be extreme indicators of spam:
Agra S: 00030 I: 00000 P: 0.9999
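
The reassembly step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used by any particular filter: the delimiter set, the function name `reassemble`, and the rule that a gap containing whitespace marks a word boundary are all choices made for this example, and the sample input is hypothetical.

```python
import re

# Assumed delimiter set: whitespace plus underscore, slash, and dash.
# The capturing group makes re.split() return the gaps as well as the tokens.
DELIMS = re.compile(r"([\s_/-]+)")

def reassemble(text):
    """Join adjacent single-character tokens back into larger tokens.

    Assumption for this sketch: a gap that contains whitespace is treated
    as a word boundary; a gap made only of punctuation delimiters is not.
    """
    parts = DELIMS.split(text)  # alternates: token, gap, token, gap, ...
    tokens = []
    run = ""  # single characters collected so far
    for i in range(0, len(parts), 2):
        tok = parts[i]
        if not tok:
            continue
        gap = parts[i - 1] if i > 0 else ""
        word_break = any(c.isspace() for c in gap)
        if len(tok) == 1 and run and not word_break:
            run += tok  # extend the current run of single characters
            continue
        if run:  # the run has ended, so emit it as one token
            tokens.append(run)
            run = ""
        if len(tok) == 1:
            run = tok  # start a new run
        else:
            tokens.append(tok)  # ordinary multi-character token
    if run:
        tokens.append(run)
    return tokens

# Hypothetical obfuscated text, not taken from a real message.
print(reassemble("buy V-I-A G-R-A now"))
```

A real filter would likely need a richer notion of "larger amounts of whitespace" (for example, comparing gap widths), which this sketch reduces to a simple whitespace check.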
Degeneration
Subject*Free!!!
Degeneration has a lot of room for customization, including the order in which the tokens decrease in complexity. At the very least, degeneration of punctuation is a wise move. If the word "free!" doesn't exist in the dataset yet, it makes good sense to use the value from a similar token.
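
One possible degeneration order can be sketched as follows. Everything here is an assumption made for illustration: the function names `degenerations` and `lookup`, the particular order (drop exclamation marks one at a time, try the lowercase form at each step, then drop the header context), the sample dataset, and the fallback value of 0.4 for tokens that never match. As noted above, the order is a matter of customization.

```python
def degenerations(token):
    """Yield progressively simpler variants of a token, most specific first."""
    # "Subject*Free!!!" -> prefix "Subject", sep "*", word "Free!!!"
    prefix, sep, word = token.rpartition("*")
    contexts = [prefix + sep, ""] if sep else [""]
    core = word.rstrip("!")
    bangs = len(word) - len(core)
    seen = set()
    for ctx in contexts:                  # with header context first, then without
        for n in range(bangs, -1, -1):    # fewer and fewer exclamation marks
            for w in (core, core.lower()):  # original case, then lowercase
                cand = f"{ctx}{w}{'!' * n}"
                if cand not in seen:
                    seen.add(cand)
                    yield cand

def lookup(token, dataset, default=0.4):
    """Return (matched_token, probability) for the first variant in the dataset."""
    for cand in degenerations(token):
        if cand in dataset:
            return cand, dataset[cand]
    return None, default  # hypothetical value for tokens with no match at all

# Hypothetical dataset entry; the probability is made up for the example.
dataset = {"free!": 0.98}
print(lookup("Subject*Free!!!", dataset))
```

Because the exact token is tried first, degeneration only comes into play when the dataset has nothing better to offer.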
Header Optimizations
From: bazz@xum2.xumx.com
In the spam shown here, several different tokens stand out. First, if my e-mail address happened to be bazz@xum2.xumx.com, I wouldn't expect to be seeing it in the From: header, but it would be very normal in the To: header. Seeing my own e-mail address in the From: header would be a clear indicator of spam, since most people don't usually send e-mail to themselves unless they've had too much to drink.
Second, the word "Save" appears in both the subject line and the message body. I would expect to see it in the message body more frequently in legitimate mail - for example, "Save your files in the blue folder" or "Save me from this dreaded cubicle." Seeing the word "Save" in the subject header is much more suspicious, though, and it makes sense for me to have a different entry in the dataset for each of them.
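
Keeping separate dataset entries for header and body tokens can be sketched by tagging each header token with the name of its header, in the same "Subject*" style used earlier. This is a minimal sketch using Python's standard `email` module; the function name `tokenize_with_context`, the choice of headers, the whitespace-only splitting, and the sample message (apart from the From: address quoted above) are assumptions for the example.

```python
from email import message_from_string

def tokenize_with_context(raw, headers=("From", "To", "Subject")):
    """Return tokens, prefixing header tokens with 'HeaderName*'."""
    msg = message_from_string(raw)
    tokens = []
    for name in headers:
        for value in msg.get_all(name, []):
            # "Save" in the subject becomes "Subject*Save"
            tokens += [f"{name}*{word}" for word in value.split()]
    # Assumption: a simple single-part text message; multipart is skipped here.
    if not msg.is_multipart():
        tokens += msg.get_payload().split()
    return tokens

# Hypothetical message built around the From: address mentioned in the text.
raw = (
    "From: bazz@xum2.xumx.com\n"
    "Subject: Save now\n"
    "\n"
    "Save your files\n"
)
print(tokenize_with_context(raw))
```

With this scheme, "Subject*Save" and "Save" accumulate their own counts in the dataset, so the subject-line occurrence can score as more suspicious than the same word in the body.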