gnus: Spam Statistics Package

 
 9.17.8 Spam Statistics Package
 ------------------------------
 
 Paul Graham has written an excellent essay about spam filtering using
 statistics: A Plan for Spam (http://www.paulgraham.com/spam.html).  In
 it he describes the inherent deficiency of rule-based filtering as used
 by SpamAssassin, for example: somebody has to write the rules, and
 everybody else has to install them, so the rules always lag behind the
 spam they are meant to catch.  It would be much better, he argues, to
 filter mail based on whether it resembles spam or non-spam.  One way
 to measure this is word distribution.  He then goes on to describe a
 solution that checks whether a new mail resembles your existing spam
 mails or not.
 
    The basic idea is this: Create two collections of your mail, one
 with spam, one with non-spam.  Count how often each word appears in
 either collection, weight this by the total number of mails in the
 collections, and store this information in a dictionary.  For every word
 in a new mail, determine its probability of belonging to a spam or a
 non-spam mail.  Use the 15 most conspicuous words to compute the total
 probability of the mail being spam.  If this probability is higher than
 a certain threshold, the mail is considered to be spam.
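 
    The steps above can be illustrated with a small Python sketch.  It
 is not the code of the Spam Statistics package, only a rough rendering
 of the idea under stated assumptions: the tokenizer, the function names,
 the clamping of word probabilities to the range 0.01 to 0.99, the value
 0.4 for unknown words, and the threshold 0.9 are all illustrative
 choices, and "most conspicuous" is taken to mean "furthest from the
 neutral probability 0.5".

```python
import re
from collections import Counter


def tokenize(text):
    """Split a mail into lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def build_dictionary(spam_mails, ham_mails):
    """Map each word to an estimated probability that a mail containing it is spam."""
    spam_counts = Counter(w for mail in spam_mails for w in tokenize(mail))
    ham_counts = Counter(w for mail in ham_mails for w in tokenize(mail))
    n_spam = max(len(spam_mails), 1)
    n_ham = max(len(ham_mails), 1)
    dictionary = {}
    for word in set(spam_counts) | set(ham_counts):
        # Weight the raw counts by the number of mails in each collection.
        spam_freq = min(1.0, spam_counts[word] / n_spam)
        ham_freq = min(1.0, ham_counts[word] / n_ham)
        score = spam_freq / (spam_freq + ham_freq)
        # Assumption: clamp so that no single word counts as absolute proof.
        dictionary[word] = min(0.99, max(0.01, score))
    return dictionary


def spam_probability(mail, dictionary, n_words=15, unknown=0.4):
    """Combine the n_words most conspicuous word probabilities of a mail."""
    probs = [dictionary.get(w, unknown) for w in set(tokenize(mail))]
    # "Conspicuous" is read here as: furthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    spam_product = 1.0
    ham_product = 1.0
    for p in probs[:n_words]:
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)


def is_spam(mail, dictionary, threshold=0.9):
    """Classify a mail as spam if its combined probability exceeds the threshold."""
    return spam_probability(mail, dictionary) > threshold
```

    With a dictionary built from a few spam and non-spam mails,
 `is_spam` returns true only for mails whose most conspicuous words
 appear predominantly in the spam collection; a mail with no words at
 all scores the neutral value 0.5.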
 
    The Spam Statistics package adds support to Gnus for this kind of
 filtering.  It can be used as one of the back ends of the Spam package
 (see Spam Package), or by itself.
 
    Before using the Spam Statistics package, you need to set it up.
 First, you need two collections of your mail, one with spam, one with
 non-spam.  Then you need to create a dictionary using these two
 collections, and save it.  And last but not least, you need to use this
 dictionary in your fancy mail splitting rules.
 

Menu