9.17.8 Spam Statistics Package
------------------------------
Paul Graham has written an excellent essay about spam filtering using
statistics: A Plan for Spam (http://www.paulgraham.com/spam.html). In
it he describes the inherent deficiency of rule-based filtering as used
by SpamAssassin, for example: Somebody has to write the rules, and
everybody else has to install these rules. You are always late. It
would be much better, he argues, to filter mail based on whether it
somehow resembles spam or non-spam. One way to measure this is word
distribution. He then goes on to describe a solution that checks
whether a new mail resembles any of your other spam mails or not.

The basic idea is this: Create two collections of your mail, one
with spam, one with non-spam. Count how often each word appears in
either collection, weight this by the total number of mails in the
collections, and store this information in a dictionary. For every word
in a new mail, determine the probability that it belongs to a spam or a
non-spam mail. Using the 15 most conspicuous words, compute the total
probability of the mail being spam. If this probability is higher than
a certain threshold, the mail is considered to be spam.
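
The procedure described above can be sketched in a few lines of Python.
This is only an illustration of the idea, not the code used by the Spam
Statistics package: the function names (`build_dictionary`, `spam_score`,
`is_spam`) are hypothetical, and the constants (clamping to 0.01 and 0.99,
a score of 0.4 for unknown words, a threshold of 0.9) are assumptions
loosely modelled on Graham's essay.

```python
import re
from collections import Counter
from math import prod


def tokenize(text):
    """Split a mail into lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def build_dictionary(spam_mails, ham_mails):
    """Map each word to an estimated probability that a mail containing it is spam.

    Both collections must be non-empty. Word counts are weighted by the
    number of mails in each collection, as described in the text.
    """
    spam_counts = Counter(w for m in spam_mails for w in tokenize(m))
    ham_counts = Counter(w for m in ham_mails for w in tokenize(m))
    n_spam, n_ham = len(spam_mails), len(ham_mails)
    dictionary = {}
    for word in spam_counts.keys() | ham_counts.keys():
        b = min(1.0, spam_counts[word] / n_spam)  # weighted spam frequency
        g = min(1.0, ham_counts[word] / n_ham)    # weighted non-spam frequency
        # Clamp so that a single word can never be absolutely decisive.
        dictionary[word] = min(0.99, max(0.01, b / (g + b)))
    return dictionary


def spam_score(mail, dictionary, top_n=15, unknown=0.4):
    """Combine the top_n most conspicuous word probabilities into one score."""
    probs = [dictionary.get(w, unknown) for w in set(tokenize(mail))]
    # "Conspicuous" here means farthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:top_n]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)


def is_spam(mail, dictionary, threshold=0.9):
    """Treat the mail as spam if its score exceeds the (assumed) threshold."""
    return spam_score(mail, dictionary) > threshold
```

With a dictionary built from a handful of example mails, `is_spam` returns
true for a mail made up of words seen only in the spam collection and false
for one made up of words seen only in the non-spam collection. A mail with
no words at all scores exactly 0.5.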

The Spam Statistics package adds support to Gnus for this kind of
filtering. It can be used as one of the back ends of the Spam package
(see Spam Package), or by itself.

Before using the Spam Statistics package, you need to set it up.
First, you need two collections of your mail, one with spam, one with
non-spam. Then you need to create a dictionary using these two
collections, and save it. And last but not least, you need to use this
dictionary in your fancy mail splitting rules.
Menu