mh-e: Junk
19 Dealing With Junk Mail
*************************
Marshall Rose once wrote a paper on MH entitled, ‘How to process 200
messages a day and still get some real work done’. This chapter could
be entitled, ‘How to process 1000 spams a day and still get some real
work done’.
We use the terms “junk mail” and “spam” interchangeably for any
unwanted message, which includes spam, “viruses”, and “worms”. The
opposite of spam is “ham”. The act of classifying a sender as one who
sends junk mail is called “blacklisting”; the opposite is called
“whitelisting”.
‘J ?’
Display cheat sheet for the commands of the current prefix in
minibuffer (‘mh-prefix-help’).
‘J b’
Blacklist range as spam (‘mh-junk-blacklist’).
‘J w’
Whitelist range as ham (‘mh-junk-whitelist’).
‘mh-spamassassin-identify-spammers’
Identify spammers who are repeat offenders.
The following table lists the options from the ‘mh-junk’
customization group.
‘mh-junk-background’
If on, spam programs are run in background (default: ‘off’).
‘mh-junk-disposition’
Disposition of junk mail (default: ‘Delete Spam’).
‘mh-junk-program’
Spam program that MH-E should use (default: ‘Auto-detect’).
The following option in the ‘mh-sequences’ customization group is
also available.
‘mh-whitelist-preserves-sequences-flag’
On means that sequences are preserved when messages are whitelisted
(default: ‘on’).
The following hooks are available.
‘mh-blacklist-msg-hook’
Hook run by ‘J b’ (‘mh-junk-blacklist’) after marking each message
for blacklisting (default: ‘nil’).
‘mh-whitelist-msg-hook’
Hook run by ‘J w’ (‘mh-junk-whitelist’) after marking each message
for whitelisting (default: ‘nil’).
The following faces are available.
‘mh-folder-blacklisted’
Blacklisted message face.
‘mh-folder-whitelisted’
Whitelisted message face.
MH-E depends on SpamAssassin (http://spamassassin.apache.org/),
bogofilter (http://bogofilter.sourceforge.net/), or SpamProbe
(http://spamprobe.sourceforge.net/) to throw the dreck away. This
chapter describes briefly how to configure these programs to work well
with MH-E and how to use MH-E’s interface that provides continuing
education for these programs.
The default setting of the option ‘mh-junk-program’ is ‘Auto-detect’
which means that MH-E will automatically choose one of SpamAssassin,
bogofilter, or SpamProbe in that order. If, for example, you have both
SpamAssassin and bogofilter installed and you want to use bogofilter,
then you can set this option to ‘Bogofilter’.
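To see which program ‘Auto-detect’ is likely to pick on a given
machine, you can check which of the three programs is installed. The
sketch below assumes that detection amounts to finding an executable
named ‘spamassassin’, ‘bogofilter’, or ‘spamprobe’ on your ‘PATH’, tried
in that order; the function name ‘first_junk_program’ is made up for
illustration and is not part of MH-E.

```shell
# Print the first spam program found on PATH, in the documented order.
# Assumption: detection is an executable lookup by these three names.
first_junk_program() {
    for prog in spamassassin bogofilter spamprobe; do
        if command -v "$prog" >/dev/null 2>&1; then
            printf '%s\n' "$prog"
            return 0
        fi
    done
    return 1    # none of the three programs is installed
}
```

Running ‘first_junk_program’ prints one program name, or prints nothing
and returns a non-zero status if none of the three is found.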
The command ‘J b’ (‘mh-junk-blacklist’) trains the spam program in
use with the content of the range (Ranges) and then handles the
message(s) as specified by the option ‘mh-junk-disposition’. By
default, this option is set to ‘Delete Spam’, but you can also specify
the name of a folder, which is useful for building a corpus of spam for
training purposes.
In contrast, the command ‘J w’ (‘mh-junk-whitelist’) reclassifies a
range of messages (Ranges) as ham if they were incorrectly
classified as spam. It then refiles the messages into the ‘+inbox’
folder.
If a message is in any sequence (except ‘Previous-Sequence:’ and
‘cur’) when it is whitelisted, then it will still be in those sequences
in the destination folder. If this behavior is not desired, then turn
off the option ‘mh-whitelist-preserves-sequences-flag’.
By default, the programs are run in the foreground, but this can be
slow when junking large numbers of messages. If you have enough memory
or don’t junk that many messages at the same time, you might try turning
on the option ‘mh-junk-background’. (1)
The following sections discuss the various counter-spam measures that
MH-E can work with.
SpamAssassin
------------
SpamAssassin is one of the more popular spam filtering programs. Get it
from your local distribution or from the SpamAssassin web site
(http://spamassassin.apache.org/).
To use SpamAssassin, add the following recipes to ‘~/.procmailrc’:
PATH=$PATH:/usr/bin/mh
MAILDIR=$HOME/`mhparam Path`
# Fight spam with SpamAssassin.
:0fw
| spamc
# Anything with a spam level of 10 or more is junked immediately.
:0:
* ^X-Spam-Level: ..........
/dev/null
:0:
* ^X-Spam-Status: Yes
spam/.
If you don’t use ‘spamc’, use ‘spamassassin -P -a’.
Note that one of the recipes above throws away messages with a score
greater than or equal to 10. Here’s how you can determine a value that
works best for you.
First, run ‘spamassassin -t’ on every mail message in your archive
and use ‘gnumeric’ to verify that the average plus the standard
deviation of good mail is under 5, the SpamAssassin default for “spam”.
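If you would rather not use a spreadsheet for this check, the same
figure can be computed on the command line. The sketch below assumes
you have already extracted the scores of your good mail into a stream
with one number per line; how you pull the score out of the
‘spamassassin -t’ report is not shown here. The function name
‘mean_plus_stddev’ is hypothetical, and the sample standard deviation is
used.

```shell
# Read one numeric score per line on stdin and print the mean plus the
# sample standard deviation, rounded to two decimal places.
mean_plus_stddev() {
    awk 'NF { n++; sum += $1; sumsq += $1 * $1 }
         END {
             if (n == 0) exit 1            # no scores were read
             mean = sum / n
             var = (n > 1) ? (sumsq - sum * sum / n) / (n - 1) : 0
             if (var < 0) var = 0          # guard against rounding error
             printf "%.2f\n", mean + sqrt(var)
         }'
}
```

For example, ‘mean_plus_stddev < ham-scores.txt’ (a hypothetical file
name) prints a single number that you can compare against 5.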
Using ‘gnumeric’, sort the messages by score and view the messages
with the highest score. Determine the score which encompasses all of
your interesting messages and add a couple of points to be conservative.
Add that many dots to the ‘X-Spam-Level:’ header field above to send
messages with that score down the drain.
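Counting dots by hand is error-prone once the threshold grows. The
sketch below prints the procmail condition line for a given threshold.
It relies on each dot in the regular expression matching one character
of the ‘X-Spam-Level:’ value, which SpamAssassin writes as one character
per whole point of score. The function name ‘spam_level_condition’ is
made up for illustration.

```shell
# Print a procmail condition that matches a spam level of at least $1.
spam_level_condition() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;    # reject anything but a plain integer
    esac
    dots=
    i=0
    while [ "$i" -lt "$1" ]; do
        dots="$dots."               # one dot per point of score
        i=$((i + 1))
    done
    printf '* ^X-Spam-Level: %s\n' "$dots"
}
```

You can paste the output of, say, ‘spam_level_condition 12’ in place of
the existing ‘X-Spam-Level:’ condition in the recipe.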
In the example above, messages with a score of 5–9 are set aside in
the ‘+spam’ folder for later review. The major weakness of rules-based
filters is a plethora of false positives, so it is worthwhile to check.
If SpamAssassin classifies a message incorrectly, or is unsure, you
can use the MH-E commands ‘J b’ (‘mh-junk-blacklist’) and ‘J w’
(‘mh-junk-whitelist’).
The command ‘J b’ (‘mh-junk-blacklist’) adds a ‘blacklist_from’ entry
to ‘~/.spamassassin/user_prefs’, deletes the message, and sends the
message to Razor so that others might not see this spam. If the
‘sa-learn’ command is available, the message is also recategorized as
spam.
The command ‘J w’ (‘mh-junk-whitelist’) adds a ‘whitelist_from’ rule
to ‘~/.spamassassin/user_prefs’. If the ‘sa-learn’ command is
available, the message is also recategorized as ham.
Over time, you’ll observe that the same host or domain occurs
repeatedly in the ‘blacklist_from’ entries, so you might think that you
could avoid future spam by blacklisting all mail from a particular
domain. The utility function ‘mh-spamassassin-identify-spammers’ helps
you do precisely that. This function displays a frequency count of the
hosts and domains in the ‘blacklist_from’ entries from the last blank
line in ‘~/.spamassassin/user_prefs’ to the end of the file. You can
use this information to replace multiple ‘blacklist_from’ entries with
a single wildcard entry such as:
blacklist_from *@*amazingoffersdirect2u.com
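A rough command-line counterpart to that frequency count can be
sketched as follows. It is not what ‘mh-spamassassin-identify-spammers’
does internally: it counts every ‘blacklist_from’ entry in the file
rather than only those after the last blank line, and it tallies only
the full text after the ‘@’ rather than hosts and domains separately.
The function name ‘blacklist_domain_counts’ is hypothetical.

```shell
# Count how often each address domain appears in blacklist_from entries.
# $1: path to a user_prefs-style file
blacklist_domain_counts() {
    awk '$1 == "blacklist_from" && $2 ~ /@/ {
             sub(/^.*@/, "", $2)        # keep only the part after the "@"
             count[tolower($2)]++
         }
         END { for (d in count) print count[d], d }' "$1" |
        sort -k1,1nr -k2,2              # most frequent first, then by name
}
```

For example, ‘blacklist_domain_counts ~/.spamassassin/user_prefs’ prints
one line per domain, with the most frequent domains at the top.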
In versions of SpamAssassin (2.50 and on) that support a Bayesian
classifier, ‘J b’ (‘mh-junk-blacklist’) uses the program ‘sa-learn’ to
recategorize the message as spam. Neither MH-E nor SpamAssassin
rebuilds the database after adding words, so you will need to run
‘sa-learn --rebuild’ periodically. This can be done by adding the
following to your ‘crontab’:
0 * * * * sa-learn --rebuild > /dev/null 2>&1
Bogofilter
----------
Bogofilter is a Bayesian spam filtering program. Get it from your local
distribution or from the bogofilter web site
(http://bogofilter.sourceforge.net/).
Bogofilter is taught by running:
bogofilter -n < good-message
on every good message, and
bogofilter -s < spam-message
on every spam message. This is called a “full training”; three other
training methods are described in the FAQ that is distributed with
bogofilter. Note that most Bayesian filters need 1000 to 5000 of each
type of message to start doing a good job.
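Since MH stores each message as a file named after its message number,
a full training run over a folder can be sketched as a loop. The
function name ‘train_mh_folder’ and the ‘BOGOFILTER’ variable are made up
for illustration; the variable only exists so that you can substitute
another command for a dry run. The folder paths in the usage note are
assumptions about your mail layout.

```shell
# Command used for training; override it to rehearse without bogofilter.
BOGOFILTER=${BOGOFILTER:-bogofilter}

# Feed every message in an MH folder to bogofilter.
# $1: -n (good mail) or -s (spam); $2: directory of the MH folder
train_mh_folder() {
    flag=$1 dir=$2
    for msg in "$dir"/*; do
        name=${msg##*/}
        case $name in
            ''|*[!0-9]*) continue ;;    # skip anything not named by a number
        esac
        [ -f "$msg" ] || continue       # skip subfolders with numeric names
        "$BOGOFILTER" "$flag" < "$msg" || return 1
    done
}
```

You might then run ‘train_mh_folder -n ~/Mail/inbox’ on a folder of good
mail and ‘train_mh_folder -s ~/Mail/spam’ on a folder of spam.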
To use bogofilter, add the following recipes to ‘~/.procmailrc’:
PATH=$PATH:/usr/bin/mh
MAILDIR=$HOME/`mhparam Path`
# Fight spam with Bogofilter.
:0fw
| bogofilter -3 -e -p
:0:
* ^X-Bogosity: Yes, tests=bogofilter
spam/.
:0:
* ^X-Bogosity: Unsure, tests=bogofilter
spam/unsure/.
If bogofilter classifies a message incorrectly, or is unsure, you can
use the MH-E commands ‘J b’ (‘mh-junk-blacklist’) and ‘J w’
(‘mh-junk-whitelist’) to update bogofilter’s training.
The ‘Bogofilter FAQ’ suggests that you run the following occasionally
to shrink the database:
bogoutil -d wordlist.db | bogoutil -l wordlist.db.new
mv wordlist.db wordlist.db.prv
mv wordlist.db.new wordlist.db
The ‘Bogofilter tuning HOWTO’ describes how you can fine-tune
bogofilter.
SpamProbe
---------
SpamProbe is a Bayesian spam filtering program. Get it from your local
distribution or from the SpamProbe web site
(http://spamprobe.sourceforge.net).
To use SpamProbe, add the following recipes to ‘~/.procmailrc’:
PATH=$PATH:/usr/bin/mh
MAILDIR=$HOME/`mhparam Path`
# Fight spam with SpamProbe.
:0
SCORE=| spamprobe receive
:0 wf
| formail -I "X-SpamProbe: $SCORE"
:0:
* ^X-SpamProbe: SPAM
spam/.
If SpamProbe classifies a message incorrectly, you can use the MH-E
commands ‘J b’ (‘mh-junk-blacklist’) and ‘J w’ (‘mh-junk-whitelist’) to
update SpamProbe’s training.
Other Things You Can Do
-----------------------
There are a couple of things that you can add to ‘~/.procmailrc’ in
order to filter out a lot of spam and viruses. The first is to
eliminate any message with a Windows executable (which is most likely a
virus). The second is to eliminate mail in character sets that you
can’t read.
PATH=$PATH:/usr/bin/mh
MAILDIR=$HOME/`mhparam Path`
#
# Filter messages with w32 executables/viruses.
#
# These attachments are base64 and have a TVqQAAMAAAAEAAAA//8AALg
# pattern. The string "this program cannot be run in MS-DOS mode"
# encoded in base64 is 4fug4AtAnNIbg and helps to avoid false
# positives (Roland Smith via Pete from the bogofilter mailing list).
#
:0 B:
* ^Content-Transfer-Encoding:.*base64
* ^TVqQAAMAAAAEAAAA//8AALg
* 4fug4AtAnNIbg
spam/exe/.
#
# Filter mail in unreadable character sets (from the Bogofilter FAQ).
#
UNREADABLE='[^?"]*big5|iso-2022-jp|ISO-2022-KR|euc-kr|gb2312|ks_c_5601-1987'
:0:
* 1^0 $ ^Subject:.*=\?($UNREADABLE)
* 1^0 $ ^Content-Type:.*charset="?($UNREADABLE)
spam/unreadable/.
:0:
* ^Content-Type:.*multipart
* B ?? $ ^Content-Type:.*^?.*charset="?($UNREADABLE)
spam/unreadable/.
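Before trusting the executable recipe with live mail, you may want to
see which saved messages it would catch. The sketch below applies the
same three patterns to a message file with ‘grep’. It is only an
approximation: it searches the whole file rather than just the body,
and unlike procmail, which ignores case by default, it matches the two
base64 patterns case-sensitively. The function name
‘looks_like_w32_exe’ is hypothetical.

```shell
# Succeed if a saved message matches all three patterns of the recipe.
# $1: path to a message file
looks_like_w32_exe() {
    grep -qi '^Content-Transfer-Encoding:.*base64' "$1" &&
        grep -q '^TVqQAAMAAAAEAAAA//8AALg' "$1" &&
        grep -q '4fug4AtAnNIbg' "$1"
}
```

For example, looping over the files in a folder and printing the names
for which ‘looks_like_w32_exe’ succeeds shows what would have been
filed in ‘spam/exe/.’.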
---------- Footnotes ----------
(1) Note that the option ‘mh-junk-background’ is used as the
‘display’ argument in the call to ‘call-process’. Therefore, turning on
this option means setting its value to ‘0’. You can also set its value
to ‘t’ to direct the programs’ output to the ‘*MH-E Log*’ buffer; this
may be useful for debugging.