mh-e: Junk

 
 19 Dealing With Junk Mail
 *************************
 
 Marshall Rose once wrote a paper on MH entitled, ‘How to process 200
 messages a day and still get some real work done’.  This chapter could
 be entitled, ‘How to process 1000 spams a day and still get some real
 work done’.
 
    We use the terms “junk mail” and “spam” interchangeably for any
 unwanted message which includes spam, “viruses”, and “worms”.  The
 opposite of spam is “ham”.  The act of classifying a sender as one who
 sends junk mail is called “blacklisting”; the opposite is called
 “whitelisting”.
 
 ‘J ?’
      Display cheat sheet for the commands of the current prefix in
      minibuffer (‘mh-prefix-help’).
 ‘J b’
      Blacklist range as spam (‘mh-junk-blacklist’).
 ‘J w’
      Whitelist range as ham (‘mh-junk-whitelist’).
 ‘mh-spamassassin-identify-spammers’
      Identify spammers who are repeat offenders.
 
    The following table lists the options from the ‘mh-junk’
 customization group.
 
 ‘mh-junk-background’
      If on, spam programs are run in background (default: ‘off’).
 ‘mh-junk-disposition’
      Disposition of junk mail (default: ‘Delete Spam’).
 ‘mh-junk-program’
      Spam program that MH-E should use (default: ‘Auto-detect’).
 
    The following option in the ‘mh-sequences’ customization group is
 also available.
 
 ‘mh-whitelist-preserves-sequences-flag’
      On means that sequences are preserved when messages are whitelisted
      (default: ‘on’).
 
    The following hooks are available.
 
 ‘mh-blacklist-msg-hook’
      Hook run by ‘J b’ (‘mh-junk-blacklist’) after marking each message
      for blacklisting (default: ‘nil’).
 ‘mh-whitelist-msg-hook’
      Hook run by ‘J w’ (‘mh-junk-whitelist’) after marking each message
      for whitelisting (default: ‘nil’).
 
    The following faces are available.
 
 ‘mh-folder-blacklisted’
      Blacklisted message face.
 ‘mh-folder-whitelisted’
      Whitelisted message face.
 
    MH-E depends on SpamAssassin (http://spamassassin.apache.org/),
 bogofilter (http://bogofilter.sourceforge.net/), or SpamProbe
 (http://spamprobe.sourceforge.net/) to throw the dreck away.  This
 chapter describes briefly how to configure these programs to work well
 with MH-E and how to use MH-E’s interface that provides continuing
 education for these programs.
 
    The default setting of the option ‘mh-junk-program’ is
 ‘Auto-detect’, which means that MH-E will automatically choose one of
 SpamAssassin, bogofilter, or SpamProbe, in that order.  If, for
 example, you have both SpamAssassin and bogofilter installed and you
 want to use bogofilter, then you can set this option to ‘Bogofilter’.
 
    The command ‘J b’ (‘mh-junk-blacklist’) trains the spam program in
 use with the content of the range (see Ranges) and then handles the
 message(s) as specified by the option ‘mh-junk-disposition’.  By
 default, this option is set to ‘Delete Spam’, but you can also specify
 the name of a folder, which is useful for building a corpus of spam for
 training purposes.
 
    In contrast, the command ‘J w’ (‘mh-junk-whitelist’) reclassifies a
 range of messages (see Ranges) as ham if they were incorrectly
 classified as spam.  It then refiles the messages into the ‘+inbox’
 folder.
 
    If a message is in any sequence (except ‘Previous-Sequence:’ and
 ‘cur’) when it is whitelisted, then it will still be in those sequences
 in the destination folder.  If this behavior is not desired, then turn
 off the option ‘mh-whitelist-preserves-sequences-flag’.
 
    By default, the programs are run in the foreground, but this can be
 slow when junking large numbers of messages.  If you have enough memory
 or don’t junk that many messages at the same time, you might try turning
 on the option ‘mh-junk-background’.  (1)
 
    The following sections discuss the various counter-spam measures that
 MH-E can work with.
 
 SpamAssassin
 ------------
 
 SpamAssassin is one of the more popular spam filtering programs.  Get it
 from your local distribution or from the SpamAssassin web site
 (http://spamassassin.apache.org/).
 
    To use SpamAssassin, add the following recipes to ‘~/.procmailrc’:
 
      PATH=$PATH:/usr/bin/mh
      MAILDIR=$HOME/`mhparam Path`
 
      # Fight spam with SpamAssassin.
      :0fw
      | spamc
 
      # Anything with a spam level of 10 or more is junked immediately.
      :0:
      * ^X-Spam-Level: ..........
      /dev/null
 
      :0:
      * ^X-Spam-Status: Yes
      spam/.
 
    If you don’t use ‘spamc’, use ‘spamassassin -P -a’.
 
    Note that one of the recipes above throws away messages with a score
 greater than or equal to 10.  Here’s how you can determine a value that
 works best for you.
 
    First, run ‘spamassassin -t’ on every mail message in your archive
 and use ‘gnumeric’ to verify that the average plus the standard
 deviation of good mail is under 5, the SpamAssassin default for “spam”.
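
    The same check can be done without a spreadsheet.  The following is
 a minimal shell sketch and not part of MH-E: the helper names
 ‘extract_score’ and ‘score_stats’ are made up for illustration, and the
 sketch assumes that ‘spamassassin -t’ writes an ‘X-Spam-Status:’ header
 containing ‘score=N’ (older releases wrote ‘hits=N’).

```shell
# extract_score: read one message on stdin and print the score found in
# its X-Spam-Status header.
# Assumption: the header looks like
#   X-Spam-Status: No, score=3.2 required=5.0 tests=...
# (older SpamAssassin releases wrote hits= instead of score=).
extract_score() {
    # Stop at the first blank line so that only headers are examined.
    sed -n -E -e '/^$/q' \
        -e 's/^X-Spam-Status:.*(score|hits)=(-?[0-9.]+).*/\2/p'
}

# score_stats: read one numeric score per line on stdin and print
# "mean stddev mean+stddev" (population standard deviation).
score_stats() {
    awk '
        NF { n++; sum += $1; sumsq += $1 * $1 }
        END {
            if (n == 0) exit 1
            mean = sum / n
            var = sumsq / n - mean * mean
            if (var < 0) var = 0      # guard against rounding error
            sd = sqrt(var)
            printf "%.2f %.2f %.2f\n", mean, sd, mean + sd
        }'
}

# Possible usage (the folder name "inbox" is an assumption):
#
#   for msg in "$HOME/$(mhparam Path)"/inbox/[0-9]*; do
#       spamassassin -t < "$msg" | extract_score
#   done | score_stats
```

    The third number printed is the average plus the standard deviation;
 for a folder of good mail it should come out under 5.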
 
    Using ‘gnumeric’, sort the messages by score and view the messages
 with the highest score.  Determine the score which encompasses all of
 your interesting messages and add a couple of points to be conservative.
 Add that many dots to the ‘X-Spam-Level:’ header field above to send
 messages with that score down the drain.
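
    Counting dots by hand gets error-prone for larger thresholds.  As a
 small convenience, the hypothetical helper below (the name
 ‘spam_level_condition’ is an assumption; it is not part of MH-E or
 procmail) prints the condition line for a given threshold so that it
 can be pasted into ‘~/.procmailrc’.

```shell
# spam_level_condition N: print a procmail condition that matches an
# X-Spam-Level header holding at least N characters.
# Hypothetical helper for illustration only.
spam_level_condition() {
    n=$1
    case $n in
        ''|*[!0-9]*)
            echo "usage: spam_level_condition N" >&2
            return 1 ;;
    esac
    dots=
    i=0
    while [ "$i" -lt "$n" ]; do
        dots="$dots."                 # one dot per point of score
        i=$((i + 1))
    done
    printf '* ^X-Spam-Level: %s\n' "$dots"
}

# Possible usage:
#
#   spam_level_condition 12
```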
 
    In the example above, messages with a score of 5–9 are set aside in
 the ‘+spam’ folder for later review.  The major weakness of rules-based
 filters is a plethora of false positives, so it is worthwhile to check.
 
    If SpamAssassin classifies a message incorrectly, or is unsure, you
 can use the MH-E commands ‘J b’ (‘mh-junk-blacklist’) and ‘J w’
 (‘mh-junk-whitelist’).
 
    The command ‘J b’ (‘mh-junk-blacklist’) adds a ‘blacklist_from’ entry
 to ‘~/.spamassassin/user_prefs’, deletes the message, and sends the
 message to the Razor, so that others might not see this spam.  If the
 ‘sa-learn’ command is available, the message is also recategorized as
 spam.
 
    The command ‘J w’ (‘mh-junk-whitelist’) adds a ‘whitelist_from’ rule
 to ‘~/.spamassassin/user_prefs’.  If the ‘sa-learn’ command is
 available, the message is also recategorized as ham.
 
    Over time, you’ll observe that the same host or domain occurs
 repeatedly in the ‘blacklist_from’ entries, so you might think that you
 could avoid future spam by blacklisting all mail from a particular
 domain.  The utility function ‘mh-spamassassin-identify-spammers’ helps
 you do precisely that.  This function displays a frequency count of the
 hosts and domains in the ‘blacklist_from’ entries from the last blank
 line in ‘~/.spamassassin/user_prefs’ to the end of the file.  You can
 use this information to replace multiple ‘blacklist_from’ entries with
 a single wildcard entry such as:
 
      blacklist_from *@*amazingoffersdirect2u.com
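
    A similar tally can be produced outside of Emacs.  The sketch below
 is not how ‘mh-spamassassin-identify-spammers’ is implemented; it is a
 hypothetical shell helper (the name ‘blacklist_domain_counts’ is an
 assumption) that counts only the full domain after the ‘@’, and that
 looks at every ‘blacklist_from’ entry in the file rather than starting
 from the last blank line.

```shell
# blacklist_domain_counts FILE: print "count domain" lines, most
# frequent first, for the blacklist_from entries in FILE.
# Hypothetical helper; it only approximates what MH-E reports.
blacklist_domain_counts() {
    awk '$1 == "blacklist_from" && $2 ~ /@/ {
             sub(/^.*@/, "", $2)      # keep only the part after "@"
             print tolower($2)
         }' "$1" |
        sort | uniq -c |
        awk '{ print $1, $2 }' |      # normalize the column spacing
        sort -k1,1nr -k2,2
}

# Possible usage:
#
#   blacklist_domain_counts ~/.spamassassin/user_prefs | head
```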
 
    In versions of SpamAssassin (2.50 and on) that support a Bayesian
 classifier, ‘J b’ (‘mh-junk-blacklist’) uses the program ‘sa-learn’ to
 recategorize the message as spam.  Neither MH-E, nor SpamAssassin,
 rebuilds the database after adding words, so you will need to run
 ‘sa-learn --rebuild’ periodically.  This can be done by adding the
 following to your ‘crontab’:
 
      0 * * * *       sa-learn --rebuild > /dev/null 2>&1
 
 Bogofilter
 ----------
 
 Bogofilter is a Bayesian spam filtering program.  Get it from your local
 distribution or from the bogofilter web site
 (http://bogofilter.sourceforge.net/).
 
    Bogofilter is taught by running:
 
      bogofilter -n < good-message
 
    on every good message, and
 
      bogofilter -s < spam-message
 
    on every spam message.  This is called a “full training”; three other
 training methods are described in the FAQ that is distributed with
 bogofilter.  Note that most Bayesian filters need 1000 to 5000 of each
 type of message to start doing a good job.
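
    Since MH keeps each message as a numbered file in its folder’s
 directory, a full training over a whole folder can be sketched as a
 loop around the two commands above.  The helper name ‘train_folder’
 and the folder names in the usage comment are assumptions for
 illustration.

```shell
# train_folder FLAG DIR: feed every message file in the MH folder
# directory DIR to bogofilter; FLAG is -n (ham) or -s (spam).
# Hypothetical helper for illustration only.
train_folder() {
    flag=$1
    dir=$2
    case $flag in
        -n|-s) ;;
        *)
            echo "usage: train_folder -n|-s DIR" >&2
            return 1 ;;
    esac
    for msg in "$dir"/[0-9]*; do
        # Skip subfolders and the literal pattern of an unmatched glob.
        [ -f "$msg" ] || continue
        bogofilter "$flag" < "$msg" || return 1
    done
}

# Possible usage (folder names are assumptions):
#
#   train_folder -n "$HOME/$(mhparam Path)/inbox"
#   train_folder -s "$HOME/$(mhparam Path)/spam"
```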
 
    To use bogofilter, add the following recipes to ‘~/.procmailrc’:
 
      PATH=$PATH:/usr/bin/mh
      MAILDIR=$HOME/`mhparam Path`
 
      # Fight spam with Bogofilter.
      :0fw
      | bogofilter -3 -e -p
 
      :0:
      * ^X-Bogosity: Yes, tests=bogofilter
      spam/.
 
      :0:
      * ^X-Bogosity: Unsure, tests=bogofilter
      spam/unsure/.
 
    If bogofilter classifies a message incorrectly, or is unsure, you can
 use the MH-E commands ‘J b’ (‘mh-junk-blacklist’) and ‘J w’
 (‘mh-junk-whitelist’) to update bogofilter’s training.
 
    The ‘Bogofilter FAQ’ suggests that you run the following occasionally
 to shrink the database:
 
      bogoutil -d wordlist.db | bogoutil -l wordlist.db.new
      mv wordlist.db wordlist.db.prv
      mv wordlist.db.new wordlist.db
 
    The ‘Bogofilter tuning HOWTO’ describes how you can fine-tune
 bogofilter.
 
 SpamProbe
 ---------
 
 SpamProbe is a Bayesian spam filtering program.  Get it from your local
 distribution or from the SpamProbe web site
 (http://spamprobe.sourceforge.net).
 
    To use SpamProbe, add the following recipes to ‘~/.procmailrc’:
 
      PATH=$PATH:/usr/bin/mh
      MAILDIR=$HOME/`mhparam Path`
 
      # Fight spam with SpamProbe.
      :0
      SCORE=| spamprobe receive
 
      :0 wf
      | formail -I "X-SpamProbe: $SCORE"
 
      :0:
      * ^X-SpamProbe: SPAM
      spam/.
 
    If SpamProbe classifies a message incorrectly, you can use the MH-E
 commands ‘J b’ (‘mh-junk-blacklist’) and ‘J w’ (‘mh-junk-whitelist’) to
 update SpamProbe’s training.
 
 Other Things You Can Do
 -----------------------
 
 There are a couple of things that you can add to ‘~/.procmailrc’ in
 order to filter out a lot of spam and viruses.  The first is to
 eliminate any message with a Windows executable (which is most likely a
 virus).  The second is to eliminate mail in character sets that you
 can’t read.
 
      PATH=$PATH:/usr/bin/mh
      MAILDIR=$HOME/`mhparam Path`
 
      #
      # Filter messages with w32 executables/viruses.
      #
      # These attachments are base64 and have a TVqQAAMAAAAEAAAA//8AALg
      # pattern. The string "this program cannot be run in MS-DOS mode"
      # encoded in base64 is 4fug4AtAnNIbg and helps to avoid false
      # positives (Roland Smith via Pete from the bogofilter mailing list).
      #
      :0 B:
      * ^Content-Transfer-Encoding:.*base64
      * ^TVqQAAMAAAAEAAAA//8AALg
      * 4fug4AtAnNIbg
      spam/exe/.
 
      #
      # Filter mail in unreadable character sets (from the Bogofilter FAQ).
      #
      UNREADABLE='[^?"]*big5|iso-2022-jp|ISO-2022-KR|euc-kr|gb2312|ks_c_5601-1987'
 
      :0:
      * 1^0 $ ^Subject:.*=\?($UNREADABLE)
      * 1^0 $ ^Content-Type:.*charset="?($UNREADABLE)
      spam/unreadable/.
 
      :0:
      * ^Content-Type:.*multipart
      * B ?? $ ^Content-Type:.*^?.*charset="?($UNREADABLE)
      spam/unreadable/.
 
    ---------- Footnotes ----------
 
    (1) Note that the option ‘mh-junk-background’ is used as the
 ‘display’ argument in the call to ‘call-process’.  Therefore, turning on
 this option means setting its value to ‘0’.  You can also set its value
 to ‘t’ to direct the programs’ output to the ‘*MH-E Log*’ buffer; this
 may be useful for debugging.