Friday, May 15, 2009

Classifying Spam

For the purposes of this discussion, let "spam" refer to "unsolicited bulk email". Not everyone agrees on this definition, but it's by far the most widely accepted, and without a working definition we won't be able to define "anti-spam". Thus, an email message is spam (for our present purposes) if it meets two criteria [ref: Spamhaus Technical Definition of Spam]:

1. Bulk: the recipient's personal identity and context are irrelevant because the message is equally applicable to many other potential recipients.

2. Unsolicited: the recipient has not verifiably granted deliberate, explicit, and still-revocable permission for it to be sent.

It's important to note that both these criteria must hold for a message to be spam. Many legitimate and wanted mailing lists are "bulk" in nature, and some personal communications are not explicitly requested but desired nonetheless. Point number two does not have to hold for every single recipient: the message is spam in those instances where both points hold, and not otherwise. It follows that the exact same message can be spam for one person, and not for another.
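The per-recipient nature of the definition can be made concrete with a small sketch. The names here (Delivery, is_spam, the example addresses) are hypothetical and only illustrate the logic; the two boolean fields are assumed to be known already, and the later paragraphs explain why that is the hard part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delivery:
    """One copy of a message as received by one recipient (hypothetical model)."""
    recipient: str
    is_bulk: bool         # criterion 1: equally applicable to many recipients
    has_permission: bool  # negation of criterion 2: recipient asked for it


def is_spam(delivery: Delivery) -> bool:
    # Both criteria must hold: bulk AND unsolicited.
    return delivery.is_bulk and not delivery.has_permission


# The same newsletter goes to a subscriber and to a non-subscriber.
subscribers = {"alice@example.org"}
deliveries = [
    Delivery(recipient, is_bulk=True, has_permission=recipient in subscribers)
    for recipient in ("alice@example.org", "bob@example.org")
]
verdicts = {d.recipient: is_spam(d) for d in deliveries}

# An unrequested but personal note is not bulk, so it is not spam.
personal_note = Delivery("carol@example.org", is_bulk=False, has_permission=False)
```

In this sketch the identical message is spam for the second recipient and not for the first, while the personal note fails the bulk test and so is not spam at all.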

The criteria are not highly precise. In point number one, the question of personal relevance has disputable edge cases. In point two, there may be a question as to whether a particular message was covered by the terms of the permission. From this latter observation, it follows that simple whitelists or subscriptions aren't entirely sufficient for expressing bulk mail requests: a recipient may request bulk mail on a particular subject, for example, and justifiably consider messages spam when they stray from that subject. The question as to whether a particular message is on a particular subject also has disputable edge cases.

This lack of precision doesn't prevent us (as human beings) from determining with some confidence whether an item is spam or not. The impersonal (or incorrectly personal) nature of bulk mail usually makes it obvious when point one holds. We can simply ask the question, "is this message personally relevant to me?" Point two is a judgment easily made from knowledge of what we have and have not requested, although disputed edge cases may require arbitration to resolve.

The criteria don't reduce to simple, mechanically detectable conditions, however. Large email providers in particular can get a reliable indication of bulk delivery (which implies criterion one) by comparing incoming messages across accounts, although this technique can be hampered by messages salted with random elements. Mechanically, permission can be expressed only in its most explicit form, a whitelist, so the terms of permission can't be nuanced. Unsolicited messages can be detected at "spamtrap" addresses (which solicit nothing), but we can only surmise that other recipients of the same message did not request it.
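As a rough illustration of cross-account comparison and spamtraps, here is a minimal sketch. It assumes a provider can see (account, body) pairs for incoming mail; the function names, the threshold of three accounts, and the exact-hash fingerprint are all assumptions made for this example, not a description of how any real provider works.

```python
import hashlib
from collections import defaultdict


def fingerprint(body: str) -> str:
    """Hash the body after lowercasing and collapsing whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def bulk_fingerprints(deliveries, threshold=3):
    """Return fingerprints seen in at least `threshold` distinct accounts."""
    accounts_by_fp = defaultdict(set)
    for account, body in deliveries:
        accounts_by_fp[fingerprint(body)].add(account)
    return {fp for fp, seen in accounts_by_fp.items() if len(seen) >= threshold}


def trapped_fingerprints(deliveries, spamtraps):
    """Return fingerprints of messages that arrived at a spamtrap address."""
    return {fingerprint(body) for account, body in deliveries if account in spamtraps}


offer = "Cheap watches,  buy now!"
deliveries = [
    ("a@example.org", offer),
    ("b@example.org", offer.upper()),  # differs only in letter case
    ("trap@example.org", offer),       # hypothetical spamtrap address
    ("c@example.org", "Lunch on Tuesday?"),
]
# Salting: each copy carries a different trailing token.
salted = [(f"user{i}@example.org", f"{offer} ref {i * 7919}") for i in range(3)]

bulk = bulk_fingerprints(deliveries)
trapped = trapped_fingerprints(deliveries, {"trap@example.org"})
salted_bulk = bulk_fingerprints(salted)
```

Here the unsalted offer is flagged as bulk and also shows up at the spamtrap, but the salted copies each hash differently and slip past the exact-match comparison. Catching them would take some form of approximate matching, and even then the spamtrap hit only tells us that one recipient did not ask for the message.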
