automated spam detection
- Date: Tue, 16 Feb 99 15:22:10 +0000 (GMT)
These thoughts derive from some recent discussion I spotted on
Usenet...
One of the problems we have with spam is that an individual has no way
of telling whether a piece of mail they have received was part of a
mailshot of a million messages, or was sent to them alone.
I'd usually be annoyed by receiving UCE even if it wasn't sent to
anyone but me - but a single message wouldn't represent a major cost
for ISPs in the way bulk mail can do.
In the case of Usenet spam, the problem is easy: a server typically
sees all the messages, and there are automated spam detection systems
which spot and (in some sense) kill spam.
It'd be useful to have the same kind of ability with email. To make
this work there would have to be a central location where MTAs could
register, in real time, some kind of identifying information about the
messages they received: the MD5 hash of the body and selected header
fields might be a good place to start.
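As a rough illustration of the kind of fingerprint meant here, the following Python sketch hashes the body together with a chosen set of headers. The function name, the choice of Subject as the only "selected" header, and the line-ending normalisation are all assumptions made for the example, not part of any existing scheme.

```python
import hashlib
from email import message_from_bytes

# Hypothetical choice: the note only says "selected header fields".
SELECTED_HEADERS = ("Subject",)

def message_fingerprint(raw_message, headers=SELECTED_HEADERS):
    """Return a hex MD5 digest over the selected headers and the body."""
    # Normalise line endings so the same text hashes identically
    # whichever relay it passed through (an assumption of this sketch).
    raw = raw_message.replace(b"\r\n", b"\n")
    _head, _sep, body = raw.partition(b"\n\n")
    msg = message_from_bytes(raw)

    digest = hashlib.md5()
    for name in headers:
        value = str(msg.get(name, "")).strip()
        digest.update(name.lower().encode("ascii") + b":")
        digest.update(value.encode("utf-8", "replace") + b"\n")
    digest.update(b"\n")
    digest.update(body)
    return digest.hexdigest()
```

Per-recipient headers such as To and Received are deliberately left out, so that copies of one mailshot delivered to different people produce the same value.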
Once enough instances of the same hash had been registered, the
central server could notify subscribing MTAs that a particular message
was deemed bulk mail, and they could then choose to ignore it.
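The counting side might look something like the sketch below. The class name, method names and default threshold are invented for illustration; a real server would also need to expire old entries, which is not shown.

```python
from collections import Counter

class BulkRegistry:
    """Counts sightings of each hash and flags those seen often enough."""

    def __init__(self, threshold=100):
        self.threshold = threshold   # hypothetical default
        self.counts = Counter()
        self.flagged = set()

    def register(self, fingerprint):
        """Record one sighting.

        Returns True only on the sighting that first pushes the hash
        over the threshold, i.e. when subscribers should be notified.
        """
        self.counts[fingerprint] += 1
        if (self.counts[fingerprint] >= self.threshold
                and fingerprint not in self.flagged):
            self.flagged.add(fingerprint)
            return True
        return False

    def is_bulk(self, fingerprint):
        """True if this hash has already been deemed bulk mail."""
        return fingerprint in self.flagged
```

A subscribing MTA would then consult something like is_bulk() (or a local copy of the flagged set) before deciding what to do with a message.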
If you like, this is the equivalent of the RBL, but instead of
blackholing particular hosts, it blackholes particular messages. (So
some of the arguments for and against existing blacklists are also
arguments for and against this scheme.)
There would probably be multiple "central" locations - more than one
per country wouldn't surprise me.
What are the potential problems of this approach?
There is the extra load on mail relays. CPU is pretty cheap these
days, and my gut feeling is that the overhead of calculating the hash
of every message would not be prohibitive; but experimentation may be
needed.
Then there is the extra network load as all the relays communicate
with the central server. Actually this does not have to be such a
problem - the protocol can be lossy provided it is not *too* lossy; if
10% of packets don't get through then you just have to reduce your
thresholds by 10% to get the same results as if all packets passed.
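To spell out that arithmetic: if a fraction p of reports is lost, a mailing that would have produced N reports produces about N(1-p), so a threshold T should become about T(1-p). A tiny Python sketch, with a made-up function name and assuming losses are roughly uniform:

```python
def adjusted_threshold(threshold, loss_rate):
    """Scale a detection threshold to allow for lost reports.

    loss_rate is the fraction of reports expected to go missing
    (0.1 for 10%). Assumes losses affect all senders about equally.
    """
    if not 0 <= loss_rate < 1:
        raise ValueError("loss_rate must be in [0, 1)")
    return round(threshold * (1 - loss_rate))
```

For example, a threshold of 1000 with 10% loss becomes 900.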
You could simply put one (or more) hashes you'd seen in a UDP packet and
send it off, and not care if it arrived. That could amount to less
than 1 extra IP packet per email message you received - it might be no
more extra bandwidth than the ESMTP banner you already send.
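A fire-and-forget report along these lines could be as small as the following Python sketch. The packet layout (raw 16-byte MD5 digests packed end to end, with nothing else in the packet) and the function names are assumptions for the example; no such protocol has been defined.

```python
import socket

MD5_LEN = 16  # bytes in a raw MD5 digest

def pack_hashes(hex_digests):
    """Pack hex MD5 digests into one datagram payload."""
    return b"".join(bytes.fromhex(h) for h in hex_digests)

def unpack_hashes(payload):
    """Split a payload back into hex digests, ignoring a trailing fragment."""
    return [payload[i:i + MD5_LEN].hex()
            for i in range(0, len(payload) - MD5_LEN + 1, MD5_LEN)]

def report_hashes(hex_digests, server_addr):
    """Send the digests to the registry and do not wait for any reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(pack_hashes(hex_digests), server_addr)
```

Batching several digests into each datagram is what would bring the cost below one packet per message received.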
The announcements that particular hashes are part of a bulk mailing
could perhaps be done by multicast?
There are privacy questions. You certainly don't want to leak
information about what all your email messages actually are, hence my
suggestion of MD5.
But would users mind the fact that a number identifying the email they
received was being sent to some third party, even if they were told
that it was completely impractical to actually determine what the real
message was?
Also an ISP using this scheme would in effect be telling the central
server what level of mail traffic it carried, which some might
consider commercially sensitive information.
Comments?
ttfn/rjk