Re: automated spam detection
- Date: Wed, 17 Feb 99 10:01:05 +0000 (GMT)
Ragnar Lonn writes:
> The thing I didn't like about the idea, however, was that it
> required an SMTP server to connect to the central server once for
> each message it received - of course, it should be done with UDP or
> something but still, the idea was that the SMTP server, when
> receiving a mail, would wait for the response from the central
> server before accepting it for delivery and I think that would lead
> to unacceptable congestion no matter how distributed and how much
> computing power you throw at the centralized server system. To do it
> like you suggest, letting the central server update a client when
> something is spam, would perhaps work better in that respect but it
> would, as I wrote above, mean that the central server decided for
> the participants what was spam and what wasn't and I don't think
> that's something all sites are willing to accept.
The central server doesn't have to use the same policy for all client
sites, of course.
> E.g. site A receives 51 mail messages with identical message bodies
> and is configured to automatically warn its 'friends' about messages
> that appear more than 50 times but it might not consider a message
> spam and start rejecting it before it appears 100 times. It warns
> system B and tells B that "Message AABBCCDD has been seen here 51
> times" which means that B can immediately increase *its* counter for
> that message by 51, which might mean B feels obligated to warn C or
> maybe the counter gets high enough that B starts thinking of that
> message as pure spam. Message body checksums are of course only one
> way of detecting (some) spam. There are lots of other variables that
> could be used.
That sounds like a good idea.
I would say that received notifications should not be passed on - in
the above scenario, A may or may not be in communication with C, and
if B guesses wrong about that then C may count those 51 messages
twice or not at all.
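
To make that concrete, here is a rough sketch of the scheme with
notifications that are never passed on. Everything in it is made up for
illustration: the Site class, its method names, and the choice to keep
the latest count per originating site (so a repeated warning replaces
the old figure rather than adding to it). Only the 50/100 thresholds
and the checksum AABBCCDD come from the example above.

```python
from collections import defaultdict

WARN_AT = 50     # warn friends once a message has been seen more than this often locally
REJECT_AT = 100  # treat as spam once the combined count reaches this


class Site:
    """Hypothetical participant; not taken from any real MTA."""

    def __init__(self, name):
        self.name = name
        self.friends = []                  # sites we notify directly
        self.local = defaultdict(int)      # checksum -> copies seen here
        self.reported = defaultdict(dict)  # checksum -> {origin: its local count}

    def total(self, checksum):
        # Own sightings plus the latest figure from each site that warned us.
        return self.local[checksum] + sum(self.reported[checksum].values())

    def receive_mail(self, checksum):
        self.local[checksum] += 1
        if self.local[checksum] > WARN_AT:
            # Only our own observations are sent out.
            for friend in self.friends:
                friend.notify(self.name, checksum, self.local[checksum])

    def notify(self, origin, checksum, count):
        # Record the count, but do NOT forward it to our own friends.
        self.reported[checksum][origin] = count

    def is_spam(self, checksum):
        return self.total(checksum) >= REJECT_AT


a, b, c = Site("A"), Site("B"), Site("C")
a.friends = [b, c]   # A happens to talk to C as well
b.friends = [c]

for _ in range(51):
    a.receive_mail("AABBCCDD")

print(b.total("AABBCCDD"), c.total("AABBCCDD"))
```

Here C ends up with 51. Had B passed A's warning on, C would have
counted the same 51 messages twice and reached 102, which is over the
rejection threshold.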
This doesn't mean you need a clique (i.e. every host connected to
every other) for correct operation - just that without one you won't
get a 100% accurate picture of how many times a message has been
duplicated, only a lower bound, which means you err on the side of
letting messages through.
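
A small sketch of the lower-bound point, under a simplifying
assumption that is mine rather than part of the proposal: every site
reports all of its local sightings to its direct neighbours and nothing
is forwarded. The function name, the per-site counts and the two
topologies are invented for the example.

```python
def estimates(local_counts, links):
    """Each site's view: its own count plus its direct neighbours' counts.

    local_counts: site name -> copies seen locally
    links: set of frozenset({x, y}) pairs that exchange notifications
    """
    result = {}
    for site, own in local_counts.items():
        neighbours = [other for other in local_counts
                      if other != site and frozenset((site, other)) in links]
        result[site] = own + sum(local_counts[n] for n in neighbours)
    return result


local_counts = {"A": 51, "B": 30, "C": 30}
true_total = sum(local_counts.values())   # 111 copies overall

chain = {frozenset(("A", "B")), frozenset(("B", "C"))}   # A and C not linked
clique = chain | {frozenset(("A", "C"))}                 # everyone linked

print(estimates(local_counts, chain))    # {'A': 81, 'B': 111, 'C': 60}
print(estimates(local_counts, clique))   # {'A': 111, 'B': 111, 'C': 111}
```

In the chain, A and C each see fewer than the true 111 copies but never
more, so a site can only be too lenient, not too strict.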
The point about legitimate mailing lists is still a problem though...
ttfn/rjk