Monday, May 16, 2005

Spam Filters Explained by Alan Hearnshaw



Spam Filters Explained
What do they do? How do they work? Which one is right for me?
By Alan Hearnshaw

Spam is a very real problem that many people have to deal with on a daily basis. For those who have decided to do something about it and have started to investigate spam filtering, this article provides a brief introduction to the options and the types of spam filters available.

Despite the bewildering array of spam filters available today, each claiming to be the best of its kind, there are really just five filtering methodologies in general use, and all products rely on one of these or a combination of them:

Content-Based Filters
In the beginning, there were content-based filters.

These filters scan the contents of the message and look for tell-tale signs that the message is spam. In the early days of spamming it was quite simple to look out for "kill words" such as "Lose Weight" and mark a message as spam if one was found.

Very soon though, spammers got wise to this and started resorting to all kinds of tricks to get their message past the filters. The days of obfuscation had begun.
We started getting messages containing the phrase "L0se Welght" (notice the zero for "o" and the "l" for "i") and even more bizarre, and sometimes quite ingenious, variations.
This rendered basic content-based filters somewhat ineffective, although there are one or two on the market now that are clever enough to see through these attempts and still provide good results.
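
To make the idea concrete, here is a minimal sketch of a kill-word filter that tries to see through simple obfuscation. The phrase list, the substitution table and the function names are all assumptions made for illustration; they are not taken from any particular product.

```python
import re

# Hypothetical kill-phrase list; a real filter would use a much larger one.
KILL_PHRASES = ["lose weight", "free money"]

# Assumed look-alike substitutions (a zero for "o", a one for "i", and so on).
LOOKALIKES = str.maketrans({"0": "o", "1": "i", "3": "e", "@": "a", "$": "s"})


def normalize(text):
    """Lower-case the text, undo digit/symbol swaps, and collapse whitespace."""
    text = text.lower().translate(LOOKALIKES)
    return re.sub(r"\s+", " ", text)


def _letter_pattern(ch):
    # "i" and "l" are often swapped for each other, so accept either one.
    return "[il]" if ch in "il" else re.escape(ch)


def is_spam(message):
    """Return True if any kill phrase appears, allowing for simple obfuscation."""
    cleaned = normalize(message)
    for phrase in KILL_PHRASES:
        pattern = "".join(_letter_pattern(ch) for ch in phrase)
        if re.search(pattern, cleaned):
            return True
    return False
```

With this sketch, both "Lose Weight" and "L0se Welght" are flagged, while an ordinary message is not. Each new trick still needs a new rule, which is exactly the weakness described above.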

Bayesian Based Filters
The Reverend Bayes comes to the rescue

Born in London in 1702, the son of a minister, Thomas Bayes developed a formula which allowed him to determine the probability of an event occurring based on the probabilities of two or more independent evidentiary events.

Bayesian filters learn from studying known good and bad messages. Each message is split into single-word "bites", or tokens, and these tokens are placed into a database along with how often they are found in each kind of message.
When a new message arrives to be tested by the filter, it is also split into tokens and each token is looked up in the database. Extrapolating results from the database and applying a form of the good reverend's formula, known as a Naive Bayesian formula, the filter gives the message a "spamicity" rating, and the message can be dealt with accordingly.
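
The train-then-score cycle described above can be sketched roughly as follows. This is a toy illustration only: the class name, the tokenizer and the add-one smoothing are assumptions, and commercial filters use more refined variants of the calculation.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Toy token-frequency filter; names and smoothing are illustrative."""

    def __init__(self):
        # The "database": how often each token was seen in each kind of message.
        self.token_counts = {"spam": Counter(), "good": Counter()}
        self.message_counts = {"spam": 0, "good": 0}

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        """Record the tokens of a known message; label is 'spam' or 'good'."""
        self.message_counts[label] += 1
        self.token_counts[label].update(self.tokenize(text))

    def spamicity(self, text):
        """Return the estimated probability (0 to 1) that the message is spam."""
        vocab = set(self.token_counts["spam"]) | set(self.token_counts["good"])
        total_msgs = sum(self.message_counts.values())
        log_score = {}
        for label in ("spam", "good"):
            # Prior: the (smoothed) share of training messages with this label.
            score = math.log((self.message_counts[label] + 1) / (total_msgs + 2))
            total_tokens = sum(self.token_counts[label].values())
            for token in self.tokenize(text):
                # Add-one smoothing so an unseen token never gives zero.
                count = self.token_counts[label][token] + 1
                score += math.log(count / (total_tokens + len(vocab) + 1))
            log_score[label] = score
        # Turn the two log scores into a probability; clamp to avoid overflow.
        diff = max(-700.0, min(700.0, log_score["good"] - log_score["spam"]))
        return 1.0 / (1.0 + math.exp(diff))
```

After training on a handful of known messages of each kind, `spamicity()` returns a value near 1 for text that resembles the spam examples and near 0 for text that resembles the good ones.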

Bayesian filters typically are capable of achieving very good accuracy rates (>97% is not uncommon), and require very little on-going maintenance.

Whitelist/Blacklist Filters
Who goes there, friend or foe?

This very basic form of filtering is seldom used on its own nowadays, but can be useful as part of a larger filtering strategy.

A whitelist is nothing more than a list of e-mail addresses from which you wish to accept communications. A whitelist filter would only accept messages from these people and all others would be rejected.

A blacklist, conversely, is a list of e-mail addresses - and sometimes IP Addresses (computer identification addresses) - from which communications will not be accepted.

While this may seem like a good idea at the outset, a whitelist methodology is too restrictive for most people. A personal blacklist is not much better: as virtually all spam e-mails carry a forged "From" address, there is little point in collecting this address to ban it in future, as it is very unlikely to be the same next time.
There are bodies on the internet that maintain lists of known bad sources of e-mail. Many filters today have the ability to query these servers to see if the message they are looking at comes from a source identified by such an Internet-based blacklist, or RBL (Realtime Blackhole List). While quite effective, these lists do tend to suffer from false positives, where good messages are incorrectly identified as spam. This happens often with newsletters.
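
As a rough sketch of both ideas, the code below checks a sender against local lists and builds the hostname that a DNS-based blacklist lookup would resolve. The addresses, the zone name `dnsbl.example.org` and the function names are placeholders assumed for illustration. The sketch only constructs the query name and performs no network lookup; by convention, if that name resolves, the source is listed.

```python
import ipaddress

# Hypothetical lists for illustration only.
WHITELIST = {"friend@example.com"}
BLACKLIST = {"bulk@example.net"}


def classify_sender(address):
    """Return 'accept', 'reject', or 'unknown' for a sender address."""
    address = address.strip().lower()
    if address in WHITELIST:
        return "accept"
    if address in BLACKLIST:
        return "reject"
    return "unknown"


def dnsbl_query_name(ip, zone):
    """Build the hostname a DNS-based blacklist lookup would resolve.

    The IPv4 octets are reversed and the list's zone is appended.
    Raises ValueError if ip is not a valid IPv4 address.
    """
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone
```

A strict whitelist filter would reject everything that comes back as "unknown"; a combined strategy would pass those messages on to another kind of filter instead.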

Challenge/Response Filters
Open sesame!

Challenge/Response filters are characterised by their ability to automatically send a response to a previously unknown sender asking them to take some further action before their message will be delivered. This is often referred to as a "Turing Test" - named after a test devised by British mathematician Alan Turing to determine if machines could think.
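
The hold-and-release flow can be sketched as below. The class and method names are assumptions for illustration, and the step that would actually e-mail the challenge to the sender is left as a comment.

```python
import secrets


class ChallengeResponseFilter:
    """Holds mail from unknown senders until they answer a challenge."""

    def __init__(self, known_senders=None):
        self.known_senders = set(known_senders or [])
        self.pending = {}  # challenge token -> (sender, message)

    def receive(self, sender, message):
        """Deliver mail from known senders; otherwise hold it and issue a token."""
        if sender in self.known_senders:
            return ("delivered", message)
        token = secrets.token_urlsafe(16)
        self.pending[token] = (sender, message)
        # A real system would now e-mail the token (or a link) to the sender.
        return ("challenged", token)

    def respond(self, token):
        """Release the held message if the token is valid, else return None."""
        held = self.pending.pop(token, None)
        if held is None:
            return None
        sender, message = held
        self.known_senders.add(sender)  # future mail skips the challenge
        return message
```

In this sketch a sender who answers once is remembered, so only the first message from a new correspondent is delayed.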

Recent years have seen the appearance of some internet services which automatically perform this Challenge/Response function for the user and require the sender of an e-mail to visit their web site to facilitate the receipt of their message.

Critics of this system claim it to be too drastic a measure and that it sends a message that "my time is more important than yours" to the people trying to communicate with you.

For some low traffic e-mail users though, this system alone may be a perfectly acceptable method of completely eliminating spam from their inbox - one step above the "Whitelist" system outlined above.

Community Filters
A united front

These types of filters work on the principle of "communal knowledge" of spam. When a user receives a spam message, they simply mark it as such in their filter. This information is sent to a central server where a fingerprint of the message is stored.
After enough people have voted this message to be spam, then it is stopped from reaching all the other people in the community.
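
A minimal sketch of the voting scheme follows. The fingerprint here is simply a SHA-256 hash of the normalised text, and the vote threshold is an arbitrary assumed value; real services use fuzzier fingerprints that survive small changes to the message.

```python
import hashlib
import re


class CommunityServer:
    """Toy central server that counts spam votes per message fingerprint."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # assumed number of distinct voters needed
        self.votes = {}  # fingerprint -> set of user ids who voted

    @staticmethod
    def fingerprint(message):
        # Normalise case and whitespace so trivial edits share a fingerprint.
        canonical = re.sub(r"\s+", " ", message.strip().lower())
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def report_spam(self, user_id, message):
        """Record one user's vote; repeat votes by the same user count once."""
        self.votes.setdefault(self.fingerprint(message), set()).add(user_id)

    def is_blocked(self, message):
        """Return True once enough distinct users have voted the message spam."""
        voters = self.votes.get(self.fingerprint(message), set())
        return len(voters) >= self.threshold
```

Storing a set of voters rather than a simple counter means one enthusiastic user cannot block a message alone.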

This type of filtering can prove to be quite effective, although it stands to reason that it can never be 100% effective, as a few people have to receive the spam for it to be flagged in the first place. Just like its close cousin the Internet blacklist (RBL), this system can also suffer from false positives, or messages incorrectly identified as spam.

Hopefully you are now armed with a little more information to be able to make an informed decision on the best spam filter for you.
For further information, consider reading the reviews and articles found at http://www.whichspamfilter.com

Alan Hearnshaw is the owner of http://www.whichspamfilter.com, a web site which conducts weekly in-depth reviews of current spam filters, provides help and guidance in the fight against spam and provides a useful community forum.
alan@whichspamfilter.com

About the Author
Alan Hearnshaw is a computer programmer and the owner of http://www.WhichSpamFilter.com, a site which provides weekly in-depth spam filter reviews, user help and guidance and a community forum.

So What Makes a Good Spam Filter Anyway? by Alan Hearnshaw



So What Makes a Good Spam Filter Anyway?
By Alan Hearnshaw

Spam Filters. Most of us know we need one. Some of us know we need a better one, but how many stop to think what actually makes a good spam filter in the first place?

This is not just a rhetorical question. It is a question that many users - and many developers - do not ask, and that consequently goes unanswered.

Maybe this could be better answered by defining here the qualities of the perfect spam filter. We'll call our perfect spam filter the "SpamSplatter 3000". Here are some of the defining qualities of the SpamSplatter 3000:

1. It requires zero interaction from the user.
2. It produces zero false positives (good messages identified as bad) and zero false negatives (bad messages identified as good).
3. It is transparent - that is, you only ever see good messages and never need even be aware that spam exists.

That's it. Not much of a shopping list, is it?
Of course, the SpamSplatter 3000 hasn't been invented yet (and if it is, I want a piece of the action), but it does give us a frame of reference when looking for the best filter we can find.

Let's take each point in turn:

It requires zero interaction from the user
There are two kinds of filters that come near to this ideal currently: Bayesian Filters and Community Filters.
Bayesian filters strip messages down to small word "bites", or tokens, and maintain a database containing lists of good and bad tokens. When a new message is encountered, the filter strips this message down to tokens, compares them to the database, and applies a calculation based on the probability formula of the British mathematician Thomas Bayes.
Over time, the Bayesian filter learns the characteristics of spam messages.

Community Filters simply work on a voting system whereby every user that receives a spam message votes it as spam. This information is stored on a central server and when enough votes are received the message is banned from all users in the community.

As can be seen, the user interaction with these types of filters is mainly limited to a two-button operation - correcting wrongly identified messages - and the more accurate the filter, the less those buttons are used.

OK, so that's pretty good. Not exactly zero interaction, but if the filter is accurate enough, then it should be pretty near. That brings us to point two:

It produces zero false positives or negatives
This is the area in which most spam filter development is concentrating and things are getting pretty good nowadays. It is not at all unusual to see an efficient modern filter achieve accuracy of 96% or better. It is, of course, far better to have a false negative than a false positive if you are ever going to tear yourself away from the killed mail folder!
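
A single accuracy figure hides the split between the two kinds of error, which is why the two rates are worth computing separately. The function below is a small sketch, and the counts in the usage note are made-up numbers chosen only to illustrate the arithmetic, not measurements of any real filter.

```python
def filter_rates(spam_caught, spam_missed, good_delivered, good_blocked):
    """Compute accuracy and error rates from four message counts.

    Assumes at least one spam message and one good message were seen.
    """
    total = spam_caught + spam_missed + good_delivered + good_blocked
    return {
        # Share of all messages that were classified correctly.
        "accuracy": (spam_caught + good_delivered) / total,
        # Share of good messages wrongly treated as spam.
        "false_positive_rate": good_blocked / (good_delivered + good_blocked),
        # Share of spam messages wrongly let through.
        "false_negative_rate": spam_missed / (spam_caught + spam_missed),
    }
```

For example, on a hypothetical batch of 600 spam and 400 good messages, `filter_rates(570, 30, 390, 10)` and `filter_rates(600, 0, 360, 40)` both report 96% accuracy, yet the second blocks 10% of the good mail while the first blocks only 2.5%.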

Of course, by definition, community filters cannot reach 100% accuracy as someone has to be getting the spam to be voting it as such!
Theoretically, a Bayesian filter may be able to eventually get quite close to 100% accuracy, so at least there is hope there.
Content-based filters (those that look for certain words, phrases or other indicators in a message to identify it as spam) will almost certainly not get much higher accuracy figures than the best of them can achieve today. Adapting to changing spam requires new filters to be created on an ongoing basis.

And finally, we come to the holy grail of spam filtering:

It is transparent
Strangely enough, not enough work seems to be done in trying to achieve this goal. Some of the best filters on the market today identify spam with impressive accuracy and then simply place it in a "killed mail" folder for your later perusal.
Now, forgive me if I'm missing something here, but isn't the point to save you having to wade through the junk mail? Isn't that what you bought the filter for? With the SpamSplatter 3000, you don't need to do that.

As we haven't achieved 100% accuracy yet (and probably never will), the only way to free us from checking the killed mail folder is a challenge/response system. This is where a message is automatically sent back to the sender requiring them to take some action for their message to actually be delivered.

Some systems tend to go overboard with the challenge/response system. These systems - often called "Whitelist" systems - block messages from anyone who isn't in the user's friends list. Guaranteed 100% effective, but too drastic a measure for most users.

Now, it seems that the most intelligent use of this system would be to send challenges only in response to messages that were flagged as questionable. Good messages can be delivered, definite spam can be deleted and questionable ones would earn themselves a challenge message.
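
That three-way split might be sketched as follows, assuming the filter produces a spam score between 0 and 1. The function name and the two cut-off values are arbitrary assumptions that a real product would need to tune.

```python
def triage(spam_score, low=0.2, high=0.9):
    """Decide what to do with a message given its spam score (0 to 1).

    The cut-offs are illustrative: at or below `low` the message is
    treated as good, at or above `high` as definite spam, and anything
    in between as questionable.
    """
    if not 0.0 <= spam_score <= 1.0:
        raise ValueError("spam_score must be between 0 and 1")
    if spam_score >= high:
        return "delete"
    if spam_score <= low:
        return "deliver"
    return "challenge"
```

Only the middle band generates challenge messages, so most legitimate senders would never see one.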

So, to sum up, let's rewrite the qualities of our perfect filter and get a shopping list of what to look for while we wait for the SpamSplatter 3000 to arrive:

1. Simple, minimal setup and maintenance.
2. Extremely low rate of false positives and as few false negatives as possible.
3. A transparent fail-safe mechanism whereby the victims of those false positives can force the message through to you.

It's simple really. Now, who's going to build me this SpamSplatter 3000?

Alan Hearnshaw is the owner of http://www.WhichSpamFilter.com, a site which provides weekly in-depth spam filter reviews, user help and guidance and a community forum.
alan@whichspamfilter.com


About the Author
Alan Hearnshaw is a computer programmer and the owner of http://www.WhichSpamFilter.com, a site which provides weekly in-depth spam filter reviews, user help and guidance and a community forum.

Anti-Phishing Bill Introduced To Congress by Richard A. Chapo



Sen. Patrick J. Leahy has introduced the Anti-Phishing Act
of 2005 to Congress for consideration. The Act would allow
federal prosecutors to seek fines of up to $250,000 and
prison sentences of up to five years against individuals
convicted of promoting phishing scams. Online parody and
political speech sites would be excluded from prosecution.

Phishing is an online scam used to deceive computer users
into giving up personal information such as social security
numbers and passwords. Phishing scams usually involve email
messages requesting the verification of personal information
from a familiar business. Readers are provided a link that
sends them to what appears to be the site of the company in
question. The reader is then asked to verify their account
information by providing their name, address, social
security number, account number, etc.

In truth, the site is an illegal copy of the business in
question and the reader's information is collected for later
fraudulent use including identity theft. Consumers are
estimated to lose hundreds of millions of dollars a year to
phishing scams. Undoubtedly, you have received more than a
few of these emails.

Phishing emails are most likely to use the sites of banks,
credit card companies, and large retailers. Online companies
such as eBay, PayPal and EarthLink have had similar
problems. One particularly aggressive group even imitated the
site of the IRS.

In April 2004, the IRS warned consumers that scam artists
were sending emails purportedly from the IRS. Consumers
received emails claiming they were under investigation for
tax fraud and subject to prosecution. The emails contained
language telling recipients they could help the
investigation by providing real information and directed
them to a website that was derivative of the IRS site.
Consumers were then asked to provide detailed personal
information to dispute the charge. Since most people fear
the IRS, one can assume that a large number of people took
the phishing bait.

Commentary

The Anti-Phishing Act of 2005 is a nice start to combating
scam artists that use phishing to pilfer money from
consumers. The Act, however, will not put an end to
deceptive phishing practices if it is passed. The reason
involves jurisdictional issues.

A large percentage of the individuals promoting phishing
scams reside outside of the United States. While they may
take notice of the law, it will have no discernible effect
on their fraudulent scams. Until there is an international
response, phishing scams will continue to be a problem.
Nonetheless, Senator Leahy should be commended for
initiating efforts to deal with this growing problem.

About the Author
Richard Chapo is the lead attorney for the law firm
http://www.SanDiegoBusinessLawFirm.com - a firm providing
legal advice to California businesses. This article is for
general education purposes and does not address every facet
of the subject matter. Nothing in this article creates an
attorney-client relationship.