Tech ARP - Spam - The Digital Pest

	Date :	03 September 2003
	Manufacturer :	N/A
	Source :	N/A
	Category :	Rants
	Author :	Ken Ng
	Revision :	1.0

Spam - The Digital Pest

How Do Bayesian Spam Filters Work?

There are great articles that delve deep into the logic and mathematics of how these filters work but personally, the math looks like an alien language to me! All I need to know is that it works, and it works REALLY WELL!

But simply speaking, this is how a Bayesian spam filter works...

Initial Training

Let's assume that you have identified a bunch of good e-mails and a bunch of spam. You start by training the filter on the good e-mails: it records every word it finds in them and gives each one a probability of 0 of being spam. After working through all the good e-mails, the database will contain a list of "good" words, each with a zero (0) probability of being used in spam mails.

Then, the filter scans through all the spam mails to score the words found in them. If a word does not exist in the list of good words, it is given a score of 1, or a 100% probability of being used in spam. If the word is also found in the list of good words, the filter gives it a probability score between 0 and 1, based on the number of times it appears in the good e-mails and in the spam mails.
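To make the training step a little more concrete, here is a minimal Python sketch of the simplified scheme described above. The function name train, the whitespace-based word splitting and the plain ratio used for words found in both piles are all assumptions made for illustration. Real filters such as SpamBayes use a more elaborate, smoothed formula, so don't treat this as how any actual filter is implemented.

```python
from collections import Counter


def train(good_mails, spam_mails):
    """Build a word -> spam probability table from two lists of e-mail texts.

    Hypothetical helper: follows the simplified description in this article,
    not the formula used by any real filter.
    """
    # Count how often each word appears in the good pile and in the spam pile.
    ham_counts = Counter(w for mail in good_mails for w in mail.lower().split())
    spam_counts = Counter(w for mail in spam_mails for w in mail.lower().split())

    probs = {}
    for word in ham_counts.keys() | spam_counts.keys():
        ham, spam = ham_counts[word], spam_counts[word]
        # Only in good mail -> 0.0, only in spam -> 1.0.
        # In both -> share of its appearances that were in spam (an assumption).
        probs[word] = spam / (ham + spam)
    return probs
```

With this sketch, a word seen twice in good e-mails and once in spam would end up with a score of about 0.33.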

Spam Score: 0.999103

word          spamprob      #ham   #spam
'*H*'         0.00049726    -      -
'*S*'         0.998702      -      -
'tiny'        0.0787768     8      0
'setup'       0.102279      35     1
'hook'        0.136109      4      0
'recording'   0.136109      4      0

Sample of spam scoring by SpamBayes

At the end of this initial training, the database will have a list of words, each with its probability of being used in spam mails. Now, what happens next?

After The Initial Training

Whenever a new mail comes in, the filter uses the trained database to classify it. All it does is run through all the words in the e-mail and give each word a score based on the values listed in the database. Once that's done, the filter averages the scores to arrive at an overall score for the e-mail.

If the mail contains a lot of words that have a high probability of being used in spam mails, the overall score will be high. This identifies the e-mail as spam mail.

Of course, if the e-mail contains words that have a low probability of being used in spam mails, then the overall score will be low. This identifies the e-mail as being more likely to be a legitimate e-mail.
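The scoring step can be sketched in a few lines of Python too. Again, this is only an illustration of the averaging idea described above: the names spam_score and is_spam, the neutral 0.5 score for words the database has never seen, and the 0.9 cut-off are all assumptions of mine, not values taken from any real filter.

```python
def spam_score(mail, probs, unknown=0.5):
    """Average the per-word spam probabilities of one e-mail.

    probs is a word -> probability table from training.
    unknown is an assumed neutral score for words not in the table.
    """
    words = mail.lower().split()
    if not words:
        # Nothing to judge, so stay neutral.
        return unknown
    return sum(probs.get(w, unknown) for w in words) / len(words)


def is_spam(mail, probs, threshold=0.9):
    """Flag the mail as spam when its overall score reaches the (assumed) cut-off."""
    return spam_score(mail, probs) >= threshold
```

A mail made up mostly of words scored near 1 ends up with a high average and gets flagged, while a mail full of words scored near 0 stays well below the cut-off.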

Although this technique sounds rather simple, it produces surprisingly good results. In fact, after being trained with large samples of both legitimate e-mails and spam, Bayesian filters have been found capable of achieving a staggering 99% accuracy in identifying spam! As I write this article, the Bayesian spam filtering algorithm is still being improved upon, and some filters are already claiming over 99.8% accuracy!
