 03 September 2003
 Rants
 Ken Ng
 1.0
Spam - The Digital Pest

How Do Bayesian Spam Filters Work?

There are great articles that delve deep into the logic and mathematics of how these filters work but, personally, the math looks like an alien language to me! All I need to know is that it works, and it works REALLY WELL!

But simply speaking, this is how a Bayesian spam filter works...

 

Initial Training

Let's assume that you have identified a bunch of good e-mails and a bunch of spam mails. You start by training the filter to identify all the words that are in the good e-mails and give them a spam probability of 0. After working through all the good e-mails, the database will contain a list of "good" words which have a zero (0) probability of being used in spam mails.

Then, the filter scans through all the spam mails to score the words found in them. If a word does not exist in the list of good words, it will be given a score of 1, or a 100% probability of being used in spam. If the word is also found in the list of good words, the filter gives it a probability score based on the number of times it appears in the good e-mails and in the spam mails.
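The training steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular filter is implemented: the `tokenize` and `train` functions, the sample mails, and the simple `spam / (ham + spam)` ratio for words seen in both piles are all assumptions made for the example. Real filters typically also adjust for the size of each pile and keep scores away from exactly 0 and 1.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def train(good_mails, spam_mails):
    """Return a {word: spam probability} database built from two piles of mail."""
    ham_counts = Counter()
    spam_counts = Counter()
    for mail in good_mails:
        ham_counts.update(set(tokenize(mail)))   # count each word once per mail
    for mail in spam_mails:
        spam_counts.update(set(tokenize(mail)))

    database = {}
    for word in set(ham_counts) | set(spam_counts):
        ham = ham_counts[word]
        spam = spam_counts[word]
        # Only in good mail -> 0.0, only in spam -> 1.0,
        # in both -> share of its appearances that were in spam (assumed formula).
        database[word] = spam / (ham + spam)
    return database

# Hypothetical training mails, made up for this example.
good_mails = ["Meeting setup for the recording studio", "Lunch at noon"]
spam_mails = ["Free money now", "Free setup offer"]
database = train(good_mails, spam_mails)
```

Here, "lunch" appears only in good mail, "free" only in spam, and "setup" once in each, so they land at the bottom, top, and middle of the scale respectively.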

Spam Score: 0.999103

word          spamprob      #ham   #spam
'*H*'         0.00049726    -      -
'*S*'         0.998702      -      -
'tiny'        0.0787768     8      0
'setup'       0.102279      35     1
'hook'        0.136109      4      0
'recording'   0.136109      4      0

Sample of spam scoring by SpamBayes
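As a quick sanity check on the sample, the '*H*' and '*S*' rows look like the overall ham and spam evidence for the whole message rather than real words. Assuming SpamBayes combines them as (S - H + 1) / 2, which is my reading and not something stated in the sample itself, the arithmetic can be checked like this:

```python
# Values copied from the '*H*' and '*S*' rows of the sample above.
ham_evidence = 0.00049726
spam_evidence = 0.998702

# Assumed combining rule: shift the difference into the 0..1 range.
combined = (spam_evidence - ham_evidence + 1.0) / 2.0
print(combined)
```

The result agrees with the reported Spam Score of 0.999103 to within the rounding of the displayed values.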

At the end of this initial training, the database will now have a list of words with their probability of usage in spam mails. Now, what happens next?

 

After The Initial Training

Whenever a new mail comes in, the filter uses the new database to classify it. All it does is run through all the words in the e-mail and give each word a score based on the values listed in the database. Once that's done, the filter averages the scores to arrive at an overall score for the e-mail.

If the mail contains a lot of words that have a high probability of being used in spam mails, the overall score will be high. This identifies the e-mail as spam mail.

Of course, if the e-mail contains words that have a low probability of being used in spam mails, then the overall score will be low. This identifies the e-mail as being more likely to be a legitimate e-mail.
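The scoring step described above can be sketched as follows. Everything here is illustrative: the `DATABASE` values are made up, and skipping unknown words, returning a neutral 0.5 when nothing is recognized, and the 0.5 cut-off are assumptions of this example. Real Bayesian filters also tend to combine the word probabilities with Bayes' rule or a related statistical test rather than a plain average, but the average keeps the idea easy to follow.

```python
import re

# Hypothetical word database of spam probabilities (made-up values).
DATABASE = {
    "free": 0.95, "money": 0.90, "offer": 0.80,
    "meeting": 0.05, "lunch": 0.10, "setup": 0.10,
}

def score_mail(text, database):
    """Average the spam probabilities of the known words in a mail."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    scores = [database[w] for w in words if w in database]
    if not scores:
        return 0.5  # nothing recognized: treat as undecided (assumption)
    return sum(scores) / len(scores)

def is_spam(text, database, threshold=0.5):
    """Flag the mail as spam when its overall score exceeds the cut-off."""
    return score_mail(text, database) > threshold

spam_score = score_mail("Free money offer", DATABASE)
ham_score = score_mail("Lunch meeting setup", DATABASE)
```

With these made-up values, the first mail scores well above the cut-off and the second well below it.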

Although this technique sounds rather simple, it produces surprisingly good results. In fact, after training with large samples of both legitimate e-mails and spam, Bayesian filters have been found capable of achieving a staggering 99% accuracy in identifying spam! As I write this article, the Bayesian spam filtering algorithm is still being improved upon, and some filters already claim over 99.8% accuracy!



 
   

 


Copyright © Tech ARP.com. All rights reserved.