How Do Bayesian Spam Filters Work?
There are great articles that delve deep into the logic and mathematics of how these filters work, but personally, the math looks like an alien language to me! All I need to know is that it works, and it works REALLY WELL!
Simply put, this is how a Bayesian spam filter works...
Initial Training
Let's assume that you have identified a bunch of good e-mails and a bunch of spam e-mails. You start by training the filter to identify all the words that are in the good e-mails and give each of them a spam probability of 0. After working through all the good e-mails, the database will contain a list of "good" words which have a zero (0) probability of being used in spam mails.
Then, the filter scans through all the spam mails to score the words found in them. If a word does not exist in the list of good words, it is given a score of 1, or a 100% probability of being used in spam. If the word is also found in the list of good words, the filter gives it a probability score based on the number of times it appears in the good e-mails and in the spam mails.
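To make the two training passes a little more concrete, here is a minimal Python sketch. It is only an illustration, not how any particular filter does it: the names `tokenize` and `train` are made up for this example, the tokenizer is a crude split on whitespace, and the score for a word found in both piles is assumed to be a plain ratio of the counts. Real filters usually also account for the size of each pile and avoid scores of exactly 0 or 1.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()

def train(good_mails, spam_mails):
    """Return a dict mapping each word to its estimated probability of being used in spam."""
    ham_counts = Counter(w for mail in good_mails for w in tokenize(mail))
    spam_counts = Counter(w for mail in spam_mails for w in tokenize(mail))

    spam_prob = {}
    for word in ham_counts.keys() | spam_counts.keys():
        ham, spam = ham_counts[word], spam_counts[word]
        # Only in good mail -> 0.0, only in spam -> 1.0,
        # in both -> an assumed simple ratio of the two counts.
        spam_prob[word] = spam / (ham + spam)
    return spam_prob

# Tiny made-up training piles, just to show the shape of the result.
good = ["meeting at noon", "lunch meeting tomorrow"]
spam = ["free money now", "free meeting offer"]
db = train(good, spam)
```

Here "noon" appears only in the good pile and "free" only in the spam pile, so they land at the two extremes, while "meeting" appears in both and ends up somewhere in between.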
[Figure: sample filter output showing a spam score of 0.999103 and a word list with the columns: word, spamprob, #ham, #spam]
At the end of this initial training, the database will now have a list of words with their probability of usage in spam mails. Now, what happens next?
After The Initial Training
Whenever a new mail comes in, the filter uses the new database to classify it. All it does is run through all the words in the e-mail and then give each word a score based on values listed in the database. Once that's done, the filter averages the scores to arrive at an overall score for the e-mail.
If the mail contains a lot of words that have a high probability of being used in spam mails, the overall score will be high. This identifies the e-mail as spam mail.
Of course, if the e-mail contains words that have a low probability of being used in spam mails, then the overall score will be low. This identifies the e-mail as being more likely to be a legitimate e-mail.
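The scoring step described above can be sketched in a few lines of Python. Treat this as a rough illustration under some assumptions: the function name `score_mail`, the tiny `word_db` table, and the neutral score of 0.5 for words the filter has never seen are all invented for this example. It also uses the simple averaging described here, whereas real Bayesian filters generally combine the per-word probabilities with a formula derived from Bayes' rule rather than a plain average.

```python
def score_mail(text, spam_prob, unknown=0.5):
    """Average the per-word spam probabilities of a mail.

    Words missing from the database get the `unknown` score
    (0.5 is an assumed neutral value, not taken from any real filter).
    """
    words = text.lower().split()
    if not words:
        return unknown  # nothing to judge, so stay neutral
    scores = [spam_prob.get(word, unknown) for word in words]
    return sum(scores) / len(scores)

# Hypothetical database of word -> spam probability, as produced by training.
word_db = {"free": 1.0, "money": 1.0, "meeting": 0.25, "noon": 0.0}

spam_score = score_mail("free money", word_db)    # high score, looks like spam
ham_score = score_mail("meeting noon", word_db)   # low score, looks legitimate
```

Where to draw the line between "high" and "low" is up to the filter. A cutoff such as 0.5 would be one possible choice, but that value is an assumption for illustration only.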
Although this technique sounds rather simple, it produces surprisingly good results. In fact, after being trained with large samples of both legitimate e-mails and spam, Bayesian filters have been found capable of achieving a staggering 99% accuracy in identifying spam! As I write this article, the Bayesian spam filtering algorithm is still being improved upon, and some filters are already claiming over 99.8% accuracy!