How AI weeds the spam out of our inboxes
Of more than 300 billion emails sent every day, at least half are spam. Email providers have the huge task of filtering out spam and making sure their users receive the messages that matter.
Spam detection is messy. The line between spam and non-spam messages is fuzzy, and the criteria change over time. Of the various efforts to automate spam detection, machine learning has so far proven to be the most effective, and it is the approach email providers favor. Although we still see spammy emails, a quick look at the junk folder will show how much spam gets weeded out of our inboxes every day thanks to machine learning algorithms.
How does machine learning determine which emails are spam and which are not? Here’s an overview of how machine learning-based spam detection works.
Spam email comes in different flavors. Many are just annoying messages aiming to draw attention to a cause or spread false information. Some of them are phishing emails with the intent of luring the recipient into clicking on a malicious link or downloading malware.
The one thing they have in common is that they are irrelevant to the needs of the recipient. A spam-detector algorithm must find a way to filter out spam while avoiding flagging authentic messages that users want to see in their inbox. And it must do so in a way that keeps up with evolving trends such as panic caused by pandemics, election news, and sudden interest in cryptocurrencies.
Static rules can help. For instance, too many BCC recipients, very short body text, and all-caps subjects are some of the hallmarks of spam emails. Likewise, some sender domains and email addresses can be associated with spam. But for the most part, spam detection relies on analyzing the content of the message.
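As a rough illustration of what such static rules could look like in code, here is a minimal Python sketch. The `Email` class, the thresholds, and the blocked domain are all hypothetical and chosen only for the example; real filters tune these values and combine many more signals.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Email:
    subject: str
    body: str
    sender: str
    bcc: List[str] = field(default_factory=list)


# Hypothetical thresholds and blocklist, chosen only for illustration.
MAX_BCC = 20
MIN_BODY_CHARS = 30
BLOCKED_DOMAINS = {"spam.example"}


def rule_hits(email: Email) -> List[str]:
    """Return the names of the static rules this email triggers."""
    hits = []
    if len(email.bcc) > MAX_BCC:
        hits.append("too_many_bcc")
    if len(email.body.strip()) < MIN_BODY_CHARS:
        hits.append("short_body")
    # Only letters count, so digits and punctuation do not affect the check.
    letters = [c for c in email.subject if c.isalpha()]
    if letters and all(c.isupper() for c in letters):
        hits.append("all_caps_subject")
    domain = email.sender.rsplit("@", 1)[-1].lower()
    if domain in BLOCKED_DOMAINS:
        hits.append("blocked_domain")
    return hits
```

A message that triggers several rules at once is a stronger spam candidate than one that triggers a single rule, but none of these checks looks at what the message actually says.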
Naïve Bayes machine learning
Machine learning algorithms use statistical models to classify data. In the case of spam detection, a trained machine learning model must be able to determine whether the sequence of words found in an email is closer to those found in spam emails or safe ones.
Different machine learning algorithms can detect spam, but one that has gained appeal is the “naïve Bayes” algorithm. As the name implies, naïve Bayes is based on “Bayes’ theorem,” which describes the probability of an event based on prior knowledge.
The reason it is called “naïve” is that it assumes features of observations are independent. Let’s say you want to use naïve Bayes machine learning to predict whether it will rain or not. In this case, your features could be temperature and humidity, and the event you’re predicting is rainfall. Naïve Bayes treats the two features as independent of each other given the outcome, even though in reality temperature and humidity are related.
Naïve Bayes is a very efficient and fast machine learning algorithm, which has contributed to its popularity in many fields.
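To make the rain example concrete, here is a small worked calculation in Python. Every probability below is a made-up number for illustration, not real weather data, and the variable names are hypothetical.

```python
# Made-up probabilities for illustration only; they are not real weather data.
p_rain = 0.3            # prior P(rain)
p_dry = 1 - p_rain      # prior P(no rain)

# Likelihood of each observation given the outcome.
p_cool_given_rain, p_cool_given_dry = 0.7, 0.4
p_humid_given_rain, p_humid_given_dry = 0.9, 0.2

# The "naive" step: treat temperature and humidity as independent given the
# outcome, so their likelihoods simply multiply.
score_rain = p_rain * p_cool_given_rain * p_humid_given_rain
score_dry = p_dry * p_cool_given_dry * p_humid_given_dry

# Bayes' theorem: normalise by the total probability of the observations.
p_rain_given_obs = score_rain / (score_rain + score_dry)
print(f"P(rain | cool, humid) = {p_rain_given_obs:.3f}")
```

The multiplication is what keeps the algorithm cheap: each feature contributes one factor, and no interaction between features has to be modeled.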
In the case of spam detection, things get a bit more complicated. Our target variable is whether a given email is “spam” or “not spam” (also called “ham”). The features are the words or word combinations found in the email’s body. In a nutshell, we want to calculate the probability that an email message is spam based on its text.
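One simple way to turn a message body into word features is to lowercase it, split it into words, and count them. The sketch below assumes this "bag of words" approach; the function name `word_features` and the sample message are hypothetical.

```python
import re
from collections import Counter


def word_features(body: str) -> Counter:
    """Lowercase the text, split it into words, and count each one."""
    words = re.findall(r"[a-z']+", body.lower())
    return Counter(words)


features = word_features("Win a FREE prize! Claim your free prize now.")
print(features)
```

Note that this representation throws away word order, which is exactly where the independence assumption starts to matter.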
The catch here is that our features are not necessarily independent. For instance, consider the terms “grilled,” “cheese,” and “sandwich.” They can have different meanings depending on whether they appear successively or in different parts of the message. Another example is the pair of words “not” and “interesting.” In this case, the meaning can be completely different depending on where they appear in the message. But even though feature independence is complicated in text data, the naïve Bayes classifier has proven to be efficient in natural language processing tasks if you configure it properly.
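One common way to keep some of that context, offered here as an assumption rather than as the specific configuration the article has in mind, is to use runs of consecutive words (n-grams) as features alongside single words. A minimal sketch, with a hypothetical helper `ngrams`:

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "this offer is not interesting".split()
bigrams = ngrams(words, 2)
print(bigrams)
```

With bigrams, “not interesting” becomes a single feature, so the classifier can learn that it signals something different from “interesting” on its own.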
Spam detection is a supervised machine learning problem. This means you must provide your machine learning model with a set of examples of spam and ham messages and let it find the relevant patterns that separate the two different categories.
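The following is a minimal from-scratch sketch of that supervised setup: a naïve Bayes classifier trained on four invented messages. The training data, function names, and the use of add-one (Laplace) smoothing are assumptions made for illustration; production systems learn from far larger labeled corpora and are considerably more elaborate.

```python
import math
from collections import Counter

# Tiny invented training set, for illustration only.
TRAINING = [
    ("win a free prize now", "spam"),
    ("claim your free money", "spam"),
    ("meeting moved to monday", "ham"),
    ("lunch at noon on monday", "ham"),
]


def train(examples):
    """Count words per label and how many messages carry each label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts


def spam_probability(text, word_counts, label_counts):
    """Return P(spam | text) using naive Bayes with add-one smoothing."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_msgs = sum(label_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Start from the log prior, then add one log likelihood per word.
        score = math.log(label_counts[label] / total_msgs)
        for word in text.split():
            count = word_counts[label][word] + 1  # add-one smoothing
            score += math.log(count / (total_words + len(vocab)))
        log_scores[label] = score
    # Convert the two log scores back into a normalised probability.
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp["spam"] / (exp["spam"] + exp["ham"])


word_counts, label_counts = train(TRAINING)
print(spam_probability("free prize", word_counts, label_counts))
print(spam_probability("monday meeting", word_counts, label_counts))
```

Working with logarithms avoids multiplying many small probabilities together, and the smoothing keeps a word that never appeared under one label from forcing that label's probability to zero.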
Most email providers have their own vast data sets of labeled emails. For instance, every time you flag an email as spam in your Gmail account, you’re providing Google with training data for its machine learning algorithms. (Note: Google’s spam detection algorithm is much more complicated than what we’re examining here, and the company has mechanisms to prevent abuse of its “Report Spam” feature.)
There are some open-source data sets, such as the spambase data set of the University of California, Irvine, and the Enron spam data set. But these data sets are for educational and test purposes and aren’t of much use in creating production-level machine learning models.
Companies that host their own email servers can easily create specialized data sets that tune their machine learning models to the specific language of their line of work. For instance, the data set of a company that provides financial services will look much different from that of a construction company.
Training the machine learning model
This article was originally published by Ben Dickson on TechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the evil side of technology, the darker implications of new tech and what we need to look out for.
Published January 3, 2021 — 22:00 UTC