Using Machine Learning to Stop Fake News

By Bill Schmarzo, CTO, Dell EMC Services (aka “Dean of Big Data”), July 17, 2017

Given all the brilliant things that are happening today with machine learning and artificial intelligence, I just don’t understand why “fake news” is still an issue. I think the solution is right in front of us; that is, if social media networks are really serious about addressing this problem.

Facebook is one of the biggest culprits in tolerating fake news, and that probably has a lot to do with the “economics of social engagement.” An article titled “Future of Social Media” summarizes the challenge nicely:

“While it’s great that everyone and her brother has access to create content online, offering a more diverse and thriving online market, this also generates stronger competition for your content to break through the clutter and be seen.

In fact, there will be a time in which the amount of content internet users can consume will be outweighed by the amount of content produced. Schaefer calls this “Content Shock” which, unfortunately, is uneconomical.”

Figure 1 shows the area of “Content Shock,” when the ability to create content outstrips the ability for humans to consume it.

Figure 1: Economics of Content and “Content Shock”

The article recommends that you “create content that will stand out” in order to draw attention and create engagement. Well, nothing draws attention and creates engagement like “fake news.” Here are some examples of fake news articles and the number of Facebook engagements each of them drove[1]:

  • “Pope Francis shocks world, endorses Donald Trump for president” – 960,000 Facebook engagements
  • “WikiLeaks confirms Hillary sold weapons to ISIS … Then drops another bombshell” – 789,000 Facebook engagements
  • “FBI agent suspected in Hillary email leaks found dead in apartment murder-suicide” – 567,000 Facebook engagements

That’s an awful lot of Facebook engagements with news that isn’t true, but the “news” certainly does “stand out” in the crowded content space and it certainly does drive engagement.

Solving the Fake News Problem

So assuming that the social media networks truly are motivated to solve the “fake news” problem, here is how I would do it.

  • Step 1: Leverage crowdsourcing to flag potential fake news articles. Social media networks could create a “Fake News” button that flags potential fake news, like Yahoo Mail does today to flag potential spam (see Figure 2).

Figure 2:  Flagging Potential Email Spam in Yahoo Mail

  • Step 2: Human reviewers would need to review the flagged “Fake News” articles to determine which ones are fake and which are not. Maybe the reviewers could even add metadata that captures information such as “degree of fakeness” (i.e., is it an outright lie or just a slight twisting of the facts) and “severity of fakeness” (e.g., fake news about a celebrity isn’t nearly as severe as fake news about a political candidate. Heck, there are certain celebrities whose fame seems to be based entirely upon fake news… the Kardashians?).
  • Step 3: Apply Supervised Machine Learning algorithms against the flagged and reviewed articles to find (quantify) correlations and predictors (i.e., combinations of words, phrases and topics) of “fake news” outcomes. Then use the resulting “fake news” models on new articles to score each article’s “level of fakeness.” Remember, Supervised Machine Learning algorithms identify and quantify relationships between potential predictive variables and metrics against known outcomes (e.g., spam, fraudulent transaction, machine failure, web click, purchase transaction) gathered from historical (training) data sets, and then apply the resulting models to new data sets.
  • Step 4: Create “Reader Credibility Scores” to rank the credibility of people flagging fake news articles. It is critical to create reader credibility scores (think FICO score or Uber driver and passenger scores) to measure the integrity of folks who are flagging potential fake news (as well as those who are promoting fake news). That will help to identify “trolls[2]” who are just trying to perpetuate the fake stories or cast doubt on real news.
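
To make Step 3 a bit more concrete, here is a rough sketch of a supervised text classifier that learns from reviewer-labeled articles and then scores a new headline. I'm using a small Naive Bayes model purely for illustration; the steps above don't prescribe a particular algorithm, and a real system would use far more data and richer features. The class name `NaiveBayesFakenessModel`, the method `fakeness_score`, and the four training headlines are all made up for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a real system would do much more)."""
    return text.lower().split()


class NaiveBayesFakenessModel:
    """Tiny multinomial Naive Bayes over word counts. Illustrative only."""

    def fit(self, articles, labels):
        # Labels come from the human reviewers in Step 2: True = fake.
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter(labels)
        for text, label in zip(articles, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_score(self, tokens, label):
        counts = self.word_counts[label]
        total = sum(counts.values())
        prior = self.doc_counts[label] / sum(self.doc_counts.values())
        score = math.log(prior)
        for tok in tokens:
            if tok in self.vocab:  # words never seen in training are skipped
                # Laplace (add-one) smoothing avoids zero probabilities.
                score += math.log((counts[tok] + 1) / (total + len(self.vocab)))
        return score

    def fakeness_score(self, text):
        """Return the model's estimate of P(fake | text), between 0 and 1."""
        tokens = tokenize(text)
        log_fake = self._log_score(tokens, True)
        log_real = self._log_score(tokens, False)
        # Subtract the max before exponentiating for numerical stability.
        m = max(log_fake, log_real)
        p_fake = math.exp(log_fake - m)
        p_real = math.exp(log_real - m)
        return p_fake / (p_fake + p_real)


# Hypothetical reviewer-labeled training data (invented for this sketch).
training_texts = [
    "shocking bombshell celebrity endorses candidate",
    "you will not believe this shocking secret",
    "city council approves annual budget",
    "central bank holds interest rates steady",
]
training_labels = [True, True, False, False]

model = NaiveBayesFakenessModel().fit(training_texts, training_labels)
print(model.fakeness_score("shocking bombshell secret"))
print(model.fakeness_score("council approves budget"))
```

The score is a probability-like number rather than a yes/no verdict, which fits the idea of a “level of fakeness” instead of an outright block.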

Amazon already supports the flagging of potential “Trolls” and “fake reviews” in their customer reviews (see Figure 3).

Figure 3:  Flagging Fake Reviews
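
One simple way to think about a reader credibility score is as the share of a reader's past flags that the human reviewers ended up confirming. Here is a minimal sketch under that assumption. The names `ReaderRecord`, `reader_credibility` and `weighted_flag_count` are hypothetical, as is the choice to start every new reader at a neutral 0.5 by adding one imaginary confirmed flag and one imaginary rejected flag.

```python
from dataclasses import dataclass


@dataclass
class ReaderRecord:
    confirmed_flags: int = 0  # flags that reviewers later judged to be fake news
    rejected_flags: int = 0   # flags that reviewers judged to be real news


def reader_credibility(record, prior_confirmed=1.0, prior_rejected=1.0):
    """Smoothed fraction of this reader's flags that reviewers confirmed."""
    confirmed = record.confirmed_flags + prior_confirmed
    total = confirmed + record.rejected_flags + prior_rejected
    return confirmed / total


def weighted_flag_count(records):
    """Count flags on an article, weighting each flag by its reader's credibility."""
    return sum(reader_credibility(r) for r in records)


# Made-up readers for illustration.
newcomer = ReaderRecord()
careful_reader = ReaderRecord(confirmed_flags=9, rejected_flags=1)
troll = ReaderRecord(confirmed_flags=0, rejected_flags=8)

print(reader_credibility(newcomer))
print(reader_credibility(careful_reader))
print(reader_credibility(troll))
```

With a weighting like this, flags from two careful readers count for more than flags from five trolls, so a coordinated group casting doubt on real news has less pull on what gets sent to the reviewers.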

  • Step 5: Create “Publisher Credibility Scores” that measure the credibility and reliability of each publisher or source of an article. This score would be composed of the results of the fakeness analysis (how many fake articles is that publisher responsible for) but could also include other variables such as the number of employees working for the publisher and its tenure in the business (e.g., the Wall Street Journal has around 3,600 employees and has been publishing since 1889, versus Liberty Writers News, which has 2 employees and has been publishing only since 2015). Heck, there is even a Wikipedia page, “List of fake news websites,” that lists known fake news sites, such as Liberty Writers News, American News, Disclose TV, Drudgereport.com and World Truth TV.
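
A publisher score along these lines could be a weighted blend of track record, size and tenure. The sketch below is one possible shape for it, not a tested formula: the weights, the caps, the `PublisherProfile` fields and both example publishers (including their review counts) are placeholders I made up, loosely echoing the large-and-old versus tiny-and-new contrast above.

```python
import math
from dataclasses import dataclass


@dataclass
class PublisherProfile:
    name: str
    articles_reviewed: int   # articles that went through human review (Step 2)
    articles_fake: int       # of those, how many the reviewers judged fake
    employees: int
    years_publishing: float


def publisher_credibility(p, w_track=0.7, w_size=0.15, w_tenure=0.15):
    """Blend track record, size and tenure into a 0..1 score (weights are guesses)."""
    # Smoothed share of reviewed articles that were NOT fake.
    track_record = 1.0 - (p.articles_fake + 1) / (p.articles_reviewed + 2)
    # Log scale so 3,600 employees is not 1,800 times better than 2;
    # the term saturates at roughly 10,000 employees.
    size = min(1.0, math.log10(p.employees + 1) / 4.0)
    # Linear in years of publishing, capped at 50 years.
    tenure = min(1.0, p.years_publishing / 50.0)
    return w_track * track_record + w_size * size + w_tenure * tenure


# Hypothetical publishers with invented review counts.
established = PublisherProfile("established_daily", 1000, 1, 3600, 100)
newcomer_site = PublisherProfile("new_site", 50, 40, 2, 2)

print(publisher_credibility(established))
print(publisher_credibility(newcomer_site))
```

Track record gets most of the weight here on purpose: a small, young publisher with a clean record should not be punished much for being small and young.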

Freedom of Speech and Type I/Type II Errors

Machine Learning could certainly help to mitigate and flag fake news, but probably cannot and should not even try to eliminate it entirely. Why? It’s the First Amendment of the Constitution and it’s called Freedom of Speech.

One important consideration as social media organizations look to squelch fake news is to not violate Freedom of Speech. So instead of an outright deletion of questionable publications (other than for pornography, libel or hate crime reasons), it might be better for the social media sites to use some sort of “Degrees of Truth” indicator that could accompany each publication or article. These indicators might look something like Figure 4.

Figure 4:  Degrees of Truthfulness Indicators

The cost to society of blocking potentially valid news (a false positive, if “fake” is the positive class) greatly outweighs the cost of letting a few fake news articles get published (a false negative). So one will need to err on the side of allowing some level of fake news to ensure that one is not blocking real (though maybe controversial) news. See my blog “Understanding Type I and Type II Errors” to learn more about the potential costs and liabilities associated with Type I and Type II errors.
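
Erring on the side of allowing some fake news can be expressed as a decision threshold. If “fake” is treated as the positive class, wrongly labeling real news is a false positive and missing a fake article is a false negative. Labeling an article is only worth it when the expected cost of missing it exceeds the expected cost of wrongly labeling it, which works out to p_fake > cost_fp / (cost_fp + cost_fn). Here is a small sketch; the function names and the 9-to-1 cost ratio are assumptions chosen for illustration, not measured values.

```python
def flag_threshold(cost_false_positive, cost_false_negative):
    """Probability above which labeling an article as fake has lower expected cost."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def should_flag(p_fake, cost_false_positive, cost_false_negative):
    """Decide whether to label an article, given the model's fakeness score."""
    return p_fake > flag_threshold(cost_false_positive, cost_false_negative)


# Assume wrongly labeling real news is 9 times as costly as missing a fake story.
COST_FP = 9.0
COST_FN = 1.0

print(flag_threshold(COST_FP, COST_FN))
print(should_flag(0.80, COST_FP, COST_FN))
print(should_flag(0.95, COST_FP, COST_FN))
```

With equal costs the threshold sits at 0.5. The more costly a false positive is judged to be, the closer the threshold moves to 1.0 and the more fake news slips through unlabeled.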

Machine Learning to End Fake News

Ending Fake News seems like the perfect application of machine learning. Organizations like Yahoo, Google and Microsoft have been using machine learning for years now to catch spam (see the article “Google Says Its AI Catches 99.9 Percent Of Gmail Spam”). And companies like McAfee and Symantec employ machine learning to catch viruses (see the article “Malware Detection with Machine Learning Methods”).

Fake news looks a lot like spam and a virus to me. Should be an easy problem to solve, if one really wants to.

[1] http://www.cnbc.com/2016/12/30/read-all-about-it-the-biggest-fake-news-stories-of-2016.html

[2] A troll is a person who sows discord on the Internet by starting arguments or upsetting people, by posting inflammatory, extraneous, or off-topic messages with the intent of provoking readers into an emotional response or of otherwise disrupting normal, on-topic discussion. https://en.wikipedia.org/wiki/Internet_troll

About Bill Schmarzo


CTO, Dell EMC Services (aka “Dean of Big Data”)

Bill Schmarzo, author of “Big Data: Understanding How Data Powers Big Business” and “Big Data MBA: Driving Business Strategies with Data Science”, is responsible for setting strategy and defining the Big Data service offerings for Dell EMC’s Big Data Practice. As a CTO within Dell EMC’s 2,000+ person consulting organization, he works with organizations to identify where and how to start their big data journeys. He’s written white papers, is an avid blogger and is a frequent speaker on the use of Big Data and data science to power an organization’s key business initiatives. He is a University of San Francisco School of Management (SOM) Executive Fellow where he teaches the “Big Data MBA” course. Bill also just completed a research paper on “Determining The Economic Value of Data”. Onalytica recently ranked Bill as #4 Big Data Influencer worldwide.

Bill has over three decades of experience in data warehousing, BI and analytics. Bill authored the Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements. Bill serves on the City of San Jose’s Technology Innovation Board, and on the faculties of The Data Warehouse Institute and Strata.

Previously, Bill was vice president of Analytics at Yahoo where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of “actionable insights” through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing and sales of their industry-defining analytic applications.

Bill holds a Master of Business Administration from the University of Iowa and a Bachelor of Science degree in Mathematics, Computer Science and Business Administration from Coe College.

2 thoughts on “Using Machine Learning to Stop Fake News”

  1. Journalism was my passion until the 24-hour-forced news cycle produced writers that have no respect for the consequences of their written word, on not only the people they are writing about, but as to the continued effect on people as their shared content never dies. As an employee in the data field I do believe your solution is innovative and workable – if responsible non-biased people take on traditional editorial responsibilities. It will not work with editors that wish to continue to drive their own agendas with the power and knowledge to make themselves heard. Without the technology available to sort Big Data and its various components and without a non-biased representative of the population, online news will never regain the respect of the 4th Estate – and these social media sites wishing to become news outlets have a long road ahead to show they can develop something other than, at best, the “Editorial Section” and at worst “Yellow Journalism.”

    • Thanks, Jennifer, for your heartfelt words. It is sad to see that an honorable career in journalism is being driven to extinction by unscrupulous editors who care more about views and eyeballs than the truth. Data and analytics can help to flag articles and posts that might be dubious, but data and analytics can do nothing about people who are in search of stories – no matter the validity of those stories – that support their preconceived views and beliefs.

      Maybe it’s always been like that. You mention “Yellow Journalism”, which has been around for over a hundred years now. I guess as long as people are too lazy to seek the truth, then “yellow journalism” is here to stay.

      Thanks again for taking the time to share your story.