Big Data

Management Challenge: P-hacking or “The Danger of Alternative Facts”

Bill Schmarzo By Bill Schmarzo CTO, Dell EMC Services (aka “Dean of Big Data”) May 23, 2017

Mark Twain famously popularized the line “There are three kinds of lies: lies, damned lies, and statistics.” Maybe if Mark Twain were alive today, he’d add “alternative facts” to that list. The BusinessWeek article “Lies, Damn Lies, and Financial Statistics” reminds us of the management challenge posed by statisticians or “data scientists” who manipulate the data to create “pseudo-facts” that can lead to sub-optimal or even dangerously wrong decisions.

My University of San Francisco class recently did a hackathon with a local data science company. It was insightful for all involved (and I learned of a new machine learning tool – BigML – which I will discuss in a future blog). One team was trying to prove that housing prices in Sacramento were on the same pricing trajectory (accelerating upward) as the Brooklyn and Oakland housing markets, cities next to major technology hubs. One of the team members had just bought a house in Sacramento and was eager to prove that her investment was a sound one. Unfortunately, that tainted objective caused her to ignore some critical metrics indicating that Sacramento was not on the same pricing trajectory. She got the model results that she was seeking, but she did not necessarily get the truth.

Statisticians know that by selectively munging, binning, constraining, cleansing and sub-segmenting one’s data set, they can get the data to tell almost any story or validate almost any “fact.” Plus with the advent of seductive data visualization tools, one can easily distract the reader from the real facts and paint a visually compelling but totally erroneous story from the data (see Figure 1).
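To make the sub-segmenting point concrete before we get to the visualization angle in Figure 1, here is a minimal sketch in Python. The data is entirely synthetic (a hypothetical two-group data set invented purely for illustration), but it shows how reporting only a hand-picked subgroup can flip the apparent story:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two synthetic subgroups: within each group the trend is positive,
# but the group offsets make the pooled trend negative (a Simpson's
# paradox style data set, made up purely for illustration).
x_a = rng.uniform(0, 5, 200)
y_a = 10 + 0.8 * x_a + rng.normal(0, 1, 200)   # subgroup A: high baseline
x_b = rng.uniform(5, 10, 200)
y_b = 2 + 0.8 * x_b + rng.normal(0, 1, 200)    # subgroup B: low baseline

x_all = np.concatenate([x_a, x_b])
y_all = np.concatenate([y_a, y_b])

# The full data set tells one story; a selectively reported subgroup tells another.
print(f"pooled correlation:     {np.corrcoef(x_all, y_all)[0, 1]:+.2f}")
print(f"subgroup A correlation: {np.corrcoef(x_a, y_a)[0, 1]:+.2f}")
```

Report either number on its own and you can tell two very different stories about exactly the same data set.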

Figure 1:  Distorting The “Truth” with Data Visualization

The Weapon of Alternative Facts: p-Hacking

This flawed statistical behavior of staging and presenting the data in a way that supports one’s preconceived answer has a name: p-hacking. As described in the BusinessWeek article, p-hacking is a reference to the p-value, which is a measure of statistical significance. To quote Andrew Lo, director of MIT’s Laboratory for Financial Engineering: “The more you search over the past, the more likely it is you are going to find exotic patterns that you happen to like or focus on. Those patterns are least likely to repeat.”

What is the ‘P-Value’[1]? The p-value is the level of marginal significance within a statistical hypothesis test; it represents the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. A small p-value (typically less than 0.05) is taken as stronger evidence against the null hypothesis and in favor of the alternative hypothesis. Hacking a p-value can sometimes happen inadvertently through a modeling pitfall known as overfitting.
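For readers who want to see a p-value in action, here is a minimal sketch using Python and SciPy. The two samples are simulated (hypothetical numbers, not real data), and because they are drawn from the same distribution, the test should usually not find a “significant” difference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two simulated samples drawn from the SAME distribution, so the null
# hypothesis ("no difference in means") is actually true.
group_a = rng.normal(loc=100.0, scale=15.0, size=50)
group_b = rng.normal(loc=100.0, scale=15.0, size=50)

# A two-sample t-test returns the test statistic and the p-value.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# Conventional reading: p < 0.05 would count as evidence against the null
# hypothesis. Here we expect p > 0.05 most of the time, because the two
# groups really do come from the same distribution.
```

Re-run it enough times, though, and roughly 1 in 20 runs will dip below 0.05 purely by chance, which is exactly the door that p-hacking walks through.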

Overfitting Challenges

Understanding the concept of overfitting is important to making sure that one is not taking inappropriate liberties with the data. Avoiding inadvertent p-hacking requires 1) an understanding of overfitting and 2) a touch of common sense. From the BusinessWeek article:

“An abundance of computing power makes it possible to test thousands, even millions, of trading strategies. [For example], the standard method is to see how the [trading] strategy would have done if it had been used during the ups and downs of the market over, say, the past 20 years. This is called backtesting. As a quality check, the technique is then tested on a separate set of “out-of-sample” data—i.e., market history that wasn’t used to create the technique. In the wrong hands, though, backtesting can go horribly wrong. It once found that the best predictor of the S&P 500, out of all the series in a batch of United Nations data, was butter production in Bangladesh.”
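The danger described in that quote is easy to reproduce with synthetic data. The toy sketch below (my own illustration, not a real trading system) generates a purely random “market,” backtests thousands of random buy/flat rules against the first half of the history, keeps the best one, and then applies that same rule to the held-out second half:

```python
import numpy as np

rng = np.random.default_rng(7)

# A purely random "market": 1,000 daily returns with no exploitable structure.
returns = rng.normal(loc=0.0, scale=0.01, size=1000)
in_sample, out_of_sample = returns[:500], returns[500:]

best_rule, best_backtest = None, -np.inf
for _ in range(5000):
    # Each candidate "strategy" is just a random day-by-day long/flat signal.
    rule = rng.integers(0, 2, size=500)
    backtest_return = np.sum(rule * in_sample)
    if backtest_return > best_backtest:
        best_backtest, best_rule = backtest_return, rule

# The winning rule looks great in-sample, purely by chance...
print(f"best in-sample return:   {best_backtest:+.2%}")
# ...but the same signal applied to the out-of-sample period does nothing.
print(f"same rule out-of-sample: {np.sum(best_rule * out_of_sample):+.2%}")
```

The winning rule typically looks spectacular in-sample and does roughly nothing out-of-sample: the butter-production-in-Bangladesh effect in a dozen lines of code.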

So what is this evil thing called overfitting? Overfitting occurs when the analytic model is excessively complex, such as having too many parameters or variables in the model relative to the number of observation points.
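Here is a minimal numeric sketch of that definition (NumPy only, with made-up data): given just 10 noisy observations of a simple straight-line relationship, a 10-parameter (degree-9) polynomial can reproduce the training data almost perfectly, not because it found the real relationship, but because it has enough flexibility to memorize the noise:

```python
import numpy as np

rng = np.random.default_rng(1)

# 10 noisy observations of a simple straight-line relationship.
x = np.linspace(0, 1, 10)
y = 2.0 * x + rng.normal(0, 0.2, size=10)

for degree in (1, 9):
    coeffs = np.polyfit(x, y, degree)          # fit a degree-d polynomial
    fitted = np.polyval(coeffs, x)
    train_mse = np.mean((y - fitted) ** 2)     # in-sample mean squared error
    print(f"degree {degree}: training MSE = {train_mse:.5f}")

# The degree-9 polynomial has 10 coefficients for 10 observations, so its
# training error is essentially zero: it has memorized the noise.
```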

I love the overfitting example shown in Figure 2. In Figure 2, one is trying to fit the different shapes on the left side of the figure into whichever container leaves the least unused space. The shapes fit into the right-most container with the lowest amount of unused space[2].

Figure 2:  Fitting Shapes Into Containers

However, when a new shape is added in Figure 3 (a shape significantly different from the shapes used to select the container), the originally chosen container no longer works, because the container was over-fitted to just those shapes that were originally available.

Figure 3:  Overfitting

Once new data gets added, especially data that differs from the original data in some significant way (e.g., different time periods, different geographies, different products, different customers), there is a real risk that the model created with the original set of data simply doesn’t work on data that is not exactly like the original set.
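The standard guard against this, and a quick way to check whether you have inadvertently p-hacked a model, is to hold out data the model never saw during fitting and compare errors. Continuing the polynomial sketch above (again with synthetic data), the pattern to watch for is a tiny in-sample error paired with a much larger out-of-sample error:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    # The same underlying process generates both the training and the new data.
    x = rng.uniform(0, 1, n)
    return x, 2.0 * x + rng.normal(0, 0.2, size=n)

x_train, y_train = make_data(10)    # the data the model was built on
x_new, y_new = make_data(100)       # "new" data the model has never seen

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    mse_new = np.mean((y_new - np.polyval(coeffs, x_new)) ** 2)
    print(f"degree {degree}: training MSE {mse_train:.4f}, new-data MSE {mse_new:.4f}")
```

The simple model’s two errors stay comparable, while the degree-9 model is near-perfect on the 10 training points and typically far worse on the hold-out set; it was over-fitted for just those shapes that were originally available.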

P-hacking Summary

Okay, this is the most nerdy of jokes, but Randall Munroe’s web comic xkcd captures the over-fitting management challenge creatively in a comic of a woman claiming that jelly beans cause acne (see Figure 4).

Figure 4: The Most Nerdy Overfitting Example

When a statistical test shows no evidence of an effect, the woman revises her claim: acne must depend on the flavor of the jelly bean. So the statistician tests 20 flavors. Nineteen show nothing. But purely by chance, the green jelly beans show a statistically significant link between jelly bean consumption and acne breakouts. The final panel of the cartoon is the front page of a newspaper: “Green Jelly Beans Linked to Acne! 95% Confidence. Only 5% Chance of Coincidence!”
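The arithmetic behind the joke is worth spelling out. If each test uses the conventional 5% significance level and there is truly no effect, the chance that at least one of 20 tests comes back “significant” is 1 minus 0.95 to the 20th power, or roughly 64%. A small simulation (entirely made-up jelly bean data, with no real effect built in) shows it happening:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(20)

alpha, flavors = 0.05, 20

# Probability that at least one of 20 independent tests is "significant"
# at the 5% level when there is no real effect anywhere.
print(f"P(at least one false positive) = {1 - (1 - alpha) ** flavors:.2f}")  # ~0.64

# Simulate it: 20 flavors, none of which actually affects acne scores.
flagged = []
for flavor in range(flavors):
    eats_jelly_beans = rng.normal(50, 10, size=30)   # acne score, jelly bean eaters
    control_group = rng.normal(50, 10, size=30)      # acne score, non-eaters
    _, p = stats.ttest_ind(eats_jelly_beans, control_group)
    if p < alpha:
        flagged.append(flavor)

print(f"flavors flagged as 'significant' purely by chance: {flagged}")
```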

Yeah, it’s nerdy, but it also makes this final critical point: use your common sense to contemplate whether the correlation really exists or not. There are plenty of examples of distorting the data and applying statistics to show the most absurd correlations (see Figure 5).

I mean, look at the correlation for that relationship (0.993)!  The relationship between United States spending on the space program and suicides must be true!
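That is a classic spurious correlation: two series that merely drift in the same direction over time will produce an impressively large correlation coefficient even when they have nothing to do with each other. A minimal sketch with made-up numbers (the two series are hypothetical stand-ins for “space spending” and any other upward-trending statistic):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two made-up annual series that both drift upward over 15 "years"
# but have nothing to do with each other.
years = np.arange(15)
series_a = 100 + 5 * years + rng.normal(0, 3, size=15)    # hypothetical spending figure
series_b = 40 + 2 * years + rng.normal(0, 1.5, size=15)   # hypothetical, unrelated count

r = np.corrcoef(series_a, series_b)[0, 1]
print(f"Pearson correlation: {r:.3f}")   # typically well above 0.95
```

The shared upward drift over time, not any causal relationship, is doing all of the work; that is exactly how “space spending versus suicides” arrives at an eye-popping 0.993.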


[1] http://www.investopedia.com/terms/p/p-value.asp

[2] Special thanks to John Cardente, from Dell EMC’s Office of the CTO for the overfitting slides


About Bill Schmarzo


CTO, Dell EMC Services (aka “Dean of Big Data”)

Bill Schmarzo, author of “Big Data: Understanding How Data Powers Big Business” and “Big Data MBA: Driving Business Strategies with Data Science”, is responsible for setting strategy and defining the Big Data service offerings for Dell EMC’s Big Data Practice. As a CTO within Dell EMC’s 2,000+ person consulting organization, he works with organizations to identify where and how to start their big data journeys. He’s written white papers, is an avid blogger and is a frequent speaker on the use of Big Data and data science to power an organization’s key business initiatives. He is a University of San Francisco School of Management (SOM) Executive Fellow where he teaches the “Big Data MBA” course. Bill also just completed a research paper on “Determining The Economic Value of Data”. Onalytica recently ranked Bill as the #4 Big Data influencer worldwide.

Bill has over three decades of experience in data warehousing, BI and analytics. Bill authored the Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements. Bill serves on the City of San Jose’s Technology Innovation Board, and on the faculties of The Data Warehouse Institute and Strata.

Previously, Bill was vice president of Analytics at Yahoo where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of “actionable insights” through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing and sales of their industry-defining analytic applications.

Bill holds a Master of Business Administration from the University of Iowa and a Bachelor of Science degree in Mathematics, Computer Science and Business Administration from Coe College.




5 thoughts on “Management Challenge: P-hacking or ‘The Danger of Alternative Facts’”

  1. So much data and so little time to really figure out whether the person disseminating the results has a veiled agenda or at a minimum is introducing some sort of bias; seems to me that all the data collection devices that have been introduced just feed into this mania. This poses a huge challenge for human consumers of data.

    • Andy, I agree that the proliferation of social media sites and the dramatic increase in postings on those social media sites is making it very difficult for humans to discern real news from opinions from outright lies. However, this is where technology, specifically machine learning, could help us lowly humans. We could use machine learning to create a “Reliability and Credibility Score” for the story and the publisher. Social media sites could use such a score to inform the reader as to the “predicted” level of truth with respect to the article, and even use the score to prevent the spread of fake news. This is a great topic for a more detailed blog. Watch this space!

  2. Great introduction to p-hacking, overfitting, selective correlations, and visualization manipulations. The article is well written and creatively takes advantage of the unlimited real estate of the blog medium – no newspaper would run a 20+ panel cartoon.
    Could an AI & Big Data engine help citizens discover the real truth when politicians deliver their talking points? Probably not, as long as the politicians are able to backtest their talking points prior to the delivery. But boy, that would make for a long and attribute-riddled debate.

    • Thanks John for the feedback. I certainly do think that AI (especially Natural Language Processing) and Big Data could help to flag alternative facts or half-truths, but probably can’t eliminate them entirely (this relates to Type I and Type II errors, and wanting to err on the side of letting some articles through to ensure that we’re not stifling free speech). Sounds like a topic for a future blog!!

      Thanks again!
      Bill