Thinking Like A Data Scientist Part IV: Attitudinal Approach
I was reading an article in the July 2nd issue of BusinessWeek titled “Kill Your Desk Chair” where they cite the following “fact:”
A recent study from the Pennington Biomedical Research Center in Baton Rouge, LA, followed 17,000 Canadians over 12 years and found that those who sat for most of the day were 54% more likely to die of heart attacks than those who didn’t.
Now being a person who spends a lot of time sitting behind a desk, or on an airplane, or at sporting events, this “54% more likely to die of heart attacks” fact is very concerning. Should I throw out my current desk and buy one of those expensive “stand up and work”-type desks?
Then I started to think like a data scientist, and started to challenge the assumption that there is some sort of causation between sitting and heart attacks. Some questions that immediately popped to mind included:
- Are there other variables, like lack of exercise or eating habits or age, which might be the cause of the heart attacks?
- Was a control group used to test the validity of the study results?
- Is there something about Canadians that makes them more susceptible to sitting and heart attacks?
- Who sponsored this study? Maybe the manufacturer of these new expensive “stand up and work”-type desks?
One needs to be a bit skeptical when they hear these sorts of “factoids.” We should know better than to just believe these sorts of claims blindly. We’ve all heard the weatherman state that there is a 60% chance of rain on days when there isn’t a cloud in the sky (I guess the weatherman could have just flipped a coin and made as accurate a prediction). And the recent Governor Walker recall election in Wisconsin raised all sorts of concerns when the early exit polls incorrectly predicted the results of the recall vote (and caused some folks to demand a recount of the actual vote because the exit polls didn’t match the actual results).
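To make the first question concrete, here is a minimal Python sketch of how a hidden variable can manufacture a scary-looking statistic. Everything in it is invented for illustration: the age split, the sitting probabilities and the risk levels are my assumptions, not figures from the Pennington study. In this simulated world sitting has no effect at all on heart attacks; age drives both.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

def simulate_person():
    """Return (older, sits, event) for one made-up person."""
    older = rng.random() < 0.5
    # Assumption: older people are more likely to sit most of the day.
    sits = rng.random() < (0.7 if older else 0.3)
    # Assumption: risk depends on age only; sitting is never consulted.
    event = rng.random() < (0.12 if older else 0.04)
    return older, sits, event

people = [simulate_person() for _ in range(200_000)]

def event_rate(rows):
    """Share of rows in which the event occurred."""
    rows = list(rows)
    return sum(event for _, _, event in rows) / len(rows)

def risk_ratio(rows):
    """Event rate of sitters divided by event rate of non-sitters."""
    rows = list(rows)
    sitters = event_rate(r for r in rows if r[1])
    others = event_rate(r for r in rows if not r[1])
    return sitters / others

raw_ratio = risk_ratio(people)
older_ratio = risk_ratio(p for p in people if p[0])
younger_ratio = risk_ratio(p for p in people if not p[0])

print(f"all ages:     sitters have {raw_ratio:.2f}x the risk")
print(f"older only:   sitters have {older_ratio:.2f}x the risk")
print(f"younger only: sitters have {younger_ratio:.2f}x the risk")
```

With these assumed numbers, the all-ages comparison should show sitters at roughly one and a half times the risk, while the ratio inside each age group should sit close to 1. The "effect" of sitting appears only when the age groups are mixed together, which is why the question about other variables matters before anyone buys a new desk.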
Correlation Is Not Causality
A good data scientist knows that there is a big difference between correlation and causality. Causality means that a change in one item actually produces a change in the other, and just because two items move in tandem does not mean that there is causality. The relationship between the events may not even make logical sense. Here are some examples:
Figure 1: Is Facebook Driving The Greek Debt Crisis?
Do we really think that the growth in the number of active Facebook users is actually driving up the yield on the 10-year Greek government bonds? Unless joining Facebook requires all subscribers to sell their 10-year Greek government bonds, there is no causality in this correlation.
Figure 2: Is this random mountain range driving Murders in New York?
Do we really think that this mountain range is driving the murder rate in the state of New York?
Figure 3: Is there causality between newspapers bought and M. Night Shyamalan movies?
Okay, this one might actually be true…
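Charts like these are easy to produce because any two series that trend in the same direction over time will correlate strongly. The Python sketch below illustrates this with two made-up series (the names `users` and `bond_yield` and all of the numbers are hypothetical, not the data behind the figures). They are generated independently of each other, yet their levels correlate; one common sanity check, used here as an assumption about what a skeptic might try, is to correlate the month-over-month changes instead.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the run is repeatable
months = range(60)

# Two unrelated, invented series that both happen to trend upward.
users = [100 + 5 * t + rng.gauss(0, 10) for t in months]
bond_yield = [4 + 0.2 * t + rng.gauss(0, 1) for t in months]

# Correlation of the raw levels: dominated by the shared upward trend.
r_levels = pearson(users, bond_yield)

# Correlation of month-over-month changes: the trend becomes a constant
# and drops out, leaving only the independent noise.
d_users = [b - a for a, b in zip(users, users[1:])]
d_yield = [b - a for a, b in zip(bond_yield, bond_yield[1:])]
r_changes = pearson(d_users, d_yield)

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

The level correlation should come out high even though neither series knows the other exists, and the correlation of changes should be much weaker. A high correlation between two trending lines says very little on its own.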
Thinking Like a Data Scientist: Having a Dubious Attitude
Thinking like a data scientist requires imagination, curiosity and a lot of skepticism to question or challenge whatever analytic insights are derived out of the data. Don’t forget that common sense makes a good yardstick to apply against any analytic results. A good data scientist tends to:
- Be very clear and thorough on defining the hypothesis (and null hypothesis) they are testing; to clearly and articulately state the problem that they are trying to solve and what determines a statistically valid result.
- Embrace an exploratory, discovery, visually inspective analytic process to understand, validate and cleanse the data by throwing out incomplete, inappropriate or inaccurate data, and to not let outliers skew the results.
- Focus on identifying patterns and quantifying correlations in the data through statistical, descriptive and predictive analytics, and then test whether those correlations reflect actual cause and effect.
- Grab whatever data might be available, whether or not the data scientist is even sure that they will use that data, and worry about the data integration issues as they come up in the analytic process.
- Leverage an ELT (Extract Load Transform) process to use in-database processing capabilities to create new metrics that might be better predictors of performance.
- Tolerate “good enough” data to fuel “good enough” decisions; deal with future predictive measures, not historical facts, such that probabilities and the likelihoods of future performance are the typical outputs.
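The first habit in the list above, stating a null hypothesis and deciding up front what counts as a statistically valid result, can be sketched with a permutation test. This is only one of many possible tests, and the function name `permutation_p_value` and the sample numbers are hypothetical. The null hypothesis is that the group labels do not matter, so the code shuffles the labels many times and counts how often a shuffled difference in means is at least as large as the observed one.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(group_a, group_b, n_shuffles=5000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign the labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)

# Invented measurements for illustration only.
baseline = [12, 15, 14, 10, 13, 16, 11, 14]
similar = [13, 14, 12, 15, 11, 14, 16, 12]    # about the same as baseline
shifted = [22, 25, 21, 24, 23, 26, 20, 24]    # clearly higher

p_similar = permutation_p_value(baseline, similar)
p_shifted = permutation_p_value(baseline, shifted)

print(f"baseline vs similar: p = {p_similar:.3f}")
print(f"baseline vs shifted: p = {p_shifted:.4f}")
```

A small p-value says the observed difference would be rare if the labels were meaningless; a large one says the data are consistent with the null hypothesis. Neither outcome says anything about why the groups differ, which is where the skepticism comes back in.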
What Does This Mean to the Business User?
So what does this mean to you as a decision maker?
- Properly set up your hypothesis and detail out your analytic plan. Be clear and precise on what you are trying to prove and the business objectives. Be as granular, transparent and thorough as possible. If you are not clear, you might end up with a “correct” answer that is actually unusable (see chart to the right).
- Thoroughly document your business assumptions. Allow others to review and challenge the reasonableness and validity of your assumptions. Constantly ask if the assumptions are reasonable and realistic. Don’t forget the importance of at least contemplating “black swan” events in your model assumptions.
- Plan for experimentation especially to test those model assumptions that have the biggest influence on the analytic results. Use sample groups to ensure that you are comparing apples to apples. Determine if you have failed enough (explored enough other options) before declaring victory.
- Properly interpret and apply results. Apply the common sense test. Are the results reasonable and are they actionable?
In summary, don’t accept the analytic results blindly just because they come with precise-looking numbers and probabilities. Challenge the analytic results and conclusions drawn from the analytic models, especially from those who may not have the analytic credentials, experience or even the context of the business case against which the hypothesis is being tested. There have been some classic bad decisions drawn from what looked like rock-solid statistical analysis.
To learn more about EMC’s unique approach to leveraging Big Data to drive business value, please check out EMC’s Big Data Vision Workshop offering.
 Graphic courtesy of climbingoutofthedark.blogspot.com