Is Data Science Really Science?

By Bill Schmarzo, CTO, Dell EMC Services (aka “Dean of Big Data”), January 30, 2017

My son Max is home from college and that always leads to some interesting conversations.  Max is in graduate school at Iowa State University where he is studying kinesiology and strength training.  As part of his research project, he is applying physics to athletic training in order to understand how certain types of exercises can lead to improvements in athletic speed, strength, agility, and recovery.

Figure 1:  The Laws of Kinesiology

Max was showing me one drill designed to increase the speed and thrust associated with jumping (Max added 5 inches to his vertical leap over the past 6 weeks, and can now dunk over the old man).  When I asked him about the science behind the drill, he went into great detail about the interaction between the sciences of physics, biomechanics and human anatomy.

Max could explain to me how the laws of physics (the study of the properties of matter and energy), kinesiology (the study of human motion that mainly focuses on muscles and their functions) and biomechanics (the study of movement involved in strength exercise or in the execution of a sport skill) interacted to produce the desired outcomes.  He could explain why it worked.

And that is the heart of my challenges with treating data science as a science.  As a data scientist, I can predict what is likely to happen, but I cannot explain why it is going to happen.  I can predict when someone is likely to attrite, or respond to a promotion, or commit fraud, or pick the pink button over the blue button, but I cannot tell you why that’s going to happen.  And I believe that the inability to explain why something is going to happen is why I struggle to call “data science” a science.

Okay, let the hate mail rain down on me, but let me explain why this is an important distinction!

What is Science?

Science is the intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment.

Science works within systems of laws such as the laws of physics, thermodynamics, mathematics, electromagnetism, aerodynamics, electricity (like Ohm’s law), Newton’s laws of motion, and chemistry.  Scientists can apply these laws to understand why certain actions lead to certain outcomes.  In many disciplines, it is critical (life and death critical in some cases) that the scientists (or engineers) know why something is going to occur:

  • In pharmaceuticals, chemists need to understand how certain chemicals can be combined in certain combinations (recipes) to drive human outcomes or results.
  • In mechanical engineering, building engineers need to know how certain materials and designs can be combined to support the weight of a 40-story building (that looks like it was made out of Lego blocks).
  • In electrical engineering, electrical engineers need to understand how much wiring, what type of wiring and the optimal designs are required to support the electrical needs of buildings or vehicles.

Again, the laws that underpin these disciplines can be used to understand why certain actions or combinations lead to predictable outcomes.

Big Data and the “Death” of Why

A 2008 Wired article by Chris Anderson titled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” really called into question the “science” nature of the data science role.  The premise of the article was that massive amounts of data were yielding insights about human behavior without requiring the heavy statistical modeling typically needed when using sampled data sets.  This is the quote that most intrigued me:

“Google conquered the advertising world with nothing more than applied mathematics. It didn’t pretend to know anything about the culture and conventions of advertising — it just assumed that better data, with better analytical tools, would win the day. And Google was right.”

With the vast amounts of detailed data available and high-powered analytic tools, it is possible to identify what works without having to worry about why it worked.  Maybe when it comes to human behaviors, there are no laws that can be used to understand (or codify) why humans take certain actions under certain conditions.  In fact, we already know that humans are illogical decision-making machines (see “Human Decision-Making in a Big Data World”).

However, there are some new developments that I think will require “data science” to become more like other “sciences.”

Internet of Things and the “Birth” of Why

The Internet of Things (IOT) will require organizations to understand and codify why certain inputs lead to predictable outcomes.  For example, it will be critical for manufacturers to understand and codify why certain components in a product break down most often, by trying to address questions such as:

  • Was the failure caused by the materials used to build the component?
  • Was the failure caused by the design of the component?
  • Was the failure caused by the use of the component?
  • Was the failure caused by the installation of the component?
  • Was the failure caused by the maintenance of the component?

As we move into the world of IOT, we will start to see increased collaboration between analytics and physics.  See what organizations like GE are doing with the concept of “Digital Twins”.

The Digital Twin involves building a digital model, or twin, of every machine – from a jet engine to a locomotive – to grow and create new business and service models through the Industrial Internet[1].

Digital twins are computerized companions of physical assets that can be used for various purposes. Digital twins use data from sensors installed on physical objects to represent their real-time status, working condition or position[2].

GE is building digital models that mirror the physical structures of their products and components.  This allows them not only to accelerate the development of new products, but also to test the products in a greater number of situations to determine metrics such as mean-time-to-failure, stress capability and structural loads.
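To make the idea concrete, here is a toy sketch of my own (not GE's implementation) of what pairing a physical law with sensor data might look like. It uses Ohm's law as the stand-in physics. The `ResistorTwin` class, the 12-ohm design value, the 10% tolerance and the diagnostic messages are all hypothetical assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResistorTwin:
    """Toy digital twin of a resistive component (hypothetical example)."""

    resistance_ohms: float   # assumed design value from a spec sheet
    tolerance: float = 0.10  # assumed allowed relative deviation

    def expected_current(self, voltage: float) -> float:
        # Ohm's law: I = V / R
        return voltage / self.resistance_ohms

    def check(self, voltage: float, measured_current: float) -> str:
        """Compare a sensor reading with what the physical law predicts."""
        expected = self.expected_current(voltage)
        deviation = (measured_current - expected) / expected
        if abs(deviation) <= self.tolerance:
            return "ok"
        # The law suggests *why* the reading is off, not only that it is off.
        if deviation > 0:
            return "resistance lower than designed (possible short or damaged element)"
        return "resistance higher than designed (possible corrosion or loose connection)"


twin = ResistorTwin(resistance_ohms=12.0)
# Made-up sensor readings: (volts, amps)
for volts, amps in [(24.0, 2.0), (24.0, 2.6), (24.0, 1.5)]:
    print(volts, amps, twin.check(volts, amps))
```

A purely statistical model could flag the second and third readings as anomalies. The physics-based twin can go a step further and point at a likely cause, which is the "why" this post is about.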

As the worlds of physics and IOT collide, data scientists will become more like other “scientists” as their digital world will begin to be governed by the laws that govern disciplines such as physics, aerodynamics, chemistry and electricity.

Data Science and the Cost of Wrong

Another potential driver in the IOT world is the substantial cost of being wrong.  As discussed in my blog “Understanding Type I and Type II Errors”, the cost of being wrong (false positives and false negatives) has minimal impact when trying to predict human behaviors such as which customers might respond to which ads, or which customers are likely to recommend you to their friends.

However, in the world of IOT, being wrong (false positives and false negatives) can carry severe or even catastrophic financial, legal and liability costs.  Organizations cannot afford to have planes falling out of the skies or autonomous cars driving into crowds or pharmaceuticals accidentally killing patients.
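A quick back-of-the-envelope sketch shows how the same error rates can go from a rounding error to a disaster. Every number below (case counts, error rates, dollar costs) is a made-up assumption for illustration, and `expected_error_cost` is a hypothetical helper, not something from a real library.

```python
def expected_error_cost(n_cases, fp_rate, fn_rate, cost_fp, cost_fn):
    """Expected total cost of false positives and false negatives.

    fp_rate and fn_rate are expressed as fractions of all cases.
    """
    return n_cases * (fp_rate * cost_fp + fn_rate * cost_fn)


# Marketing (assumed figures): a wasted ad or a missed responder costs little.
marketing = expected_error_cost(
    n_cases=100_000, fp_rate=0.05, fn_rate=0.05, cost_fp=0.10, cost_fn=2.00
)

# Industrial IOT (assumed figures): same error rates, but an unnecessary
# service call is expensive and a missed failure is far more so.
industrial = expected_error_cost(
    n_cases=100_000, fp_rate=0.05, fn_rate=0.05, cost_fp=500.0, cost_fn=250_000.0
)

print(f"marketing:  ${marketing:,.0f}")
print(f"industrial: ${industrial:,.0f}")
```

With identical 5% error rates, the only thing that changes between the two scenarios is the price of each mistake, and that alone moves the total by several orders of magnitude.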

Summary

Historically, big data was not concerned with understanding or quantifying “why” certain actions occurred because, for the most part, organizations were using big data to understand and predict customer behaviors (e.g., acquisition, up-sell, fraud, theft, attrition, advocacy).  The costs associated with false positives and false negatives were relatively small compared to the financial benefit or return.

And while there may never be “laws” that dictate human behaviors, in the world of IOT where organizations are melding analytics (machine learning and artificial intelligence) with physical products, we will see “data science” advancing beyond just “data” science.  In IOT, the data science team must expand to include scientists and engineers from the physical sciences so that the team can understand and quantify the “why things happen” aspect of the analytic models.  If not, the costs could be catastrophic.

 

[1] https://www.ge.com/digital/blog/dawn-digital-industrial-era

[2] https://en.wikipedia.org/wiki/Digital_Twins

About Bill Schmarzo


CTO, Dell EMC Services (aka “Dean of Big Data”)

Bill Schmarzo, author of “Big Data: Understanding How Data Powers Big Business” and “Big Data MBA: Driving Business Strategies with Data Science”, is responsible for setting the strategy and defining the Big Data service offerings and capabilities for Dell EMC Services Big Data Practice. As the CTO for the Big Data Practice, he is responsible for working with organizations to help them identify where and how to start their big data journeys. He’s written several white papers, is an avid blogger and is a frequent speaker on the use of Big Data and data science to power the organization’s key business initiatives. He is a University of San Francisco School of Management (SOM) Executive Fellow where he teaches the “Big Data MBA” course. Bill was ranked as #4 Big Data Influencer by Onalytica.

Bill has over three decades of experience in data warehousing, BI and analytics. Bill authored Dell EMC’s Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements, and co-authored with Ralph Kimball a series of articles on analytic applications. Bill has served on The Data Warehouse Institute’s faculty as the head of the analytic applications curriculum.

Previously, Bill was the vice president of Analytics at Yahoo where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of “actionable insights” through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing and sales of their industry-defining analytic applications.

Bill holds a master’s degree in Business Administration from the University of Iowa and a Bachelor of Science degree in Mathematics, Computer Science and Business Administration from Coe College.

4 thoughts on “Is Data Science Really Science?”

  1. This definition of “science” feels reductionist to me, Bill. Consider quantum mechanics, which challenges that physical laws have anything at all to do with reality – at least at the atomic / subatomic level. Statistical modeling of a process (like this Digital Twin initiative you call out) is perhaps a kind of meta-science, if I may coin a cool sounding $5 word of dubious merit.

    Understanding the probability that a thing will happen is arguably a more fundamental knowledge than an “understanding” of why something might occur. Because as quantum mechanics has shown us, our “whys” are not nearly as immutable as we thought they were. Humans want things – the whole universe in fact – to be reducible, understandable, comprehensible. Yet sometimes, the best we can get is to discover that a thing is 20% likely to occur based on Data from prior observations. No laws, no whys, but a quantification of the possible.

    Maybe “why” is a noble goal, maybe it is tilting at windmills. But I do not think you should discount as science the study and measure of data’s explanatory power (in the statistical sense, not the metaphysical sense of “why”) over process and event. From my vantage point, questing after the probability of an event is far more audacious, more powerful, more fundamental a pursuit than the self-delusional creation of theories and laws – rules that futilely attempt to bind the majesty and chaos of the universe.

    • Hey Scott! Always good to hear from you. Quantum physics? Meta-science? Man, I miss our conversations over beers… or coffee… or beer and coffee!

      Here are a few points that caused me to rethink the term “data scientist” from a science perspective:

      1. Science tends to deal with “laws” or logic – some known and some (like quantum physics) that we are still discovering. Science is governed by some absolutes (mix vinegar with baking soda and you know what’s going to happen and why!)

      2. Much of big data to-date has focused on modeling human behaviors (customer acquisition, customer attrition, promotional effectiveness, treatment effectiveness, attribution analysis), and we know that humans are not governed by logic (or laws, sometimes). See “Human Decision-Making in a Big Data World” – https://infocus.emc.com/william_schmarzo/human-decision-making-in-a-big-data-world/

      3. As we move into the world of the Internet of Things, we will begin to model physical devices like what GE is doing with the digital twins concept. This likely foreshadows the integration of physical sciences (like physics and chemistry) with big data and data science to create IOT solutions.

      Does this mean that we will change the types of analytics that we will build using different modeling techniques and approaches? It could, because we might rely less on trying to predict what is going to happen versus knowing exactly what is going to happen…you know, mixing vinegar with baking soda.

  2. Hi Bill, A very clever article title! It drew me in. I was almost going to reply with fury after the opening section, but as I know your blog, I said I would read on. It’s an interesting space indeed. Working as a data scientist and having graduated with a Bachelor of ‘Science’ degree in Applied Mathematics and Computing many moons ago, I have always seen it as a science. A colleague of mine in SAS recounted a conversation he had with an Intel executive, when he was told they have 10,000 data scientists on the factory floor of the particular plant he was visiting, all with B.Sc. or B.Eng degrees. Having said that, as you describe, the IoT space definitely needs the scientists, engineers and data folks all working together. – As they always were in many places.

    • Thanks David for the feedback. Maybe as is typical of all “sciences”, there is a period of discovery where we are discovering the underlying systems and their relationships. Until we understand these underlying systems and their relationships, much of what we do looks like voodoo and black magic. But once we can quantify (model) the underlying systems and their interrelationships, then the discipline does look more like a science where we can not only predict outcomes, but also understand why those outcomes occur. I think that is where we are going to go with the Internet of Things and the Industrial Internet, and the integration of the data science team with the physical science teams.

      Regarding predicting human outcomes and understanding why those outcomes occur, given the irrational nature of humans, I guess that will always remain voodoo and black magic.