Weaving Data Hay Into Business Gold
The data lake is becoming a hot topic in most of my client meetings. Just recently, several clients in one meeting raised the concern that adding more data to the data lake would just make it harder for them to “find the needle in the haystack.” When you build out the data lake, you do not want to just “dump” more data into it. That leads to a data “swamp.”
A number of activities need to take place to ensure that your data lake doesn’t become a data dump, or swamp. Many of them are things we’ve been doing for years, heck even decades, as part of a solid Enterprise Information Management (EIM) discipline. Data cataloging, metadata development and management, auditing / traceability / lineage, and graduated levels of data governance (e.g., heavily governed, lightly governed, not governed) are just a few of the disciplines needed to ensure that the data in the data lake is discoverable, usable and (relatively) accurate for your analytic needs.
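To make the cataloging, lineage and graduated governance ideas above concrete, here is a minimal Python sketch of a catalog that refuses undocumented data and can trace where a dataset came from. It is an illustration under stated assumptions, not a reference design: the class names, field names and the example datasets (`raw_clicks`, `sessions`, `churn_features`) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Governance(Enum):
    # The three graduated governance levels named in the text.
    HEAVILY_GOVERNED = "heavily governed"
    LIGHTLY_GOVERNED = "lightly governed"
    NOT_GOVERNED = "not governed"


@dataclass
class CatalogEntry:
    # Hypothetical metadata fields, chosen only for illustration.
    name: str
    description: str
    owner: str
    governance: Governance
    lineage: list[str] = field(default_factory=list)  # direct upstream datasets
    tags: set[str] = field(default_factory=set)


class DataCatalog:
    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Guard against "dumping": no description or owner, no entry.
        if not entry.description.strip() or not entry.owner.strip():
            raise ValueError(f"{entry.name}: description and owner are required")
        self._entries[entry.name] = entry

    def search(self, tag: str) -> list[str]:
        """Discoverability: names of all datasets carrying the tag."""
        return sorted(n for n, e in self._entries.items() if tag in e.tags)

    def trace(self, name: str) -> list[str]:
        """Lineage: every upstream dataset of `name`, nearest first."""
        seen: list[str] = []
        queue = list(self._entries[name].lineage)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            if current in self._entries:
                queue.extend(self._entries[current].lineage)
        return seen


catalog = DataCatalog()
catalog.register(CatalogEntry("raw_clicks", "Raw web clickstream events",
                              "web-team", Governance.NOT_GOVERNED, tags={"web"}))
catalog.register(CatalogEntry("sessions", "Clicks grouped into visitor sessions",
                              "analytics", Governance.LIGHTLY_GOVERNED,
                              lineage=["raw_clicks"], tags={"web", "customer"}))
catalog.register(CatalogEntry("churn_features", "Model-ready churn features",
                              "data-science", Governance.HEAVILY_GOVERNED,
                              lineage=["sessions"], tags={"customer"}))

print(catalog.search("customer"))
print(catalog.trace("churn_features"))
```

The design choice worth noting is that `register` rejects entries without a description and an owner. The catalog only keeps the lake from becoming a swamp if loading data and documenting it are the same step.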
Figure 1 below lays out some of the components that we think need to be covered as an organization builds out its data lake.
However, I want to challenge “find a needle in the haystack” as the wrong analogy for the data lake. It reflects a data warehouse / Business Intelligence way of thinking about analysis: slicing and dicing the data haystack in search of needles. Instead, I want you to “think different” and consider the story of Rumpelstiltskin as a better analogy for uncovering the business value buried in the data lake. Let me explain.
Rumpelstiltskin: A Big Data Lesson
The story of Rumpelstiltskin is about a miller who lies to the king, telling him that his daughter can spin straw (hay) into gold. The daughter is forced to spin the straw into gold three times or the king will cut off her head. But if she is successful, the king will instead marry her. Since she can’t really spin straw into gold, a strange imp-like creature offers to spin the hay into gold in exchange for something of value. The first time he is paid with a necklace and the second time with a ring, but by the third time the girl has run out of items of value, so she is forced to promise the imp her firstborn child.
When their first child is born, the imp returns to claim his payment. The now-queen offers him all the wealth she has if she may keep the child, but the imp has no interest in her riches. He finally consents to give up his claim to the child if the queen can guess his name within three days. After failing for two days to guess his name, she wanders out into the woods and comes across the imp hopping around a fire and singing, “Tomorrow, tomorrow, tomorrow, I’ll go to the king’s house; nobody knows my name, I’m called ‘Rumpelstiltskin.’”
And, well, you can figure out the rest of the story.
Data discovery is like trying to “find a needle in a haystack”; data science with a data lake is more like trying to “weave data hay into business gold.” So instead of thinking about the data lake as a haystack in which you are hunting for needles, think about it as the loom on which you weave data hay into business gold.
Data Lake, Data Science and Scores
Let’s take this analogy one step further. One of my favorite data science books, “Moneyball”, argues that:
[Data Science] is about finding variables that are better predictors of performance
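One simple way to act on that idea is to rank candidate variables by how strongly each one tracks the outcome you care about. The Python sketch below uses plain Pearson correlation as the yardstick. That is an assumed, deliberately basic screening technique rather than the method the book describes, and the team-season numbers are invented purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / sqrt(var_x * var_y)


def rank_predictors(candidates, outcome):
    """Sort candidate variables by absolute correlation with the outcome."""
    scored = {name: pearson(values, outcome) for name, values in candidates.items()}
    return sorted(scored.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented data: five hypothetical team seasons, not real statistics.
runs_scored = [650, 700, 720, 760, 800]
candidates = {
    "batting_avg": [0.262, 0.255, 0.270, 0.258, 0.266],
    "on_base_pct": [0.310, 0.322, 0.328, 0.338, 0.350],
}

ranked = rank_predictors(candidates, runs_scored)
for name, r in ranked:
    print(f"{name}: r = {r:.2f}")
```

In this made-up sample the on-base variable ranks first, but a correlation screen is only a starting point. A real project would validate candidate predictors on held-out data before building a score on them.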
For many organizations, the data science team creates “scores” that help them better predict what’s important to their business. Probably the best-known example of a predictive “score” is the FICO Score (see Figure 2).
The FICO score (FICO is short for Fair Isaac Corporation) is a type of credit score that makes up a substantial portion of the credit report that lenders use to assess an applicant’s credit risk and decide whether to extend them a loan. Using mathematical models, the FICO score takes into account various factors including payment history, current level of indebtedness, types of credit used, length of credit history, and new credit. A person’s FICO score will range between 300 and 850. In general, a FICO score above 650 indicates that the individual has a good credit history, although the exact cutoffs vary by lender. People with scores below 620 will often find it substantially more difficult to obtain financing at a favorable rate.
There are opportunities for your data scientists to create these predictive “scores” across a number of different industries to support what’s important to your business. For example, a financial services firm may want to create a “Retirement Readiness” score for each of its clients that takes into consideration their current net worth, current and projected value of their home, current and projected annual income, savings rate, spending patterns (which could be gleaned from sources such as Mint.com), number of dependents (children and parents), etc.
The firm may want to balance this “Retirement Readiness” score with a “Risk Tolerance” score that measures how much financial and investment risk the client is willing to bear. That score could draw on information such as age, years to retirement, number of dependents, years at current job, job title, location and behavioral classifications gleaned from online gambling and investment patterns.
The combination of the “Retirement Readiness” and “Risk Tolerance” scores gives the financial advisor the insights needed at the individual customer level to make the most appropriate investment and budgeting decisions. Figure 3 shows other potential scores across different industries.
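The “Retirement Readiness” and “Risk Tolerance” scores described above could start life as something as simple as the Python sketch below, which blends a handful of the listed inputs into two 0 to 100 scores. Every weight and target in it (ten times income as a net worth target, a 15% savings rate, a 30-year horizon, the 0.2 penalty per dependent) is a hypothetical placeholder. A real data science team would fit those values from client data rather than hand-pick them.

```python
from dataclasses import dataclass


@dataclass
class ClientProfile:
    # A small subset of the inputs listed in the text.
    net_worth: float
    annual_income: float
    savings_rate: float  # fraction of income saved, 0.0 to 1.0
    dependents: int
    years_to_retirement: int


def clamp(value, low=0.0, high=1.0):
    """Keep a component inside [low, high] so no single input dominates."""
    return max(low, min(high, value))


def retirement_readiness(client: ClientProfile) -> int:
    """Return a 0-100 score; all weights and targets are hypothetical."""
    if client.annual_income <= 0:
        raise ValueError("annual_income must be positive")
    wealth = clamp(client.net_worth / (10 * client.annual_income))  # assumed target: 10x income
    saving = clamp(client.savings_rate / 0.15)                      # assumed target: 15% saved
    burden = clamp(1 - 0.2 * client.dependents)                     # assumed penalty per dependent
    return round(100 * (0.5 * wealth + 0.3 * saving + 0.2 * burden))


def risk_tolerance(client: ClientProfile) -> int:
    """Return a 0-100 score; a longer horizon and fewer dependents raise it."""
    horizon = clamp(client.years_to_retirement / 30)                # assumed 30-year horizon
    burden = clamp(1 - 0.2 * client.dependents)
    return round(100 * (0.7 * horizon + 0.3 * burden))


client = ClientProfile(net_worth=500_000, annual_income=100_000,
                       savings_rate=0.15, dependents=0, years_to_retirement=30)
print(retirement_readiness(client), risk_tolerance(client))
```

Keeping the two scores separate, rather than collapsing them into one number, is what lets the advisor see a client who is well prepared but risk-averse differently from one who is under-prepared but willing to take risk.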
Data Science and the Data Lake
One of the most important benefits of the data lake is the enablement of your data science team. The data lake frees the data science team from being handcuffed by the limitations of the data warehouse. The data in the data warehouse has been optimized for Business Intelligence reporting and dashboards: aggregate tables, indices and materialized views as part of a pre-defined data schema designed to address the business monitoring needs of the organization. Business Intelligence and the data warehouse are focused on understanding “What happened?”
The data science team is trying to do something different. They are trying to predict what might happen, and to make evidence-based recommendations about the actions or decisions that customers and front-line employees should take based upon those predictions.
The data lake is going to be a hot and critical topic over the next 18 to 24 months, and there will be lots of temptation to let the conversation digress into a technology-only conversation. However, one of the most important benefits of the data lake is enabling your data science team to mine and enrich the data looking for those better predictors of performance. The primary goal of the data lake, from a business perspective, is to think different and think Rumpelstiltskin: to enable your data science team to not just find needles in haystacks, but instead think about how they “weave the data hay into business gold.”