How I’ve Learned to Stop Worrying and Love the Data Lake
I have to admit that I’ve struggled with the term “Data Lake.” I first heard the term used in 2010 in some marketing collateral about Hadoop from Pentaho (“Pentaho, Hadoop, and Data Lakes”) and was confused by the use of the term and the explanations. Maybe I was confused because lots of the earlier discussions were about how Hadoop would obviate (that is, render obsolete) the need for an enterprise data warehouse.
As I explored with more companies the role of Hadoop within an organization’s enterprise data architecture, I came to realize that the data lake concept isn’t a replacement for the enterprise data warehouse; instead, it complements the enterprise data warehouse. And in many cases, the data lake concept can actually liberate the enterprise data warehouse to do more of what it does best—provide the ability for business analysts to monitor and analyze the historical performance of the organization. Let me explain how I’ve learned to stop worrying and love the data lake.
Example: Analyzing Point of Sale Data
One of the traditional data warehouse examples is for retailers to analyze point-of-sale (POS) transaction data (see Figure 1) to understand market basket analysis and answer questions such as:
- What were the average sales per market basket?
- What is the average margin per market basket?
- What products appear most often in market baskets?
- What is the average percentage of Private Label Products per market basket?
- What is the typical distribution of product categories per market basket?
- What products tend to sell in combination as part of the same market basket?
And of course, I want to answer these questions across the multitude of business dimensions (using classic business intelligence “by” analysis) such as store location, store demographics, product category, time of year, day of week, time of day, promotion, customer type, customer demographics, etc.
This sort of POS data is typically stored in the data warehouse, where these types of questions, and the accompanying business intelligence (BI) analysis (drilling up, drilling down, drilling across) and “by” analysis (“I want to see market basket sales by…”), can be accomplished.
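The “by” analysis above can be sketched with a couple of group-by aggregations. This is a minimal illustration using pandas; the column names (basket_id, store_id, product, sales, margin) are made up for the example and not from any specific retailer’s schema.

```python
# A minimal sketch of market-basket "by" analysis with pandas.
# Column names are illustrative, not from any real POS schema.
import pandas as pd

pos = pd.DataFrame({
    "basket_id": [1, 1, 2, 2, 3, 3],
    "store_id":  ["A", "A", "A", "A", "B", "B"],
    "product":   ["milk", "bread", "milk", "eggs", "bread", "milk"],
    "sales":     [3.50, 2.25, 3.50, 4.10, 2.25, 3.50],
    "margin":    [0.70, 0.45, 0.70, 0.82, 0.45, 0.70],
})

# Average sales and margin per market basket
per_basket = pos.groupby("basket_id")[["sales", "margin"]].sum()
print(per_basket.mean())

# Average basket sales "by" store -- the classic BI drill
by_store = (pos.groupby(["store_id", "basket_id"])["sales"].sum()
               .groupby("store_id").mean())
print(by_store)
```

The same two-step pattern (roll up to the basket, then slice “by” a dimension) extends to any of the dimensions listed above: day of week, promotion, customer demographics, and so on.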
However, what if I wanted to know which sales transactions were scanned into the POS system versus which ones were hand-coded into the POS system? That information might be important for the following reasons:
- Are there specific products for which the scanner doesn’t work and for which the sales clerk needs to manually enter the UPC code into the POS system? This might indicate that certain Consumer Packaged Goods manufacturers are placing their UPC codes in locations on the product packaging that make it harder for the scanner to read; for example, having the UPC code located at the curve of the toilet paper packaging.
- Are there specific sales clerks that hand-code more products than other sales clerks? This might indicate a training problem.
- Are there specific scanners for which more products are hand-coded versus machine scanned? This might indicate scanners that need maintenance or replacement.
To answer these questions, I can’t use the POS receipt. I have to have the “t” logs from the actual transactions that come off of the POS cash register. The “t” logs contain the raw data about each transaction: the exact time (hour:minute:second) of the transaction, the sales clerk’s ID, how the transaction was entered into the system, etc. The “t” log provides the detailed data necessary to answer the questions posed above about hand-entering UPC codes into the POS system. Questions like these, which require access to the detailed transaction logs in their raw format, are the reason I need a data lake.
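Here’s a toy sketch of that kind of t-log mining. The log layout (timestamp|clerk_id|scanner_id|entry_mode|upc) is hypothetical—real t-log formats vary by POS vendor—but the analysis shape is the same: count hand-keyed entries per clerk (or per scanner) and compare rates.

```python
# A hedged sketch of mining raw POS "t" logs for hand-keyed vs. scanned
# entries. The pipe-delimited layout below is hypothetical.
import csv
import io
from collections import Counter

raw = """\
08:15:02|clerk_07|scan_3|SCANNED|012345678905
08:15:09|clerk_07|scan_3|HAND_KEYED|036000291452
08:16:44|clerk_12|scan_1|SCANNED|012345678905
08:17:01|clerk_12|scan_1|SCANNED|041196891171
08:17:30|clerk_07|scan_3|HAND_KEYED|036000291452
"""

hand_keyed = Counter()
total = Counter()
for ts, clerk, scanner, mode, upc in csv.reader(io.StringIO(raw), delimiter="|"):
    total[clerk] += 1
    if mode == "HAND_KEYED":
        hand_keyed[clerk] += 1

# Hand-key rate per clerk -- a possible training-gap signal
rates = {clerk: hand_keyed[clerk] / total[clerk] for clerk in total}
print(rates)
```

Swapping `clerk` for `scanner` in the grouping gives the scanner-maintenance view instead of the training view.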
The data lake is becoming important because I can load my raw, unaltered structured and unstructured data into the data lake as-is without worrying about defining the data model schema before I can load the data. Think “schema on read” (where I define my data model schema and data requirements when I query the data) versus “schema on load” (where I have to define my data model schema as I load the data into the data repository). Think of it as the difference between the POS data analysis and the “t” log analysis.
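A small illustration of the “schema on read” idea: raw events land as-is (here, JSON lines of differing shapes), and structure is imposed only at query time, by projecting just the fields the question needs. The field names are made up for the example.

```python
# "Schema on read": store raw events as-is, apply structure at query time.
# Event shapes and field names below are illustrative.
import json

raw_events = [
    '{"type": "sale", "amount": 12.99, "store": "A"}',
    '{"type": "scan_error", "scanner": "scan_3"}',   # different shape -- fine
    '{"type": "sale", "amount": 4.50, "store": "B"}',
]

# The "schema" lives in the query: project only the fields this question
# needs, skipping records that don't fit.
events = [json.loads(line) for line in raw_events]
total_sales = sum(e["amount"] for e in events if e.get("type") == "sale")
print(total_sales)
```

Under schema on load, the second event would have forced a schema decision (or been rejected) before it ever landed; here it simply sits in the lake until some future query cares about scan errors.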
Central Data Repository: Data Lake versus Data Warehouse?
Readers of my book “Big Data: Understanding How Data Powers Big Business” and followers of my blog have seen the architectural layout below several times (see Figure 2).
Here are previous blogs that I wrote on this topic:
- The Data Warehouse Modernization Act blog
- Modernizing Your Data Warehouse Part 2 blog
- Modernizing Your EDW / Analytics Environment and In-memory Analytics blog
So readers don’t have to reread each of these blogs, let me summarize the differences between each of the major architectural components: the BI/EDW environment, the analytics environment, and the data store or data lake.
- The BI/EDW Environment is your traditional data warehouse that supports the business analysts’ questions and organizational reporting and dashboard needs. This is a production environment with very predictable loads that is SLA-driven and heavily governed. The data in the EDW must be 100% accurate or people go to jail. Most organizations look to standardize on data transformation, database, and BI tools at this level in order to drive down costs and ensure an SLA-compliant environment.
- The Analytics Environment is where your data scientists can self-provision compute environments and desired data sources in order to freely mine the data. This environment is almost the polar opposite of the BI/EDW environment: it’s an exploratory environment with very unpredictable load and usage patterns. It’s an environment where the data scientists need to be free to experiment with new data sources, new data transformations, and new analytic models in order to uncover new insights buried in the data and build predictive and prescriptive models of key business processes. It’s loosely governed and typically allows the data scientists to use whichever tools they prefer in their exploration, analysis, and analytic modeling.
- The Data Lake is the central repository where all the data is loaded “as is.” It should also support data federation to provide access to lightly used data sources that are not a physical part of the data lake, but appear to be. For example, you may not want to download ALL of your detailed social media data from sites such as Facebook, Twitter, Pinterest, Instagram, Tumblr, LinkedIn, Yelp, and Google+ into your data lake, but instead provide a conduit (via the social media site APIs) to those sites for gaining access to their detailed data as needed.
The Hadoop data lake repository can store ALL the organization’s data “as is” in a low-cost HDFS environment (without the added burden of predefining your data schemas), and then feed both the production enterprise data warehouse / business intelligence environment and the ad hoc, exploratory analytics sandbox as necessary. Get comfortable with the fact that the data lake may contain data that is never intended to reach the data warehouse.
EDW Enhancement Example: Do the ETL/ELT In The Data Lake
Doing ETL (Extract, Transform, Load) within your data warehouse is common today. However, if your data warehouse is already overloaded, why do that batch-centric, data-management-heavy work in an expensive environment?
Instead, move the ETL processes off your EDW platform and do the ETL/ELT (Extract, Load, Transform) work in an inherently parallel, open source, cost-effective, scale-out environment like Hadoop. Doing the ETL (as well as ELT) within Hadoop allows you to leverage that natively parallel environment to bring the appropriate compute capabilities to bear at the appropriate times to get the job done more quickly and more cost effectively.
Not only does using Hadoop for your ETL/ELT work make sense from the perspective of cost and processing effectiveness, but it also gives you the capability to create new data transformations that are difficult to do using traditional ETL tools. For example, the creation of new metrics around customer and product performance leveraging frequency (how often), recency (how recently) and sequencing (in what order), can yield new insights that might be better predictors of customer behaviors and product performance.
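The frequency/recency/sequencing metrics mentioned above can be sketched over a toy purchase history. In a Hadoop ELT pipeline this same computation would typically run in parallel (e.g., as a MapReduce or Spark job); the plain-Python version below just shows the shape of the metrics. Column names and the reference date are illustrative.

```python
# A hedged sketch of frequency (how often), recency (how recently), and
# sequencing (in what order) metrics over a toy purchase history.
from datetime import date

purchases = [  # (customer, product, purchase_date) -- illustrative data
    ("cust_1", "widget", date(2014, 6, 1)),
    ("cust_1", "widget", date(2014, 6, 20)),
    ("cust_1", "gadget", date(2014, 6, 25)),
    ("cust_2", "widget", date(2014, 4, 2)),
]
today = date(2014, 7, 1)  # reference date for recency

metrics = {}
for cust in {c for c, _, _ in purchases}:
    history = sorted((d, p) for c, p, d in purchases if c == cust)
    metrics[cust] = {
        "frequency": len(history),                      # how often
        "recency_days": (today - history[-1][0]).days,  # how recently
        "sequence": [p for _, p in history],            # in what order
    }
print(metrics)
```

Metrics like these, derived in the lake rather than in the warehouse, can then be fed back into the EDW as new predictive attributes on the customer or product dimensions.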
Beware That Your Data Lake Doesn’t Become Your Data Garage
We all know what our garage looks like. Tons of boxes, some unopened from the previous move, sit buried in the garage. And in California it’s even worse, as most people park their overly expensive cars in the streets so that they can pack more junk in their garages.
The garage has truly become a dumping ground for everything that we thought at one time or another might be valuable. The writers for the movie “Raiders of the Lost Ark” got it right when they decided that the best way to hide the invaluable Ark of the Covenant was in a massive warehouse. Yep, Figure 3 looks like my garage.
Joe Dossantos here at EMC uses a metaphor to talk about this “finding the data” challenge with respect to the data lake. Let’s say you had the capacity to build a beautiful new library to store any book that has ever been written. If, once you built it, a truck dumped 1 million volumes into the reading room, what value would that be to the people looking for a particular book or theme of books? Here is what we need to take into consideration as we design our data lake:
- How do you develop the equivalent Dewey Decimal system to help people find the things that they are looking for in the data lake?
- How can you deliver even more value from understanding the contents (metadata) of the data lake? Couldn’t you help people understand the general idea of the contents without reading each book? What is the general opinion of Napoleon? Was the War of 1812 a good idea?
Solutions to this problem already exist in tools like Apache Solr. With Solr, you can know not only where data lives and what data is available in the data lake, but also understand what the data in the lake actually means.
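As a concrete sketch, here is what a catalog query against a Solr index over the lake might look like. The core name (`datalake_catalog`), host, and field names are hypothetical; the `/select` endpoint and the `q`/`fq`/`rows` parameters are standard Solr query parameters.

```python
# A sketch of building a data-lake catalog query for Apache Solr.
# Core name, host, and field names below are hypothetical.
from urllib.parse import urlencode

base = "http://solr.example.com:8983/solr/datalake_catalog/select"
params = {
    "q": 'content:"Napoleon"',        # full-text search over indexed content
    "fq": "source_system:pos_tlogs",  # filter to a single data source
    "rows": 10,                       # page size
    "wt": "json",                     # response format
}
query_url = base + "?" + urlencode(params)
print(query_url)
# In practice you'd issue an HTTP GET against this URL and read the
# matching documents from the JSON response.
```

This is the Dewey Decimal question from above in executable form: the index answers “where is the data and what is it about?” without anyone having to read every “book” in the lake.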
The data lake provides an order-of-magnitude improvement to your data architecture in terms of capabilities and agility. Not only does it free up expensive EDW resources, but it also enables a self-sufficient analytics environment whose data requests can be fulfilled without putting the EDW’s service level agreements at risk.
And this is not just wishful thinking. Figure 4 is an example of a company that is rapidly embracing the data lake concept, not only to free up EDW resources and enable the big data analytics sandbox, but also to serve as the foundation for its future EDW. Lots of interesting and liberating data architecture and data management approaches are going to get blown up. Time to stop worrying and learn to love the data lake.