Why Do I Need A Data Lake?
The data lake is gaining lots of momentum among the customers I talk to. Every organization, and I mean every organization, wants to learn why and how to implement a data lake. But “because it is a cheaper way to store/manage data” is not a good reason to adopt a data lake. The answer to “Why do I need a data lake?” is much more powerful than just having the IT organization save some money.
The data lake is a powerful data architecture that leverages the economics of big data (where it is 20x to 50x cheaper to store, manage and analyze data as compared to traditional data warehouse technologies). And new big data processing and analytics capabilities help organizations address business and operational challenges that were difficult to address using conventional Business Intelligence and data warehousing technologies.
The data lake has the potential to transform the business by providing a singular repository of all the organization’s data (structured AND unstructured data; internal AND external data) that enables your business analysts and data science team to mine all of the organizational data that today is scattered across a multitude of operational systems, data warehouses, data marts and “spreadmarts”.
Analytics Hub and Spoke Service Architecture
The value and power of a data lake are often not fully realized until we get into our second or third analytics use case. Why is that? Because it is at that point where the organization needs the ability to self-provision an analytics environment (compute nodes, data, analytic tools, permissions, data masking) and share data across traditional line-of-business silos (one singular location for all the organization’s data) in order to support the rapid exploration and discovery processes that the data science team uses to uncover variables and metrics that are better predictors of business performance. The data lake enables the data science team to build the predictive and prescriptive analytics necessary to support the organization’s different business use cases and key business initiatives.
Joe Dossantos, the head of EMC Global Services Big Data Delivery team, termed this a “Hub and Spoke” analytics environment where the data lake is the “hub” that enables the data science teams to self-provision their own analytic sandboxes and facilitates the sharing of data, analytic tools and analytic best practices across the different parts of the organization (see figure 1).
The hub of the “Hub and Spoke” architecture is the data lake. The data lake has the following characteristics:
- Centralized, singular, schema-less data store with raw (as-is) data as well as massaged data
- Mechanism for rapid ingestion of data with appropriate latency
- Ability to map data across sources and provide visibility and security to users
- Catalog to find and retrieve data
- Costing model of centralized service
- Ability to manage security, permissions and data masking
- Supports self-provisioning of compute nodes, data, and analytic tools without IT intervention
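To make that list a little more concrete, here is a minimal Python sketch of a hub that keeps raw data as-is, registers it in a catalog, and lets a data scientist self-provision a masked sandbox without filing an IT ticket. Everything in it (the `DataLake` class, its method names, the sample datasets and the hash-based masking) is a hypothetical illustration, not how any particular product works.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DataLake:
    """Toy hub: raw storage, a catalog, and policy-driven sandboxes (all hypothetical)."""
    store: dict = field(default_factory=dict)    # dataset name -> raw records, kept as-is
    catalog: dict = field(default_factory=dict)  # dataset name -> descriptive tags
    masked: dict = field(default_factory=dict)   # dataset name -> fields to mask

    def ingest(self, name, records, tags, masked_fields=()):
        """Capture records without imposing a schema and register them in the catalog."""
        self.store[name] = list(records)
        self.catalog[name] = set(tags)
        self.masked[name] = set(masked_fields)

    def find(self, tag):
        """Catalog lookup: names of all datasets carrying the given tag."""
        return sorted(n for n, tags in self.catalog.items() if tag in tags)

    def provision_sandbox(self, names):
        """Self-service copy of the requested datasets with masking applied."""
        return {n: [self._mask(n, r) for r in self.store[n]] for n in names}

    def _mask(self, name, record):
        # Replace sensitive values with a short one-way hash; equal inputs
        # produce equal tokens, so masked columns can still be joined on.
        return {
            k: hashlib.sha256(str(v).encode()).hexdigest()[:12]
            if k in self.masked[name] else v
            for k, v in record.items()
        }


lake = DataLake()
lake.ingest(
    "pos_transactions",
    [{"customer_email": "ann@example.com", "amount": 42.5}],
    tags={"sales", "internal"},
    masked_fields={"customer_email"},
)
lake.ingest("weather_feed", [{"city": "Austin", "temp_f": 97}], tags={"external"})

# The data scientist searches the catalog and provisions a sandbox in one step.
sandbox = lake.provision_sandbox(lake.find("sales"))
```

The point of the sketch is the division of labor: the masking and permission rules live in the hub as policy, so the sandbox request itself needs no human in the middle.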
The spokes of the “Hub and Spoke” architecture are the resulting analytic use cases that have the following characteristics:
- Ability to perform analytics (data scientist)
  - Analytics sandbox (HDFS, Hadoop, Spark, Hive, HBase)
  - Data engineering tools (Elasticsearch, MapReduce, YARN, HAWQ, SQL)
  - Analytical tools (SAS, R, Mahout, MADlib, H2O)
  - Visualization tools (Tableau, DataRPM, ggplot2)
- Ability to exploit analytics (application development)
  - 3rd platform application (mobile app development, web site app development)
  - Analytics exposed as services to applications (APIs)
  - Integrate in-memory and/or in-database scoring and recommendations into business process and operational systems
The Analytics “Hub and Spoke” architecture enables the data science team to develop the predictive and prescriptive analytics that are necessary to optimize key business processes, provide a differentiated customer engagement and uncover new monetization opportunities.
Beware The False Prophets!
You know something must be on the right track when the market incumbents are working so hard to either discredit it or spread confusion about it. And that seems to be the case for the data lake. Lots of vendors, press and analysts are trying to position the data lake as just an extension to the data warehouse; as data warehouse 2.0. And with that sort of thinking, we risk repeating many of the fatal mistakes we made with data warehousing.
Confusion #1: “Feed the Data Lake from the Data Warehouse.”
That’s ridiculous and is being pushed by traditional data warehouse vendors as the most appropriate use of the data lake. Sorry, but that’s like inventing the jet engine and then saying that you’re going to pull it with a horse and buggy.
Loading data into a data warehouse means that someone has already made assumptions about what data, level of granularity and amount of history are important. You have to make those assumptions in order to pre-build the data warehouse schema. And that means that the raw data has already gone through data transformations (and content elimination) in order to get the data to fit into the data warehouse schema. That is a lot of assumptions being made a priori about what data, data granularity and data history are important, when the only purpose of the data warehouse is to report on what happened! That’s like going wine tasting and swabbing Vaseline on your tongue! Many of the valuable nuances in the data have been removed in order to aggregate the data to fit into a reporting-centric data schema.
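Here is a small Python sketch of what that content elimination looks like in practice. The events, field names and the "daily totals" design decision are all made up for illustration; the only claim is that once data has been aggregated to fit a reporting schema, questions at a finer grain can no longer be answered from it.

```python
from collections import defaultdict

# Hypothetical raw point-of-sale events, as the source system emitted them.
raw_events = [
    {"store": "A", "day": "2015-03-02", "hour": 8, "amount": 12.0},
    {"store": "A", "day": "2015-03-02", "hour": 18, "amount": 55.0},
    {"store": "A", "day": "2015-03-02", "hour": 19, "amount": 61.0},
    {"store": "B", "day": "2015-03-02", "hour": 9, "amount": 40.0},
]

# Warehouse-style load: someone decided up front that daily totals are enough,
# so the hour of each sale is discarded on the way in.
daily_sales = defaultdict(float)
for event in raw_events:
    daily_sales[(event["store"], event["day"])] += event["amount"]

# A later question from the data science team: what share of revenue arrives
# from 5 pm onward? Only the raw events can answer it; daily_sales has no
# notion of time of day.
evening_share = (
    sum(e["amount"] for e in raw_events if e["hour"] >= 17)
    / sum(e["amount"] for e in raw_events)
)
```

Time of day might turn out to be a great predictor or a useless one. The issue is that with only `daily_sales` in hand, nobody will ever find out.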
As you can see in figure 2, the data lake sits in front of the data warehouse to provide a data repository that can leverage the “economics of big data” (where it is 20x to 50x cheaper to store, manage and analyze data as compared to traditional data warehousing technologies) to store any and all data (structured AND unstructured; internal AND external) that the organization might want to leverage. What are the benefits of having the data lake in front of the data warehouse?
- Rapid ingest of data because the data lake captures data “as-is”; that is, it does not need to create a schema before capturing the data.
- Un-handcuffing the data science team from having to try to do their analysis on the overly expensive, overly taxed data warehouse.
- Supporting the data science team’s need for rapid exploration, discovery, testing, failing, learning and refining of the predictive and prescriptive analytics that power the organization’s key business processes and enable new business models.
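The "as-is" ingest in the first bullet is often described as schema-on-read: store the raw records untouched and apply a schema only when someone reads them. A minimal Python sketch follows, assuming JSON lines as the raw format; the sample records and the `read_with_schema` helper are hypothetical.

```python
import json

# Raw feed lines arrive in whatever shape each source produces (hypothetical data).
raw_lines = [
    '{"id": 1, "amount": "19.99", "channel": "web"}',
    '{"id": 2, "amount": "5.00"}',
    '{"id": 3, "amount": "12.50", "channel": "mobile", "coupon": "SPRING"}',
]

# Ingest: append as-is. No table definition, no dropped fields.
lake_file = list(raw_lines)


def read_with_schema(lines, schema):
    """Apply a schema at read time: pick fields, cast types, use None when a field is absent."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        }


# The first use case only needs amounts.
revenue_view = list(read_with_schema(lake_file, {"id": int, "amount": float}))

# A later use case asks about coupons; the field is still available because
# nothing was discarded at ingest time.
coupon_view = list(read_with_schema(lake_file, {"id": int, "coupon": str}))
```

Each use case brings its own schema to the same raw data, which is what lets the second or third analytics project start exploring right away instead of waiting on a schema change.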
The additional benefits of this architecture:
- Provides an analytics environment where the data science team is free to explore new data sources and new analytic techniques in search of those variables and metrics that may be better predictors of business performance
- Frees up expensive data warehouse resources and opens up SLA windows by off-loading the ETL processes from the data warehouse and putting those processes into the natively parallel, scale-out, less expensive data lake
Clearly having the data lake in front of the data warehouse is a win-win for both the data warehouse administrators and the data science organization.
Confusion #2: “Create multiple data lakes.”
Oh, the creation of multiple data warehouses and multiple supporting data marts has worked out soooo well for the world of data warehousing. Disparate, duplicated data warehouses and data marts are a debilitating problem in the world of data warehouses. Not only does this hinder the sharing of data across departments and lines of business, but more importantly it causes confusion and erodes senior management’s confidence in the data. How can senior management be confident that they are dealing with the “right” data when every business unit or business function has created its own data warehouse?
The result: siloed data and no easy way (or willingness) to share data across the business units.
For the data lake to be effective, an organization should deploy only ONE data lake: a singular repository where all of the organization’s data (whether the organization knows what to do with that data or not) can be made available. Organizations such as EMC are leveraging technologies such as virtualization to ensure that a single data lake repository can scale out and meet the growing analytic needs of the different business units.
Do me a big data favor and scold anyone who starts talking about data lakes (plural) instead of a data lake.
Confusion #3: Dependent upon IT to manually allocate analytic sandboxes.
Why insert a human-intensive IT intermediary into a process that can easily be managed, controlled and monitored by the system? The data science team needs to be free to explore new data sources and new analytic techniques without adding a labor-intensive middle step to have someone allocate the analytic sandbox environment. IT as a Service, baby! This seems more like a control issue than a technology issue, so fight IT’s urge to control the data science creative process.
The data lake is a game-changer not because it saves IT a whole bunch of money, but because the data lake can help the business make a whole bunch of money! Do not get caught up in the ability to build a data lake; instead, focus on how the data lake can “Make me more money.”