Data Lake and the Cloud: Pros and Cons of Putting Big Data Analytics in the Public Cloud
A question that I surprisingly never get asked is “what about putting the data lake in The Cloud?” Now maybe I’m not asked that question because organizations are still confused about what a data lake is. Or maybe I’m not asked that question because everyone (but me) already knows the answer?
Well, I thought I’d partner with my super smart friend Brandon Kaier (twitter: @bkaier) to write a blog, mostly for my own benefit (wouldn’t be the first time). I need to start understanding how the data lake benefits or doesn’t benefit from the cloud. There must be some overlap, because both are focused on driving down the cost of managing IT resources and improving users’ ability to get access to those resources.
But I get this feeling that there are some serious considerations and issues about how organizations should be thinking about the data lake in The Cloud. I bet that the most serious issues come not from storing and managing the data itself. My bet is that the issues arise in providing an agile, fail fast, analytic sandbox environment, securely, with data and analytic mobility, features that we would expect out of The Cloud. Let me explore that further.
What is a Data Lake?
There are plenty of technical definitions you can google about what “it” is. More importantly, let’s start the conversation by making sure that we understand what a data lake “does” and what it “means” to the business. Here is what I think is most important about a data lake:
A Data Lake is a SINGLE repository for storing (either physically or logically) all the organization’s data, including data generated from internal transactions and interactions as well as data gathered from third party and publicly available sources. The Hadoop Distributed File System (HDFS) is the preferred data lake platform because it provides a cost-effective, powerful, agile, scale-out environment for assembling, preparing, aligning, enriching, and analyzing diverse structured and unstructured data sources.
The Data Lake provides the following benefits:
- Rapid ingest of data as-is; it is not necessary to build a schema first or transform in order to ingest the data
- Can store structured (tables, comma delimited, RDBMS), semi-structured (log files, clickstream, social media) and unstructured (text, video, photos, audio) data
- Leverage the natively parallel, scale-out Hadoop environment to offload ETL processing from the expensive data warehouse environment
- TWO BIGGIES! Frees up the data science team from being dependent on the highly structured, less agile data warehouse for their rapid data ingest / fail fast / learn faster model development, testing and refinement processes. Allows for the data to be interrogated with multiple tools simultaneously. The combination of these two capabilities allows the Data Scientists to network their efforts for significantly better results.
What is The Cloud?
The Cloud is a general term for the delivery of hosted services over the Internet. The Cloud should enable companies to consume compute resources as a utility — just like electricity — rather than having to build and maintain computing infrastructures in-house.
This tends to be what people, especially in the lines of business, think of when they hear the phrase “The Cloud.”
Or maybe Jason Segel in the movie “Sex Tape” got it right:
When many folks think of The Cloud, they immediately think of the Amazon and Google public clouds, which provide an inexpensive option for organizations that want to quickly stand up a computing and related storage environment. One can literally buy this environment with a credit card and (roughly) pay only for the computing and storage that is actually needed. Again, this perception is especially true within the lines of business.
Why Not Put The Data Lake In The Cloud?
If The Cloud is delivering resources to me in a utility model, it seems like a natural match to put the data lake in the cloud. In fact, one might call a Data Lake a purpose-built cloud. But the conversation just isn’t that simple. There are some important considerations before one should make the jump to The Cloud, especially the public cloud. For the business these considerations include:
- Personally identifiable information (PII), sensitive personal information (SPI), information covered by the Health Insurance Portability and Accountability Act (HIPAA) and other confidential and sensitive data generally cannot be put in the public cloud without meeting strict regulatory and contractual safeguards. There are rules and substantial fines (and firings) for organizations that break those rules.
- Confidential financial data (such as sales, orders, returns, margins, profits) probably should not be put in the public cloud. If this type of data were to get into the wrong hands, it could cause organizations major financial and business operational problems and potentially substantial losses of market value.
- Can you trust the security of your company’s Intellectual Property to a piece of paper? More and more, data science and analytic models are becoming the IP that fuels new business processes, models or entire businesses. The question is whether your IP is too valuable to entrust to someone else’s definition of “secure”. Recent news should be enough to give everyone pause. In fact, the need to protect these assets is so paramount to some organizations that they are going so far as to purchase a stake in technology companies to ensure that the company whose technology backs these new models can’t make changes to the technology that would put the business model at risk.
For the Data Science Team, the biggest challenges are physical – it’s just darn difficult to move large volumes of data between an on-premises environment and the public cloud (and when it needs to be done repeatedly, it can get very expensive very quickly). In the world of data science, data scientists want to work with large sets of very detailed (highly granular) data that can change often.
For example, let’s say that we want to understand how valuable a customer is to the organization. Organizations should be able to easily calculate (if they don’t have data silos) how valuable a customer is today by looking at that customer’s purchase history, returns history, payment history, product margins, frequency of purchases, time sequence of purchases and any costs associated with selling to and servicing that customer. But let’s say that we are trying to predict how valuable a customer might become, in which case we might want to bring in other data sources such as:
- Social media data in order to create an advocacy score and determine their likelihood to recommend. We might also want to mine the social media data (using graph analysis) to determine that customer’s social network and what sort of influence (net promoter score) that customer has on others within their network.
- Clickstream data in order to determine how often this customer accesses our website to research products, seek support, make purchases or any number of activities on the website.
- Mobile data in order to determine how often this customer accesses the mobile app, what they are doing when they access the mobile app, and if they are subsequently sharing any information with others via the mobile app and social media channels.
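As a rough illustration of the “how valuable is a customer today” baseline described above, here is a minimal Python sketch. The `Purchase` record, the `current_customer_value` function and the sample numbers are all hypothetical, and the formula (margin on purchases that were kept, minus the cost to sell to and service the customer) is just one simple assumption about how an organization might define current value.

```python
from dataclasses import dataclass


@dataclass
class Purchase:
    # Hypothetical record shape; real purchase history would have many more fields.
    revenue: float          # amount the customer paid
    margin_rate: float      # product margin as a fraction of revenue (0.0 to 1.0)
    returned: bool = False  # True if the item was later returned


def current_customer_value(purchases, cost_to_serve):
    """Margin earned on kept purchases minus the cost of selling to and servicing the customer."""
    gross_margin = sum(p.revenue * p.margin_rate for p in purchases if not p.returned)
    return gross_margin - cost_to_serve


# Illustrative data only.
history = [
    Purchase(revenue=200.0, margin_rate=0.25),
    Purchase(revenue=100.0, margin_rate=0.5, returned=True),
    Purchase(revenue=400.0, margin_rate=0.5),
]
value = current_customer_value(history, cost_to_serve=50.0)
print(value)
```

The social media, clickstream and mobile sources listed above would then feed a predictive model on top of this baseline, which is where the data volumes start to matter.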
Each of these data sources is quite voluminous. Integrating the organization’s financial and operational data with its social media data, clickstream data and mobile data requires a significant amount of bandwidth just to move the data into and out of the different data science sandboxes. Exacerbating the problem is the “fail fast / learn faster” mentality of the Data Science process. Your data scientist just loaded two extremely large data sets into your data lake, only to discover that those data sets provide no appreciable value or insight into their problem set, so they just want to delete them. There are incremental costs to the movement of that data to The Cloud.
So how would an organization move this data to the public cloud cost effectively? Amazon’s solution (which seems very economical) is to use their Snowball product. What is Snowball? It’s a large storage device that organizations have delivered to them, load with data and ship back. Yep, data transfer courtesy of FedEx!
As the Amazon website states: “AWS Import/Export Snowball – Transfer 1 Petabyte Per Week Using Amazon-Owned Storage Appliances.” This is nothing new technologically; it has existed for decades under the name “Sneakernet”. It is known for very high bandwidth but incredibly long latency. That’s not exactly my idea of how to best support the data science “rapid data ingest / fail fast / learn faster” model development, testing and refinement processes.
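To put “high bandwidth but long latency” into numbers, here is a back-of-the-envelope Python sketch. It converts the quoted 1 petabyte per week into an effective throughput and compares it with pushing the same petabyte over a network link. The 1 Gbit/s link is a hypothetical example, a decimal petabyte (10^15 bytes) is assumed, and protocol overhead, compression and the time to load and unload the appliance are ignored.

```python
PETABYTE_BITS = 8 * 10**15           # assumes 1 PB = 10^15 bytes, 8 bits per byte
SECONDS_PER_DAY = 86_400
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY


def effective_gbps(bits, seconds):
    """Average throughput in Gbit/s for moving `bits` in `seconds`."""
    return bits / seconds / 1e9


def network_transfer_days(bits, link_gbps, utilization=1.0):
    """Days needed to push `bits` over a link running at `link_gbps` Gbit/s."""
    return bits / (link_gbps * 1e9 * utilization) / SECONDS_PER_DAY


# 1 PB per week, expressed as an average line rate.
snowball_gbps = effective_gbps(PETABYTE_BITS, SECONDS_PER_WEEK)

# The same petabyte over a hypothetical, fully utilized 1 Gbit/s link.
days_on_1gbps = network_transfer_days(PETABYTE_BITS, link_gbps=1.0)

print(f"1 PB/week is about {snowball_gbps:.1f} Gbit/s on average")
print(f"1 PB over 1 Gbit/s takes about {days_on_1gbps:.1f} days")
```

Under these assumptions the shipped appliance averages roughly 13 Gbit/s, while the 1 Gbit/s link needs about three months for the same petabyte. The catch is that nothing is usable until the box arrives, which is the latency problem for a fail fast workflow.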
The data lake in the public cloud problem is a physics problem – data movement is still the bitch of our industry. Let’s face it, the real problem with Big Data is that it is big, and big things are hard to move. Given the data science team’s need for “rapid data ingest / fail fast / learn faster” model development, testing and refinement processes, I just don’t see how the public cloud plays in the data lake deployment other than to support one-off, skunk works types of projects that, once any level of value is demonstrated, don’t get moved back to the on-prem cloud.
That said, there is a place for The Cloud in the broader ecosystem. Global deployment or access to the results of the data science is an excellent use case. There are many organizations that are currently doing just that. The data science is all done in house but the applications that interface or deliver the analytics results (recommendations, scores) are being hosted in The Cloud allowing for global access to the results.
Am I missing something?