Data Lake and the Cloud: Pros and Cons of Putting Big Data Analytics in the Public Cloud

By Bill Schmarzo, CTO, Dell EMC Services (aka “Dean of Big Data”), October 17, 2016

A question that I surprisingly never get asked is “what about putting the data lake in The Cloud?”  Now maybe I’m not asked that question because organizations are still confused as to what is a data lake?  Or maybe I’m not asked that question because everyone (but me) already knows the answer?

Well, I thought I’d partner with my super smart friend Brandon Kaier (twitter: @bkaier) to write a blog, mostly for my own benefit (wouldn’t be the first time). I need to start understanding how the data lake benefits or doesn’t benefit from the cloud. There must be some overlap, because both are focused on driving down the cost of managing IT resources and on users’ ability to get access to those resources.

But I get this feeling that there are some serious considerations and issues about how organizations should be thinking about the data lake in The Cloud. I bet that the most serious issues come not from storing and managing the data itself. My bet is that the issues arise in providing an agile, fail fast, analytic sandbox environment, securely, with data and analytic mobility, features that we would expect out of The Cloud.  Let me explore that further.

What is a Data Lake?

There are plenty of technical definitions you can google about what “it” is. More importantly, let’s start the conversation by making sure that we understand what a data lake “does” and what it “means” to the business. Here is what I think is most important about a data lake:

A Data Lake is a SINGLE repository for storing (either physically or logically) all the organization’s data, including data generated from internal transactions and interactions as well as data gathered from third party and publicly available sources. The Hadoop Distributed File System (HDFS) is the preferred data lake platform because it provides a cost-effective, powerful, agile, scale out environment for assembling, preparing, aligning, enriching, and analyzing diverse structured and unstructured data sources.

The Data Lake provides the following benefits:

  • Rapid ingest of data as-is; it is not necessary to build a schema first or transform in order to ingest the data
  • Can store structured (tables, comma delimited, RDBMS), semi-structured (log files, clickstream, social media) and unstructured (text, video, photos, audio) data
  • Leverage the natively parallel, scale out Hadoop environment to off-load ETL processes from the expensive data warehouse environment
  • TWO BIGGIES! Frees up the data science team from being dependent on the highly structured, less agile data warehouse for their rapid data ingest / fail fast / learn faster model development, testing and refinement processes. Allows for the data to be interrogated with multiple tools simultaneously. The combination of these two capabilities allows the Data Scientists to network their efforts for significantly better results.
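To make the “ingest as-is” point concrete, here is a minimal sketch of the idea, assuming a toy in-memory store rather than a real HDFS cluster. The names `raw_zone`, `ingest`, `read_orders` and `read_clicks` are hypothetical and exist only for illustration. Data lands untouched, and a structure is applied only when somebody reads it.

```python
import csv
import io
import json

# Hypothetical stand-in for the lake: raw payloads stored untouched, keyed by source name.
raw_zone = {}


def ingest(source, payload):
    """Land the payload exactly as received: no schema, no transformation."""
    raw_zone.setdefault(source, []).append(payload)


def read_orders():
    """Apply a structure to the raw CSV text only at read time."""
    rows = []
    for payload in raw_zone.get("orders_csv", []):
        for rec in csv.DictReader(io.StringIO(payload)):
            rows.append({"customer_id": rec["customer_id"],
                         "amount": float(rec["amount"])})
    return rows


def read_clicks():
    """Parse newline-delimited JSON click events only when they are needed."""
    clicks = []
    for payload in raw_zone.get("clickstream_json", []):
        for line in payload.splitlines():
            if line.strip():
                clicks.append(json.loads(line))
    return clicks


# Two differently shaped sources are ingested with the same call.
ingest("orders_csv", "customer_id,amount\nc1,19.99\nc2,5.00\n")
ingest("clickstream_json", '{"customer_id": "c1", "page": "/support"}\n')
```

Because nothing is parsed on the way in, a data scientist can throw a new source into the lake first and decide later whether, and how, to read it.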

What is The Cloud?

The Cloud is a general term for the delivery of hosted services over the Internet. The Cloud should enable companies to consume compute resources as a utility — just like electricity — rather than having to build and maintain computing infrastructures in-house.

This tends to be what people, especially in the lines of business, think of when they hear the phrase “The Cloud.”

Or maybe Jason Segel in the movie “Sex Tape” got it right:

When many folks think of The Cloud, they immediately think of the Amazon and Google public clouds providing an inexpensive option for organizations that quickly want to stand up a computing and related storage environment.  One can literally buy this environment with a credit card and (roughly) only pay for what computing and storage is actually needed. Again, this perception is especially true within the lines of business.

Why Not Put The Data Lake In The Cloud?

If The Cloud is delivering resources to me in a utility model, it seems like a natural match to put the data lake in the cloud; in fact, one might call a Data Lake a purpose-built cloud. But the conversation just isn’t that simple. There are some important considerations before one should make the jump to The Cloud, especially the public cloud. For the business these considerations include:

  • Personally identifiable information (PII), sensitive personal information (SPI), information covered by the Health Insurance Portability and Accountability Act (HIPAA) and other confidential and sensitive data cannot be put in the public cloud. There are rules and substantial fines (and firings) for organizations that break those rules.
  • Confidential financial data (such as sales, orders, returns, margins, profits) probably should not be put in the public cloud. If this type of data were to get into the wrong hands, it could cause organizations major financial and business operational problems and potentially substantial losses of market value.
  • Can you trust the security of your company’s Intellectual Property to a piece of paper? More and more, data science and analytic models are becoming the IP that fuels new business processes, models or entire businesses. The point here: given the value of your IP, can you afford to trust someone else’s definition of “secure”? Recent news should be enough to give everyone pause. In fact, the need to protect these assets is so paramount to some organizations that they are going so far as to purchase a stake in technology companies to ensure that the company whose technology backs these new models can’t make changes to the technology that would put the business model at risk.

For the Data Science Team, the biggest challenges are physical – it’s just darn difficult to move large volumes of data between an on-premise environment and the public cloud (and when it needs to be done repeatedly, it can get very expensive very quickly).  In the world of data science, data scientists want to work with large sets of very detailed (highly granular) data that can change often.

For example, let’s say that we want to determine (predict) how valuable a customer might be to the organization.  Organizations should be able to easily calculate (if they don’t have data silos) how valuable a customer is today by looking at that customer’s purchase history, returns history, payment history, product margins, frequency of purchases, time sequence of purchases and any costs associated with selling to and servicing that customer.  But let’s say that we are trying to predict how valuable a customer might be to the organization, in which case we might want to bring in other data sources such as:

  • Social media data in order to create an advocacy score and determine their likelihood to recommend. We might also want to mine the social media data (using graph analysis) to determine that customer’s social network and what sort of influence (net promoter score) that customer has on others within their network.
  • Clickstream data in order to determine how often this customer accesses our website to research products, seek support, make purchases or any number of activities on the website.
  • Mobile data in order to determine how often this customer accesses the mobile app, what they are doing when they access the mobile app, and if they are subsequently sharing any information with others via the mobile app and social media channels.
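The “how valuable is a customer today” calculation mentioned above can be sketched in a few lines. This is a deliberately simplified illustration, not a definitive model: the `Purchase` record, the `current_customer_value` function and the formula (margin on non-returned purchases minus cost to serve) are assumptions made for the example, and a real calculation would also weigh payment history, purchase frequency and timing.

```python
from dataclasses import dataclass


@dataclass
class Purchase:
    revenue: float          # amount the customer paid
    margin_rate: float      # product margin as a fraction of revenue
    returned: bool = False  # returned purchases contribute no margin


def current_customer_value(purchases, cost_to_serve):
    """Assumed formula: margin earned on kept purchases minus selling and servicing costs."""
    gross_margin = sum(p.revenue * p.margin_rate
                       for p in purchases if not p.returned)
    return gross_margin - cost_to_serve


# Hypothetical purchase history for one customer.
history = [
    Purchase(revenue=100.0, margin_rate=0.5),
    Purchase(revenue=200.0, margin_rate=0.25),
    Purchase(revenue=50.0, margin_rate=0.5, returned=True),
]

value = current_customer_value(history, cost_to_serve=20.0)
print(value)  # 80.0
```

Everything in this sketch comes from data the organization already owns. Predicting future value is where the social media, clickstream and mobile sources listed above come in, and where the data volumes grow.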

Each of these data sources is quite voluminous. Integrating the organization’s financial and operational data with its social media, clickstream and mobile data requires a significant amount of bandwidth just to move the data into and out of the different data science sandboxes. Exacerbating the problem is the “fail fast / learn faster” mentality of the Data Science process. Suppose your data scientist has just loaded two extremely large data sets into your data lake, only to discover that those data sets provide no appreciable value or insight into the problem at hand, so they just want to delete them. Every one of those movements of data to The Cloud carries an incremental cost.

So how would an organization move this data to the public cloud cost effectively? Amazon’s solution (which seems very economical) is to use their Snowball product. What is Snowball? It’s a large storage device that is shipped to the organization, loaded with data and shipped back: yep, data transfer courtesy of FedEx!

As the Amazon website states: “AWS Import/Export Snowball – Transfer 1 Petabyte Per Week Using Amazon-Owned Storage Appliances.” This is nothing new technologically; it has existed for decades under the name “Sneakernet”. It is known to have very high bandwidth but incredibly long latency. That’s not exactly my idea of how to best support the data science “rapid data ingest / fail fast / learn faster” model development, testing and refinement processes.
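A quick back-of-the-envelope calculation shows why the bandwidth looks impressive while the latency does not. The sketch below takes the “1 Petabyte Per Week” figure from the quote and makes some assumptions that are mine, not Amazon’s: a decimal petabyte (10^15 bytes), a comparison network link of 1 Gbps that is fully utilized, and no protocol overhead.

```python
PETABYTE_BITS = 8 * 10**15         # assumption: decimal petabyte, 10^15 bytes
SECONDS_PER_WEEK = 7 * 24 * 3600
SECONDS_PER_DAY = 24 * 3600


def effective_gbps(bits, seconds):
    """Average throughput if `bits` arrive over `seconds`, in gigabits per second."""
    return bits / seconds / 1e9


def transfer_days(bits, link_gbps):
    """Days needed to push `bits` over a fully utilized link of `link_gbps`."""
    return bits / (link_gbps * 1e9) / SECONDS_PER_DAY


# Shipping 1 PB per week is equivalent to a sustained multi-gigabit link...
snowball_gbps = effective_gbps(PETABYTE_BITS, SECONDS_PER_WEEK)

# ...while the same petabyte over an assumed 1 Gbps link takes about three months.
days_on_1gbps = transfer_days(PETABYTE_BITS, link_gbps=1.0)

print(round(snowball_gbps, 1))  # 13.2
print(round(days_on_1gbps, 1))  # 92.6
```

So the appliance behaves like a link of roughly 13 Gbps on average, but none of the data is usable until the device has been shipped and loaded. That delay is exactly what hurts a fail fast workflow.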


The data lake in the public cloud problem is a physics problem: data movement is still the bitch of our industry. Let’s face it, the real problem with Big Data is that it is big, and big things are hard to move. Given the data science team’s need for “rapid data ingest / fail fast / learn faster” model development, testing and refinement processes, I just don’t see how the public cloud plays in the data lake deployment other than to support one-off, skunk works types of projects that, once any level of value is demonstrated, don’t get moved back to the on-prem cloud.

That said, there is a place for The Cloud in the broader ecosystem. Global deployment or access to the results of the data science is an excellent use case. There are many organizations that are currently doing just that. The data science is all done in house but the applications that interface or deliver the analytics results (recommendations, scores) are being hosted in The Cloud allowing for global access to the results.

Am I missing something?

About Bill Schmarzo

CTO, Dell EMC Services (aka “Dean of Big Data”)

Bill Schmarzo, author of “Big Data: Understanding How Data Powers Big Business” and “Big Data MBA: Driving Business Strategies with Data Science”, is responsible for setting strategy and defining the Big Data service offerings for Dell EMC’s Big Data Practice. As a CTO within Dell EMC’s 2,000+ person consulting organization, he works with organizations to identify where and how to start their big data journeys. He’s written white papers, is an avid blogger and is a frequent speaker on the use of Big Data and data science to power an organization’s key business initiatives. He is a University of San Francisco School of Management (SOM) Executive Fellow where he teaches the “Big Data MBA” course. Bill also just completed a research paper on “Determining The Economic Value of Data”. Onalytica recently ranked Bill as #4 Big Data Influencer worldwide.

Bill has over three decades of experience in data warehousing, BI and analytics. Bill authored the Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements. Bill serves on the City of San Jose’s Technology Innovation Board, and on the faculties of The Data Warehouse Institute and Strata.

Previously, Bill was vice president of Analytics at Yahoo where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of “actionable insights” through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing and sales of their industry-defining analytic applications.

Bill holds a Master of Business Administration from the University of Iowa and a Bachelor of Science degree in Mathematics, Computer Science and Business Administration from Coe College.

12 thoughts on “Data Lake and the Cloud: Pros and Cons of Putting Big Data Analytics in the Public Cloud”

  1. Excellent summary of the terminology, challenges, broad use case scenarios, security implications and compliance issues relating to Cloud and Big Data.

    It would be great to see a follow-up on data and private or hybrid cloud, which can alleviate the data silo concerns and ensure regulatory compliance whilst providing the opportunities presented by big data analytics.

    Dell EMC have multiple reference architectures that enable deployment of any scale private or hybrid cloud with all the necessary software layers that optimise your ability to test, deploy and analyse data swiftly and applications to transform that data analysis into real tangible business value.

    • Thanks David, and good suggestion. I’ll put this on my To Do list (and bring in someone with a lot more cloud experience to help!!).

      Yes, Dell EMC certainly is focused on helping our clients deal with these challenging cloud questions. What I learned from writing this blog was that there are a lot more difficult questions that need to be addressed than I thought.

      Live and learn!

  2. Nice article – Thanks Bill for writing this. In my opinion, there are 2 main issues to be resolved:
    1. Privacy and Security of the Data – Evaluating this will determine IF the data lake “should” be moved onto a different cloud, and WHAT part of the data needs to be moved, whether to a PR, PU or HR cloud; most likely the former. However, nowadays, with HDFS-based data centers typically being equated to a company’s PR cloud, I essentially don’t see a need for data to be moved between DC and PR clouds.
    2. HOW to move – This is where Snowball (or even large SSD-based drives) or P2P software like Binfer, eMule etc. can help. However, beware of malware threats with P2P transfer.

    • Padmakumar, thanks for the additional information. Great when others can share their experiences, especially given the many difficult questions that need to be considered with respect to the cloud. I am certainly not very experienced with the cloud, so this sort of information greatly helps. Thanks!

  3. On the cloud question – there are two ways to do this. Either take data to processing or take processing to the data. For instance if data is stored in AWS, why not run data science programs on AWS itself. AWS provides a lot of ML/DS products.

    • The public cloud data lake is a great option if all of your data is already in the public cloud. But many organizations either have massive data sets (like point-of-sale data) that are too voluminous to move to the public cloud, or highly confidential data (finance data, PII or HIPAA data) that cannot be moved to the public cloud without substantial regulatory and compliance risks and associated fines.

      I think organizations will likely pursue a hybrid cloud strategy where data that is already in the cloud (web clicks, CRM) stays in the cloud and other data sets (like POS, financial and customer loyalty/PII) stay on-prem.

      The challenge comes when the data science team wants to integrate detailed point-of-sales data with customer loyalty card data (both probably on-prem) with CRM data (that may be in the cloud) at the level of the individual consumer. In that case, the data science team may want to bring the CRM data on-prem to accelerate the data transformation, enrichment and analytic modeling processes that are taking place at the individual consumer level (to create consumer propensity scores around next best offer, attrition, advocacy/LTR and fraud).

      If that happens and once that CRM data is on-prem, why also have it in the public cloud?

    • Totally agree, moving large volumes of data is impractical, which is why I suspect that most organizations will keep most of their data in an on-prem cloud and use the public cloud for skunkworks exploration. And grabbing data off of cloud-based systems like Workday and SFDC is something that smart organizations are only going to want to do once.

      All this points to the importance of having a clearly-articulated cloud strategy with cloud governance policies, so that organizations can get the maximum value with appropriate security from a hybrid cloud strategy.