Don’t Let Hadoop Become Hadoops!
If your organization is just starting its data analytics journey, or if you’re not sure which insights matter most, then this article is probably not for you. If that’s the case, let me recommend some quality time with Bill Schmarzo’s articles such as “Data and Economics 101” and “Determining the Economic Value of Data”.
On the other hand, if your organization is well on its way and you own or operate the infrastructure that serves up the data driving innovation, then read on, because this one is for you. I’ll skip pontificating about the business value of leveraging data to drive innovation; you’ve probably heard that before. Instead, I’m going to focus on ways to help you make the most of your technology investment so that you’re better positioned to face the data challenges ahead.
First, let me assure you that, as the infrastructure leader, Dell EMC feels your pain. We understand that while the amount of data generated doubles every two years, your budget doesn’t. That means finding ever more creative ways to do more with less and stretching every dollar of infrastructure to process and manage increasingly larger volumes of data. It’s a difficult challenge, and it gets harder every year as data growth accelerates in both volume and variety. I have no magic spells or silver bullets to offer, but I can offer a few guiding principles to help inform the tough decisions that lie ahead.
Know What Grows
It’s no secret that there is an explosion of data.
The proof is all around us: everything from jumbo jets to wristwatches is spewing data. The digital universe currently produces over 1.7 megabytes a minute for every person on Earth, and the pace is accelerating. By 2020, IDC projects the world’s stored data will reach 44 zettabytes, or 44 trillion gigabytes! By that time, available storage capacity is projected to hold less than 15% of the data generated. Handling that kind of volume and growth will place enormous strain on IT infrastructure and budgets.
However, the explosion in data growth does not necessarily result in uniform loads on infrastructure. One key to successfully meeting the data challenge is to recognize and adapt to the different growth rates required of your infrastructure. In today’s data-driven economy, storage needs often outstrip demand for compute resources. Customers who simply add more servers to grow their Big Data and analytics capabilities often find themselves with underutilized CPU resources as they increase storage capacity, because Big Data and analytics workloads are typically storage-intensive rather than compute-intensive. Knowing what grows, and at what rate, is key to devising a sustainable long-term technology strategy to support your analytics needs.
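To make the mismatch concrete, here is a minimal sketch of the scaling math. All of the numbers (node capacity, core counts, growth rates) are illustrative assumptions, not figures from any specific product or customer: when storage and compute can only be added together in identical DAS nodes, a storage-driven buildout leaves CPUs increasingly idle.

```python
import math

# Assumed specs for one commodity DAS node (illustrative only).
NODE_STORAGE_TB = 48   # usable storage per node
NODE_CORES = 32        # CPU cores per node

def nodes_for(storage_tb, cores):
    """Nodes needed when compute and storage must scale together."""
    return max(math.ceil(storage_tb / NODE_STORAGE_TB),
               math.ceil(cores / NODE_CORES))

# Year 1: the workload needs 200 TB of storage and 128 cores.
n1 = nodes_for(200, 128)

# Year 3: storage has quadrupled, but compute demand grew only 50%.
n3 = nodes_for(800, 192)

# CPU utilization in year 3: cores actually needed vs. cores bought.
cpu_util_y3 = 192 / (n3 * NODE_CORES)

print(f"Year 1 nodes: {n1}")
print(f"Year 3 nodes: {n3}")
print(f"Year 3 CPU utilization: {cpu_util_y3:.0%}")
```

Under these assumptions the cluster grows from 5 to 17 nodes purely to hold data, while barely a third of the purchased cores are doing useful work. Decoupling compute from storage is one way to break that linkage.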
Know the Technology
In response to the growing data challenge, companies have created a dizzying array of technologies and tools to ingest, stream, analyze, store, predict, slice, dice and peel data. The result is a complex landscape full of choices, and not all choices are created equal. Some are dead ends; others will lock you into a specific vendor; still others will handle the job today but won’t scale effectively for tomorrow.
Sadly, some organizations seem to think that the technology solution to their data challenge is as simple as “Just add Hadoop!” only to realize later that having too much Hadoop, or Hadoop on the wrong infrastructure, can be a problem in itself. Simply throwing hardware and software at analytics challenges is often like throwing gasoline on a fire: it makes an impressive display, just before it burns you. Applying technology to data and analytics problems usually involves a measure of complexity. This is true even for something as seemingly straightforward as adding Hadoop; Hadoop solutions come with multiple technology challenges of their own, as illustrated below.
In addition to the challenges, there are also lots of choices. Should you field the Hortonworks, Cloudera, MapR, or BigInsights Hadoop distribution? Should you deploy compute nodes to bare-metal servers like Dell PowerEdge, use converged infrastructure like Dell EMC Vblock/VxBlock, or use hyper-converged infrastructure such as Dell EMC VxRail and VxRack? Is Direct Attached Storage (DAS) best for your needs, or should you decouple compute from storage and use Isilon scale-out storage for your data lake? These are just a few of the considerations to weigh in crafting analytics solutions that leverage Hadoop, and trust me, elephants aren’t your only worry in this jungle. Understanding the technology and its implications for both the business and IT is vital to success.
Easing the Pain
One effective solution to these challenges is an enterprise class, scale-out storage solution like Dell EMC Isilon. Running Hadoop on Isilon offers several advantages.
In addition to the advantages highlighted above, Isilon also benefits customers by:
- Eliminating the costly overhead of Hadoop NameNode maintenance
- Dramatically reducing the effort associated with handling disk failures
- Helping to manage the rate of analytics storage growth by eliminating Hadoop’s typical 3X data replication
- Reducing the need to move and stage data to make it accessible to Hadoop
For many customers, Dell EMC Isilon represents an optimal balance of scalability, availability and performance while reducing the operational overhead associated with the care and feeding of Hadoop clusters. Granted, there are no silver bullets to magically solve all of the problems resulting from explosive data growth, but applying scale-out storage technology often helps ease the pain.
Picking a Hadoop Partner
Given the range of choices and challenges, it is hard to understand the sweet spots and tradeoffs of the various technologies and to map them to the capabilities your business objectives require. Arriving at an optimal technical solution for today’s analytics needs requires expertise across multiple disciplines: business analysis, data science, data engineering and systems engineering, as well as an understanding of both the software and the hardware that analytics solutions run on. It’s a complicated dance, and you’ll want a partner.
Dell EMC is uniquely positioned to help companies meet the data challenge and successfully navigate their own digital transformations through its unmatched solution portfolio of hardware, software and services. So whether you need help leveraging Hadoop with scale-out storage, offloading work from your existing ETL/EDW infrastructure, fielding an elastic Big Data solution, or assessing where you are today and creating a technical roadmap for the future, we can help.