Understanding the Role of Hadoop In Your BI Environment
In my discussions about the assault of big data technologies on the traditional Business Intelligence (BI) and data warehouse (DW) markets, sometimes it is hard to identify specific use cases where a big data technology – like Hadoop – can help your BI environment. Many of these big data use cases are really still being developed, but here is one that I’m starting to see more and more – using Hadoop to replace or augment your existing ETL, data staging, and operational data store (ODS) architecture. And as is typical when I get into some technically complex topics, I want to thank Dr. Pedro DeSouza for his help and insights in pulling together this blog.
Stage 1: Hadoop for Data Staging and ODS
I am seeing several client engagements where the BI/DW team is looking to use Hadoop to simplify, accelerate, and enhance their existing ETL and data staging processes. Hadoop brings at least two significant advantages to your ETL and data staging processes. The first is the ability to ingest massive amounts of data as-is, meaning you do not need to pre-define the data schema before loading data into Hadoop. This covers not only traditional transactional data (e.g., point-of-sale transactions, call detail records, general ledger transactions, call center transactions), but also unstructured internal data (like consumer comments, doctor’s notes, insurance claims descriptions, and web logs) and external social media data (from sites such as LinkedIn, Pinterest, Facebook, and Twitter). So regardless of the structure of your incoming data, you can rapidly load it all into Hadoop, as-is, where it then becomes available for your downstream ETL, DW, and analytic processes (see Stage 1 in Figure 1 below).
Figure 1: Hadoop as ODS/Data Stage
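To make the "load as-is" idea concrete, here is a minimal, purely illustrative Python sketch of schema-on-read: raw records of mixed shapes are stored untouched, and a schema is applied only when the data is consumed. This is a toy stand-in for what HDFS gives you at scale, not Hadoop code; all names are hypothetical.

```python
import json

# Raw feeds of differing shapes, loaded "as-is" -- no schema enforced at write time.
raw_store = []

def ingest(record_line):
    """Append the raw record untouched, schema-on-read style."""
    raw_store.append(record_line)

def read_as(parser):
    """Apply a schema/parser only when the data is consumed."""
    for line in raw_store:
        try:
            yield parser(line)
        except (ValueError, KeyError):
            # Malformed records are skipped at read time, not rejected at load time.
            continue

# Mixed inputs: a POS transaction (JSON) and a free-text consumer comment.
ingest('{"txn_id": 101, "amount": 19.99}')
ingest('Great product, shipping was slow')

# Read only the records that fit the "transaction" schema.
txns = list(read_as(lambda l: json.loads(l)["txn_id"]))
print(txns)  # [101]
```

The key design point is that nothing is rejected on the way in; the free-text comment sits alongside the transaction until some downstream consumer decides how to interpret it.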
The second advantage that Hadoop brings to your BI/DW architecture occurs once the data is in the Hadoop environment. Once it’s in your Hadoop ODS, you can leverage the inherently parallel nature of Hadoop to perform your traditional ETL work of cleansing, normalizing, aligning, and creating aggregates for your EDW at massive scale.
Many ETL vendors, like Pentaho, Talend, and Datameer, are modifying their products to seamlessly create parallel MapReduce ETL jobs. These vendors provide drag-and-drop UIs for generating ETL MapReduce jobs, removing a great deal of complexity for the data integrator/developer.
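The shape of the jobs those tools generate can be sketched in plain Python. The following toy ETL pass cleanses dirty sale records in a map phase, groups by key (the shuffle Hadoop performs between phases), and aggregates in a reduce phase. The record layout and field names are invented for illustration; a real job would run these phases in parallel across the cluster.

```python
from collections import defaultdict

raw_sales = [
    "store_7, 19.99 ",   # trailing whitespace to cleanse
    "STORE_7,5.00",      # inconsistent casing to normalize
    "store_9,12.50",
    "store_9,bad_value", # dirty record to drop
]

def map_phase(line):
    """Cleanse and normalize one record; emit (key, value) or nothing."""
    store, _, amount = line.strip().lower().partition(",")
    try:
        yield store.strip(), float(amount)
    except ValueError:
        return  # drop unparseable records

def reduce_phase(key, values):
    """Aggregate all values for one key."""
    return key, round(sum(values), 2)

# Shuffle: group mapper output by key (Hadoop does this between phases).
groups = defaultdict(list)
for line in raw_sales:
    for key, value in map_phase(line):
        groups[key].append(value)

totals = dict(reduce_phase(k, v) for k, v in groups.items())
print(totals)  # {'store_7': 24.99, 'store_9': 12.5}
```

Because each mapper sees only its own records and each reducer only its own key, the same logic scales out across nodes with no change to the transformation code itself.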
Once the raw data is in the Hadoop environment, developers can perform data transformations and enrichments that were not previously easy to do, including:
- Parse complex, unstructured data feeds (like consumer comments, web logs, and Twitter feeds) to capture key data and metrics of importance (e.g., visitor id, session id, site id, display ad id, display ad location) that can then be integrated with the existing structured data in your EDW. For example, imagine the business possibilities if you were able to mine the wealth of social media data on your customers to identify their interests, passions, associations, and affiliations, and then integrate those new customer insights with the existing customer data that is meticulously maintained in your CRM system.
- Create advanced composite metrics that require you to process many days, weeks, or even months of history. I covered some of these composite metrics in a previous blog; creating metrics around frequency, recency, and sequencing is possible in a Hadoop environment in a way that is not easily done within your traditional ETL environment. These composite metrics facilitate more detailed and complex analysis, yielding key performance indicators that might be better indicators and/or predictors of performance.
Stage 2: Hadoop for Feeding Your DW
After doing all of this parsing of unstructured data for new customer and business metrics and data enrichment to create new composite metrics, the output of the Hadoop ODS can then feed your standard EDW. The Hadoop ODS has the advantage of being able to create structure out of unstructured data, which can then be integrated with your existing structured, transactional data in the EDW (see Stage 2 in Figure 2).
Figure 2: Hadoop For Feeding Your DW
And because of the massive scalability of Hadoop and the ability to ingest massive amounts of data quickly, you can significantly accelerate your traditional ETL processes and more easily fulfill your EDW SLAs. Not only can you shrink the latency between when the data transaction or event occurs and when it’s available in your EDW, but you can also provide more granular, detailed data in your EDW, especially if your EDW is based on an MPP architecture (which reduces the need for indices, aggregate tables, and materialized views, thereby saving even more EDW loading time and management effort).
Stage 3: Hadoop for Feeding Your Analytic Sandbox
The third stage of Hadoop as your ODS supports the self-provisioning and the rapid, frequent iterations typically required of your analytic sandbox environment. In this environment, the data scientists can grab whatever data they need out of the Hadoop ODS without worrying about impacting the EDW environment. They can select whatever level of granularity they need from whichever data sources they need in order to build, test, refine, and publish their analytic results (see Stage 3 in Figure 3).
Figure 3: Hadoop For Feeding Your Analytic Sandbox
With the Hadoop ODS, data scientists have the ability to store and access data they “might” need from an analytics perspective; data that may never find its way into the EDW. For example, the data scientists might want to store large amounts of social media or web log data, or large varieties of widely available third-party data (from places like data.gov) to enhance analytic modeling. The data scientists would grab this data out of the Hadoop ODS when and as they needed it, depending upon their current analytic needs.
Business Example of the Power of Hadoop as an ODS
Let’s walk through an example of how this Hadoop ODS environment can support the types of on-demand analytics that the EDW is so poorly suited to support.
Recently there were floods in Thailand that had a major impact on the supply chains of many manufacturers. In one case, the CFO asked the BI team to come up with an estimate of the impact on end-of-quarter Earnings Per Share (EPS). To make matters worse, the BI team only had three weeks to complete this assessment. This is the perfect type of analytics request for an analytic sandbox because:
- The analytics team is going to need to get access to lots of different data sources rapidly – both detailed structured data as well as some unstructured web log and media feed data. The data requests could change rapidly (e.g., adding third-party market data, local economic data, and weather data) as the analysts proceed through the analysis process.
- Much of the data being used for this analysis will never find its way into the DW, but needs to be housed in an environment that stores the data as-is and can be rapidly accessed and integrated into the analysts’ processes just in case they need it.
- They need to come up with an estimate – a range of probabilities – and not a precise answer.
- This is likely a one-off request, so it’s not critical to build a repeatable process for this analytic request.
In this example, the analytics team asked not only for their own supply chain data, but also for the supply chain data of their top 25 suppliers (which represented about 90% of their components, and 100% of their Tier 1 components). They also pulled in customer, inventory, orders, sales pipeline, and manufacturing data from their internal systems, as well as the manufacturing bill of materials from the internal manufacturing system. This data was then augmented with third-party data on weather forecasts, traffic and road conditions, and local support and recovery information. All of this data could be stored in the Hadoop ODS, waiting for whenever the data science team might need it.
The analytics team was able to use this data to create models that answered the following questions:
- Are there forecasted weather problems or local recovery and traffic issues that will exacerbate the supply chain problem?
- Which of our products are most impacted by the shortfalls (note: this required looking into their suppliers’ supply chains to identify common subcomponent suppliers across their entire supplier network)?
- Which of our top customers are most impacted by the component shortages?
- What are the top sales opportunities in the sales pipeline based upon margin impact?
- How best do we allocate our existing products to our most important customers and deals?
- What is the likely impact on Earnings Per Share with a 95% confidence level?
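The last question illustrates why a range, not a point estimate, is the right deliverable. A hypothetical sketch of how such a range might be produced is a simple Monte Carlo simulation over assumed shortfall and margin distributions; every number below is made up for illustration and is not from the actual engagement.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

def simulate_eps_impact(n_trials=10_000, shares_outstanding=50_000_000):
    """Estimate EPS impact as an empirical 95% range, not a point value."""
    impacts = []
    for _ in range(n_trials):
        # Assumed uncertain inputs: units lost to the shortage, margin per unit.
        units_lost = random.gauss(mu=120_000, sigma=25_000)
        margin_per_unit = random.gauss(mu=80.0, sigma=12.0)
        impacts.append(max(units_lost, 0) * margin_per_unit / shares_outstanding)
    impacts.sort()
    # Empirical 95% interval: the 2.5th and 97.5th percentiles.
    lo = impacts[int(0.025 * n_trials)]
    hi = impacts[int(0.975 * n_trials)]
    return lo, statistics.mean(impacts), hi

low, mean, high = simulate_eps_impact()
print(f"EPS impact ~ ${mean:.2f}, 95% range ${low:.2f} to ${high:.2f}")
```

Widening or narrowing the input distributions (e.g., as better supplier data arrives in the sandbox) directly widens or narrows the reported range, which is exactly the kind of rapid iteration the sandbox is for.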
The result was that the analytics team was able to not only give the CFO the estimated EPS impact (at a 95% confidence level range) but was also able to provide insights into how to optimize existing inventory allocation to minimize the EPS impact.
By the way, once this analysis was done, the analytics team packaged up the analysis and archived it. While they were unlikely to be doing this type of analysis again anytime soon, they needed to archive the analysis for compliance and possibly legal reasons.
As you can see, using Hadoop as your ODS and data staging area can deliver new, compelling business benefits (not to mention that it’s probably significantly cheaper and more agile than your existing ETL environment). And I expect that we’re going to start seeing more use cases for Hadoop within our BI and DW environments. The cost benefits, the processing power, the access to lower-latency data, and the overall simpler architecture are just too compelling to ignore.