
The Data Warehouse Modernization Act

By Bill Schmarzo, CTO, Dell EMC Services (aka “Dean of Big Data”) | August 27, 2012

With all the political discussions about health care acts and environmental acts, I thought the timing was right for an act of our own, the “Data Warehouse Modernization Act[1].”  This act will compel the owners of stodgy, brittle, over-burdened, costly data warehouses to leverage new big data developments to transform their data warehouse into a modern data monetization powerhouse.

I’m seeing many organizations begin serious, senior-management-level discussions about how they can leverage these big data developments to upgrade, or modernize, their data warehouse.  These organizations recognize that this is low-hanging fruit: it is doable immediately with existing skill sets and will show a quick return on investment.

Let’s look at a few areas where implementing the Data Warehouse Modernization Act can deliver immediate wins in modernizing data warehouse environments.

#1: Accelerate Your Data Warehouse with MPP-based Architectures

MPP (Massively Parallel Processing)-based databases, especially software-only options, provide a cost-effective, scale-out data warehouse environment that allows companies to leverage the Moore’s Law[2] improvements in the performance-to-cost ratio of x86 processors.  MPP databases provide a non-intrusive analytical platform/data warehouse for data discovery and exploratory work over massive amounts of data. Built on inexpensive commodity clusters, MPP databases can extend, complement, or even replace parts of your existing data warehouse, managing massive volumes of detailed data while providing agile query, reporting, dashboards, and analytics (see Figure 1).

Figure 1:  Massively Parallel Processing (MPP) Data Warehouse Architecture
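
To make the scale-out idea concrete, here is a minimal sketch of creating and querying a hash-distributed fact table, assuming a Greenplum-style MPP database reachable through Python’s psycopg2 driver; the host, table, and column names are illustrative assumptions rather than a prescribed design.

```python
# Minimal sketch: a hash-distributed fact table on a Greenplum-style MPP
# database, so scans and aggregations fan out across all segment nodes.
# Connection details, table, and column names are illustrative assumptions.
import psycopg2

conn = psycopg2.connect(host="mpp-master.example.com", dbname="edw",
                        user="etl_user", password="secret")
cur = conn.cursor()

# Distribute rows by customer_id so each segment holds a slice of the data;
# queries that group or join on customer_id run in parallel across segments.
cur.execute("""
    CREATE TABLE sales_fact (
        customer_id  BIGINT,
        product_id   BIGINT,
        sale_date    DATE,
        sale_amount  NUMERIC(12,2),
        discount_pct NUMERIC(5,2)
    )
    DISTRIBUTED BY (customer_id);
""")

# A simple aggregation; the MPP planner pushes the work to every segment
# and merges the partial results on the master.
cur.execute("""
    SELECT product_id, SUM(sale_amount) AS revenue
    FROM sales_fact
    GROUP BY product_id;
""")
for product_id, revenue in cur.fetchall():
    print(product_id, revenue)

conn.commit()
cur.close()
conn.close()
```

Distributing on the column most often used for joins and grouping keeps related rows on the same segment, which is what lets a commodity cluster scale out rather than up.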

On the analytics side, once a model has been developed and business insights have been gleaned from these data sets, we can migrate the model and/or the insights into the existing data warehouse for integration into the current business intelligence environment.  Or the analytic modeling can also be done on the MPP platform, making it part of the production processes.
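
As a minimal sketch of that last point, the snippet below applies a model developed in the analytic sandbox as a scoring step inside the warehouse itself; the coefficients, table, and column names are hypothetical placeholders, not a real model.

```python
# Minimal sketch of operationalizing a model inside the warehouse: the
# coefficients below stand in for a model developed in the MPP sandbox, and
# the table and column names are illustrative assumptions.
import psycopg2

INTERCEPT, SLOPE = 42.0, 1.7   # hypothetical coefficients from the sandbox model

conn = psycopg2.connect(host="mpp-master.example.com", dbname="edw",
                        user="etl_user", password="secret")
cur = conn.cursor()

# Score every customer where the data lives; only the scores, not the raw
# detail, need to flow into downstream BI reports and dashboards.
cur.execute("""
    CREATE TABLE customer_scores AS
    SELECT customer_id,
           %s + %s * SUM(sale_amount) AS predicted_lifetime_value
    FROM sales_fact
    GROUP BY customer_id;
""", (INTERCEPT, SLOPE))

conn.commit()
cur.close()
conn.close()
```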

#2: Stop Moving Data to the Analytics; Bring the Analytics to the Data

One of the biggest developments in big data is the advent of in-database analytics.  In-database analytics addresses one of the biggest shortcomings in performing advanced analytics – the requirement to move large amounts of data around.  That requirement has forced many organizations and data scientists to settle for working with aggregate tables, because the data transfer issue is so debilitating to the analytic exploration and discovery process.  In-database analytics reverses the process by moving the analytic algorithms to where the data is stored, accelerating the development and deployment of analytic models.  Eliminating data movement delivers substantial benefits (see the sketch after the list below):

  • Moving a few terabytes of data can take hours; with in-database analytics, that transfer time drops to zero because the data never leaves the database.
  • Because data movement is the most time-consuming part of overall processing time, and because the in-database work is spread across the database’s processing units, total processing time can shrink to roughly 1/N of its original value, where N is the number of processing units.
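
Here is a minimal sketch of the idea, assuming the same Greenplum/PostgreSQL-style warehouse as above: a simple linear relationship is fit entirely inside the database with standard SQL regression aggregates, so only two numbers, rather than the underlying detail rows, travel back to the analyst. The table and column names are illustrative assumptions.

```python
# Minimal sketch of in-database analytics: a simple linear model is fit with
# SQL aggregate functions inside the warehouse, so only two coefficients,
# not terabytes of detail rows, ever leave the database. Table and column
# names are illustrative assumptions.
import psycopg2

conn = psycopg2.connect(host="mpp-master.example.com", dbname="edw",
                        user="analyst", password="secret")
cur = conn.cursor()

# regr_slope/regr_intercept are standard SQL aggregates; the heavy scan runs
# in parallel across the database's processing units, next to the data.
cur.execute("""
    SELECT regr_slope(sale_amount, discount_pct)     AS slope,
           regr_intercept(sale_amount, discount_pct) AS intercept
    FROM sales_fact;
""")
slope, intercept = cur.fetchone()
print("sale_amount ~= %.4f * discount_pct + %.4f" % (slope, intercept))

cur.close()
conn.close()
```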

#3: Use All Your Data with a Next Generation ODS

The Hadoop Distributed File System (HDFS) provides a powerful yet inexpensive option for modernizing Operational Data Store (ODS) and Data Staging areas.  HDFS is a cost-effective large storage system with an intrinsic computing and analytical capability (MapReduce).  Built on commodity clusters, HDFS simplifies the acquisition and storage of diverse data sources, whether structured, semi-structured (web logs, sensor feeds), or unstructured (social media, image, video, audio). Once in the Hadoop/HDFS system, MapReduce and commercial Hadoop-based tools are available to prepare the data for loading into your existing data warehouse.  As I discussed previously in my “Understanding the Role of Hadoop In Your BI Environment” blog, the ability to “define schema on query” versus “define schema on load” simplifies amassing data from a variety of sources, even if you are not sure when and how you might use that data later (see Figure 2).

 Figure 2: Hadoop as Operational Data Store
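
As a minimal sketch of using HDFS as a staging/ODS layer, the Hadoop Streaming mapper and reducer below parse raw web-log lines already landed in HDFS into per-page hit counts that can be bulk-loaded into the warehouse. The log format, field positions, and HDFS paths are illustrative assumptions.

```python
# mapper.py -- minimal Hadoop Streaming sketch: parse raw web-log lines into
# (page, 1) pairs. The log format and field position are illustrative
# assumptions; the schema is applied at processing time, not at load time.
import sys

for line in sys.stdin:
    fields = line.strip().split()
    if len(fields) < 7:
        continue                      # skip malformed lines rather than failing the job
    page = fields[6]                  # requested URL in a common/combined log format
    print("%s\t1" % page)
```

```python
# reducer.py -- sum the counts per page; the output lands back in HDFS as a
# tidy, delimited file ready to be bulk-loaded into the data warehouse.
import sys

current_page, count = None, 0
for line in sys.stdin:
    page, value = line.rstrip("\n").split("\t")
    if page != current_page:
        if current_page is not None:
            print("%s\t%d" % (current_page, count))
        current_page, count = page, 0
    count += int(value)
if current_page is not None:
    print("%s\t%d" % (current_page, count))
```

The job can then be launched with the Hadoop Streaming jar, for example: hadoop jar hadoop-streaming.jar -input /data/weblogs/raw -output /data/weblogs/page_counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the jar location varies by distribution, and the HDFS paths are assumptions).  Because the schema is applied by the mapper at processing time, the same raw files can later be re-parsed a different way for a different question.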

#4: Leverage Unstructured Data to Add New Metrics to your Data Warehouse

An easy way to start building experience with Hadoop and MapReduce is to use these technologies to create new metrics out of one of your unstructured data sources (consumer comments, Facebook) that can be fed into your data warehouse. Think “Moneyball[3]” and the ability to leverage unstructured data sources (social, mobile, consumer comments, emails, doctors’ notes, claims descriptions) to identify new metrics that are better predictors of performance.  Your existing data warehouse is a treasure trove of key performance indicators and metrics used to monitor business performance.  Now with the addition of new data sources like social media, mobile, web logs, and sensor logs, we have the opportunity to use Hadoop and MapReduce to parse through these data sources to identify new business performance metrics that can be integrated into our existing data warehouse (see process in Figure 3).

 Figure 3:  Hadoop/MapReduce Metrics Parsing Process
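
The mapper below is a minimal sketch of that parsing process for consumer comments; the keyword lists are a stand-in for a real sentiment model, and the input format (a product identifier, a tab, then free text) is an illustrative assumption. The summing reducer from the web-log example above can be reused unchanged to roll the scores up into a per-product “net sentiment” metric.

```python
# sentiment_mapper.py -- minimal sketch of deriving a new metric from
# unstructured consumer comments. The keyword lists and the input format
# (product_id<TAB>comment text) are illustrative assumptions, not a real
# sentiment model.
import sys

POSITIVE = {"great", "love", "excellent", "recommend"}
NEGATIVE = {"broken", "terrible", "return", "refund"}

for line in sys.stdin:
    try:
        product_id, comment = line.rstrip("\n").split("\t", 1)
    except ValueError:
        continue
    words = comment.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    # Emit a per-product sentiment contribution; the summing reducer rolls
    # these up into a product-level "net sentiment" metric for the warehouse.
    print("%s\t%d" % (product_id, score))
```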

Once these new metrics are in the data warehouse, they can be used to enhance your existing business intelligence queries, reports, dashboards, and analysis (see Figure 4).

Figure 4: Integrating Social Media Metrics Into Your BI Environment
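
For example, once the Hadoop-derived metric has been loaded into a warehouse table, a query like the minimal sketch below can blend it with existing KPIs; the table and column names are illustrative assumptions.

```python
# Minimal sketch of blending a Hadoop-derived metric with existing warehouse
# KPIs. Assumes the net-sentiment output was loaded into a product_sentiment
# table; table and column names are illustrative assumptions.
import psycopg2

conn = psycopg2.connect(host="mpp-master.example.com", dbname="edw",
                        user="bi_user", password="secret")
cur = conn.cursor()
cur.execute("""
    SELECT f.product_id,
           SUM(f.sale_amount) AS revenue,
           s.net_sentiment    AS social_sentiment
    FROM sales_fact f
    JOIN product_sentiment s ON s.product_id = f.product_id
    GROUP BY f.product_id, s.net_sentiment
    ORDER BY revenue DESC;
""")
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()
```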

Note: this opportunity also places companies in a good position as Hadoop continues its assimilation into the relational database market.  The ability to create metrics and process data on Hadoop, the rapid evolution of tools like HBase and Hive, and the growing number of BI tools that connect directly to HDFS may lead people to question whether they need to move data to a relational database at all.
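
As a small illustration of that direction, the data could be queried where it sits through Hive’s command-line interface, assuming the page_counts output above has been registered as a Hive table; the table name and query are assumptions.

```python
# Minimal sketch of querying Hadoop-resident data directly through Hive's
# command-line interface, without first moving it into a relational database.
# Assumes the page_counts output was registered as an external Hive table.
import subprocess

query = "SELECT page, hits FROM page_counts ORDER BY hits DESC LIMIT 10;"
result = subprocess.check_output(["hive", "-e", query])
print(result.decode("utf-8"))
```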

Modernize Your Data Warehouse Today

In the world of revolutionary, game-changing big data developments, data warehouse modernization may sound like a merely evolutionary step.  But it is something that can be executed today, with existing data warehouse skills, and it represents a simple first step toward gleaning immediate business value and organizational agility from big data technologies.  What are you waiting for?

By the way, don’t forget to register for my upcoming webcast “Analyze This! Best Practices For Big And Fast Data” as we discuss one of the most important trends in data management – the emergence of big and fast data.  See you there!


[1] Special thanks to Dr. Pedro Desouza for his insights and hands-on experience in crafting this blog

[2] Moore’s law is the observation that over the history of computing hardware, the number of transistors on integrated circuits doubles approximately every two years. The result is the doubling of computing power at the same cost every 18 to 24 months.  http://en.wikipedia.org/wiki/Moore%27s_law

[3] “Moneyball: The Art of Winning an Unfair Game” is a book by Michael Lewis, published in 2003, about the Oakland Athletics baseball team and its general manager Billy Beane. Its focus is the team’s analytical, evidence-based, sabermetrics approach to assembling a competitive baseball team, despite Oakland’s disadvantaged revenue situation. http://en.wikipedia.org/wiki/Moneyball

About Bill Schmarzo


CTO, Dell EMC Services (aka “Dean of Big Data”)

Bill Schmarzo, author of “Big Data: Understanding How Data Powers Big Business” and “Big Data MBA: Driving Business Strategies with Data Science”, is responsible for setting strategy and defining the Big Data service offerings for Dell EMC’s Big Data Practice. As a CTO within Dell EMC’s 2,000+ person consulting organization, he works with organizations to identify where and how to start their big data journeys. He’s written white papers, is an avid blogger and is a frequent speaker on the use of Big Data and data science to power an organization’s key business initiatives. He is a University of San Francisco School of Management (SOM) Executive Fellow where he teaches the “Big Data MBA” course. Bill also just completed a research paper on “Determining The Economic Value of Data”. Onalytica recently ranked Bill as #4 Big Data Influencer worldwide.

Bill has over three decades of experience in data warehousing, BI and analytics. Bill authored the Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements. Bill serves on the City of San Jose’s Technology Innovation Board, and on the faculties of The Data Warehouse Institute and Strata.

Previously, Bill was vice president of Analytics at Yahoo where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of “actionable insights” through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing and sales of their industry-defining analytic applications.

Bill holds a Master of Business Administration from the University of Iowa and a Bachelor of Science degree in Mathematics, Computer Science and Business Administration from Coe College.


Comments on “The Data Warehouse Modernization Act”

  1. We are already seeing and using Hadoop as an ODS. I will suggest one addition to Figure 2, if I may: data from the analytical sandbox going into the EDW.
    As I have noticed, we have data scientists and analysts using tools like R or other desktop or server/desktop tools for their analysis. So it would be best to store the results or formulas back in the EDW for reporting purposes as well as for historical evaluation of these formulas.

    Thanks,
    Milind

  2. Milind, thanks for your comment. Quick question: do you take the analytic results directly from the Analytic Sandbox into the EDW or are you running the analytic results through any ETL or MDM processes before pushing them into the EDW?