Modernizing Your Data Warehouse Part 2
This is a topic that I wrote about many moons ago, but it is one that still comes up time and time again. It is of particular significance to organizations that already have a data warehouse and business intelligence investment and want to leverage that investment to introduce advanced analytics. When I get the question of how best to do so, I use the chart below (Figure 1) to highlight a high-level architecture that:
- Preserves (and can even strengthen) a data warehouse environment
- Enables the integration of advanced data management and analytics capabilities into the organization
Let’s review the three components of this architecture and provide a “How To” guide for getting there.
Action 1: Upgrade Your EDW
Today’s data warehouse platforms are very expensive. As a result, most organizations limit how much data they keep in their data warehouse, opting for 13 months of summarized data versus 13 years of detailed data. Unfortunately, it’s in that detailed data that many invaluable insights about your customers, products, campaigns, partners and operations can be found.
For example, over the past 13 years, the economy has gone through 2 full economic cycles where it was flying high, collapsed, and then climbed back up only to repeat the cycle all over again. By looking at each of your customers’ product purchase patterns over those 2 economic cycles, you can closely predict when someone is personally impacted by changes in the economy, both positively and negatively. If you’re a grocery chain, you can monitor your customers’ market baskets and uncover changes in their purchase patterns that indicate changes in their economic conditions. You could see a customer move from expensive to lower-cost products (from steak to hamburger), start using more coupons, or increase their purchases of private label products.
At the individual customer level, the grocery chain probably knows more about the economic conditions across the United States than any financial institution or government agency. The grocery chain would be the first to see the shift in buying patterns at the individual level, and could aggregate those buying patterns into higher levels such as geographic/zip+4, demographic, and behavioral customer categories.
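To make the grocery example concrete, here is a minimal sketch of how a shift toward private label products might be flagged for one customer. Everything in it is an assumption for illustration: the basket fields (`total_spend`, `private_label_spend`), the three-basket comparison window, and the 10-point threshold are hypothetical, and a real model would likely combine several signals such as coupon use and product trade-downs.

```python
from statistics import mean

def private_label_share(basket):
    """Fraction of a basket's spend that went to private label products."""
    total = basket["total_spend"]
    return basket["private_label_spend"] / total if total else 0.0

def shift_signal(baskets, recent_n=3, threshold=0.10):
    """Return True if the average private label share of the most recent
    baskets exceeds the earlier baseline by more than `threshold`.

    `baskets` is a chronologically ordered list of dicts (hypothetical schema).
    """
    if len(baskets) <= recent_n:
        return False  # not enough history to form a baseline
    shares = [private_label_share(b) for b in baskets]
    baseline = mean(shares[:-recent_n])
    recent = mean(shares[-recent_n:])
    return (recent - baseline) > threshold

# Made-up history: private label share rises from 10% to 30% of spend.
history = [
    {"total_spend": 100.0, "private_label_spend": 10.0},
    {"total_spend": 100.0, "private_label_spend": 10.0},
    {"total_spend": 100.0, "private_label_spend": 10.0},
    {"total_spend": 100.0, "private_label_spend": 30.0},
    {"total_spend": 100.0, "private_label_spend": 30.0},
    {"total_spend": 100.0, "private_label_spend": 30.0},
]
shifted = shift_signal(history)
```

Note that a comparison like this only works if the detailed, basket-level history is still available; 13 months of summarized data would not support it.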
Organizations can upgrade their data warehouse capabilities by using modern MPP (massively parallel processing), open source, scale-out technologies that can dramatically drive down enterprise data warehouse (EDW) costs, taking advantage of the 20x price advantage these technologies offer over traditional EDW technologies. Plus, you can focus your EDW on doing only EDW things, and off-load non-EDW activities like ETL and advanced analytics to a more appropriate, more cost-effective environment (see Figure 2).
Action 2: Upgrade Your ETL Processes
Doing ETL (Extract, Transform, Load) within your existing data warehouse is common today. However, if your data warehouse is already overloaded and overly expensive, why do that batch-centric, data-management-heavy work in an expensive environment? That’s like having a Tesla haul turnips around the farm.
Instead, optimize your ETL processes by moving ETL off of your expensive EDW platform and doing the ETL/ELT (Extract, Load, Transform) work in an inherently parallel, open source, cost-effective, scale-out environment like Hadoop/MapReduce. Doing the ETL (as well as ELT) within Hadoop allows you to leverage that natively parallel environment to bring the appropriate compute capabilities to bear at the appropriate times, getting the job done more quickly and more cost-effectively.
As we’ve discussed before, using Hadoop for your ETL/ELT work not only makes sense from a cost and processing effectiveness perspective, but also gives you the capability to create new metrics that are difficult to create using traditional ETL tools. For example, new metrics around customer and product performance that leverage frequency (how often), recency (how recently) and sequencing (in what order) can yield new insights that might be better predictors of customer behaviors and product performance.
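As a rough illustration of the frequency, recency and sequencing metrics described above, the sketch below derives all three from raw purchase events. It uses plain Python rather than Hadoop so that it stays self-contained; the event layout, the customer IDs and the `customer_metrics` helper are hypothetical, and at scale the same group-then-aggregate logic would run as a parallel job.

```python
from collections import defaultdict
from datetime import date

# Hypothetical purchase events: (customer_id, purchase_date, product_category)
events = [
    ("c1", date(2024, 1, 5), "steak"),
    ("c1", date(2024, 2, 9), "hamburger"),
    ("c1", date(2024, 3, 1), "hamburger"),
    ("c2", date(2024, 1, 20), "steak"),
]

def customer_metrics(events, as_of):
    """Compute frequency, recency and sequence per customer."""
    # Group step: collect each customer's purchases.
    by_customer = defaultdict(list)
    for customer_id, purchased_on, category in events:
        by_customer[customer_id].append((purchased_on, category))

    # Aggregate step: derive the three metrics from each group.
    metrics = {}
    for customer_id, purchases in by_customer.items():
        purchases.sort()  # chronological order (ties broken by category)
        last_purchase = purchases[-1][0]
        metrics[customer_id] = {
            "frequency": len(purchases),                       # how often
            "recency_days": (as_of - last_purchase).days,      # how recently
            "sequence": [category for _, category in purchases],  # in what order
        }
    return metrics

metrics = customer_metrics(events, as_of=date(2024, 3, 31))
```

The sequence metric is the one that depends most on keeping detailed, ordered event data: it cannot be reconstructed from pre-aggregated totals.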
As a side benefit, this Hadoop front-end data repository can store ALL the organization’s data in a low-cost HDFS environment as-is (without the added burden of pre-defining your data schemas), and then feed both the production enterprise data warehouse environment and high-velocity analytics sandbox as necessary (see Figure 3).
Action 3: Introduce Your Analytics Sandbox
This architecture supports the introduction of a separate analytics environment that off-loads the analytics being done today on your overly expensive EDW. This separate environment also gives the data science team an on-demand, fail-safe environment for quickly addressing immediate business opportunities independent of the EDW’s strict governance and SLA rules.
The analytics environment couldn’t be more different from your data warehouse environment. Your data warehouse environment is a production environment that needs to support the regular (daily, weekly, monthly, quarterly, annual) production of operational and management reports, as well as the dashboards that are used to run the business. To do that, most data warehouse environments have strict SLAs, are heavily governed, and make use of standard tools in order to meet those production requirements while holding down costs.
An analytics environment, on the other hand, is much more ad hoc and on-demand driven. The analytics environment must support continuous exploration of those new data sources that are essential to uncovering actionable insights about your customers, products and operations. It must allow for the development of new data management and analytic tools/techniques that are the most appropriate for your analytics and data science team (see Figure 4).
These are certainly marvelous times to be in the data business. Technology innovations and maturing platforms are providing new architectural and infrastructure options that let organizations extend and enhance their existing data warehouse and ETL investments, while introducing an analytics environment that can off-load analytics work that is poorly suited to the data warehouse. At the same time, your data science team gets an environment where they are free to explore new data sources and data enrichment techniques to tease out new actionable insights about your customers, products, and operations. Watch this space for continuing technology innovations and the impacts that these innovations could have on your data and analytics capabilities.