Welcoming the Data Scientist Into the Family

By Bill Schmarzo, CTO, Dell EMC Services (aka “Dean of Big Data”), May 21, 2012

As I get ready for this week’s Data Scientist Summit in Las Vegas, I have been contemplating the role of the data scientist in this big data world (by the way, there is still time to drop into Las Vegas and join us!). Big data discussions have devoted a considerable amount of time to the new sources of big data (volume, variety, and velocity). And I have certainly spent a considerable amount of time talking about the business ramifications of big data (my Big Data MBA series). However, not enough has been said about the organizational impact of big data, especially with respect to integrating the data scientist into the business and IT organizations.

The organizational ramifications of big data are as important to driving success as either the business or the technology focus.  It’s the third leg of the stool (see Figure 1 below) that ensures that the organization has clearly defined the roles, responsibilities, and expectations of all the key stakeholders in the big data journey.

Figure 1: The Three Legs to Big Data Execution

Defining Roles, Responsibilities and Expectations

In the business intelligence (BI) space, much has already been written about defining the interactions, roles, and responsibilities between the BI and data warehouse (DW) teams and the line of business community. Numerous books, articles, and training courses are available that outline the key approaches, tasks, and deliverables of this process. Users define their requirements, the DW team builds out the data platform, the BI team builds the reports and dashboards, and the users rely on the resulting reports and data to monitor and make decisions about their business.

Now along comes the data scientist with different skills and a different working style. So we need to extend the existing BI/DW best practices by integrating the data scientist to create a new five-step Analytics Lifecycle process (see Figure 2 below).

Figure 2: Analytics Lifecycle

Step 1:  Capture Business Requirements

Everything starts (or at least should start) with the line of business stakeholders.  The first step in the Analytics Lifecycle is for the BI, DW, and data scientist teams to interview and collaborate with the business stakeholders in order to capture their business requirements.  This includes understanding their key business initiatives, business responsibilities and objectives, their key business questions and decisions, and the user experience requirements within their existing work environment.

The process of the BI/DW team interviewing the key business stakeholders to define user requirements is well defined and documented. For example, the “Data Warehouse Lifecycle Toolkit,” written by Ralph Kimball and his team, details best practices (techniques, processes, tools) for gathering user requirements. You can also find articles by Ralph and his team about business requirements best practices on the Kimball website, and these practices are taught at The Data Warehouse Institute (TDWI).

Step 2:  Acquire and Prepare Data

This is the world of the data warehouse and data integration teams. Their responsibility is to take the business stakeholder requirements gathered in Step 1 and begin building the supporting data platform. This requires:

  • Building out the data staging and operational data stores
  • Assembling the necessary internal and external data
  • Cleansing, aligning, normalizing, and enriching the data
  • Building data models that make it easier for the business users to access and understand the data
  • Fine-tuning data models (e.g., aggregate tables, indices, views) to ensure reasonable end-user reporting and dashboard performance
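To make the cleansing, normalizing, and aggregating tasks above a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the sample rows, the field names (`region`, `product`, `amount`), and the helper functions `cleanse` and `build_aggregate` are hypothetical, and a real data warehouse team would do this work with its own ETL tooling rather than a script like this.

```python
from collections import defaultdict

# Hypothetical raw rows landed in a staging area (illustrative data only).
raw_rows = [
    {"region": " West ", "product": "widget", "amount": "100.50"},
    {"region": "west", "product": "Widget", "amount": "49.50"},
    {"region": "EAST", "product": "gadget", "amount": "n/a"},  # unusable amount
    {"region": "East", "product": "Gadget", "amount": "200"},
]


def cleanse(row):
    """Return a normalized copy of the row, or None if the amount is unusable."""
    try:
        amount = float(row["amount"])
    except ValueError:
        return None
    return {
        # Align inconsistent spellings by trimming whitespace and lowercasing.
        "region": row["region"].strip().lower(),
        "product": row["product"].strip().lower(),
        "amount": amount,
    }


def build_aggregate(rows):
    """Pre-compute totals per (region, product), like a small aggregate table."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["region"], row["product"])] += row["amount"]
    return dict(totals)


# Cleanse every staged row and drop the ones that could not be repaired.
clean_rows = [r for r in (cleanse(row) for row in raw_rows) if r is not None]
aggregate = build_aggregate(clean_rows)
```

The aggregate is keyed by the cleaned region and product so that a report can read pre-computed totals instead of rescanning the detail rows, which is the same idea behind the aggregate tables mentioned in the last bullet.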

Step 3:  Build Analytic Models

This is the world of the data scientist, and there are two key interface points that need to be considered in order to weave the data scientist role into the fabric of the Analytics Lifecycle.  First, there needs to be a tight collaboration between the data warehouse team and the data scientist around the following tasks:

  • Leverage the corporation’s data fabric and processes to acquire existing data warehouse data (and acquire it very quickly)
  • Share any existing ETL process for cleaning and aligning traditional internal data sources
  • Leverage existing data fabric tools and capabilities to acquire external and third-party data

Second, the data warehouse team also needs to understand that the data science team is going to be acquiring, massaging, and integrating data from many new data sources into the analytics sandbox. And much of this new data may never find its way into the data warehouse.

There must be some very clearly articulated responsibilities, communications, and most importantly, expectations, about these data sources.  There may be opportunities for the data warehouse team to leverage this data, but it is the responsibility of the data warehouse team to make that decision in a fashion that does not hinder the work that the data science team is doing with the data.

Step 4:  Publish Analytic Insights

After the data scientist team has modeled, verified, and created these valuable business insights, it needs to collaborate with the BI team in implementing a production-type process to publish the resulting analytic insights back into the operational environments. These analytic insights – such as scores, probabilities, and recommendations – may find their way into existing BI reports and dashboards, but are also likely to find their way alongside operational BI into the organization’s customer-facing systems (call center, email, customer support), procurement, manufacturing, supply chain, and financial systems.

Step 5:  Measure Decision Effectiveness

The final step in the Analytics Lifecycle is to ensure that the analytic insights and recommendations are actually effective. We want to create the proper governance rules and guidelines, and properly instrument our business user systems to ensure that we can measure 1) when a recommendation or insight is acted upon by the business users and 2) the effectiveness of that recommendation or insight. This allows the organization to “close the loop” with respect to fine-tuning the analytic models, the Analytics Lifecycle process, and the organization’s decision-making effectiveness.
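As a rough illustration of the two measurements described above, the sketch below logs one event per recommendation and then summarizes how often recommendations were acted upon and how outcomes compare. The `RecommendationEvent` record, its field names, and the sample events are all hypothetical; they stand in for whatever instrumentation an organization actually adds to its business user systems.

```python
from dataclasses import dataclass


@dataclass
class RecommendationEvent:
    """One logged recommendation (hypothetical schema, for illustration)."""
    recommendation_id: str
    acted_upon: bool        # measurement 1: did the business user act on it?
    positive_outcome: bool  # measurement 2: did the desired result follow?


def outcome_rate(events):
    """Share of events with a positive outcome, or None if there are none."""
    if not events:
        return None
    return sum(e.positive_outcome for e in events) / len(events)


def summarize(events):
    """Compute the action rate and the outcome rate for acted vs. ignored."""
    acted = [e for e in events if e.acted_upon]
    ignored = [e for e in events if not e.acted_upon]
    return {
        "action_rate": len(acted) / len(events) if events else None,
        "outcome_rate_acted": outcome_rate(acted),
        "outcome_rate_ignored": outcome_rate(ignored),
    }


# Illustrative events only; these are not real results.
events = [
    RecommendationEvent("r1", True, True),
    RecommendationEvent("r2", True, True),
    RecommendationEvent("r3", True, True),
    RecommendationEvent("r4", True, False),
    RecommendationEvent("r5", False, True),
    RecommendationEvent("r6", False, False),
    RecommendationEvent("r7", False, False),
    RecommendationEvent("r8", False, False),
]
summary = summarize(events)
```

Comparing the two outcome rates gives only a first indication of effectiveness. Users may choose to act on the recommendations that already looked promising, so a controlled test would be needed before crediting the model with the difference.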

Many organizations have had data analysts on staff for many years or even decades, using tools like SAS and SPSS. But in many cases, these data analysts were buried in the bowels of the organization with only a passing interaction with the business community. We need a process – the Analytics Lifecycle – to ensure that the valuable customer, product, operations, and competitive insights that they are uncovering find their way into the business. And we need a process that leverages, not displaces, the careful and well-architected data acquisition, cleansing, enrichment, and data modeling processes that are already being done by the data warehouse and BI teams today. The Analytics Lifecycle is designed to build upon those best practices by integrating the data science team into the fabric of running the business and helping to move the organization towards a real-time, predictive enterprise.

About Bill Schmarzo


CTO, Dell EMC Services (aka “Dean of Big Data”)

Bill Schmarzo, author of “Big Data: Understanding How Data Powers Big Business” and “Big Data MBA: Driving Business Strategies with Data Science”, is responsible for setting the strategy and defining the Big Data service offerings and capabilities for Dell EMC Services Big Data Practice. As the CTO for the Big Data Practice, he is responsible for working with organizations to help them identify where and how to start their big data journeys. He’s written several white papers, is an avid blogger and is a frequent speaker on the use of Big Data and data science to power the organization’s key business initiatives. He is a University of San Francisco School of Management (SOM) Executive Fellow where he teaches the “Big Data MBA” course. Bill was ranked as #15 Big Data Influencer by Onalytica.

Bill has over three decades of experience in data warehousing, BI and analytics. Bill authored Dell EMC’s Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements, and co-authored with Ralph Kimball a series of articles on analytic applications. Bill has served on The Data Warehouse Institute’s faculty as the head of the analytic applications curriculum.

Previously, Bill was the vice president of Analytics at Yahoo where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of “actionable insights” through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing and sales of their industry-defining analytic applications.

Bill holds a master’s degree in Business Administration from the University of Iowa and a Bachelor of Science degree in Mathematics, Computer Science and Business Administration from Coe College.
