Welcoming the Data Scientist Into the Family
As I get ready for this week’s Data Scientist Summit in Las Vegas, I have been contemplating the role of the data scientist in this big data world (by the way, there is still time to drop into Las Vegas and join us!). Big data discussions have devoted considerable time to the new sources of big data (volume, variety, and velocity). And I have certainly spent a considerable amount of time talking about the business ramifications of big data (my Big Data MBA series). However, not enough has been said about the organizational impact of big data, especially with respect to integrating the data scientist into the business and IT organizations.
The organizational ramifications of big data are as important to driving success as either the business or the technology focus. It’s the third leg of the stool (see Figure 1 below) that ensures that the organization has clearly defined the roles, responsibilities, and expectations of all the key stakeholders in the big data journey.
Figure 1: The Three Legs to Big Data Execution
Defining Roles, Responsibilities and Expectations
In the business intelligence (BI) space, much has already been written about defining the interactions, roles, and responsibilities between the BI and data warehouse (DW) teams and the line of business community. Numerous books, articles, and training courses are available that outline the key approaches, tasks, and deliverables of this process. Users define their requirements, the DW team builds out the data platform, the BI team builds the reports and the dashboards, and the users use the resulting reports, dashboards, and data to monitor and make decisions about their business.
Now along comes the data scientist with different skills and a different working style. So we need to extend the existing BI/DW best practices by integrating the data scientist to create a new five-step Analytics Lifecycle process (see Figure 2 below).
Figure 2: Analytics Lifecycle
Step 1: Capture Business Requirements
Everything starts (or at least should start) with the line of business stakeholders. The first step in the Analytics Lifecycle is for the BI, DW, and data scientist teams to interview and collaborate with the business stakeholders in order to capture their business requirements. This includes understanding their key business initiatives, business responsibilities and objectives, their key business questions and decisions, and the user experience requirements within their existing work environment.
The process of the BI/DW team interviewing the key business stakeholders to define user requirements is well defined and documented. For example, the “Data Warehouse Lifecycle Toolkit,” written by Ralph Kimball and his team, details best practices (techniques, processes, tools) in gathering user requirements. You can also find articles written by Ralph and team about business requirements best practices on the Kimball website, and the topic is taught at The Data Warehouse Institute (TDWI).
Step 2: Acquire and Prepare Data
This is the world of the data warehouse and data integration teams. Their responsibilities are to take the business stakeholder requirements gathered in Step 1 and begin building the supporting data platform. This requires:
- Building out the data staging and operational data stores
- Assembling the necessary internal and external data
- Cleansing, aligning, normalizing, and enriching the data
- Building data models that make it easier for the business users to access and understand the data
- Fine-tuning data models (e.g., aggregate tables, indices, views) to ensure reasonable end-user reporting and dashboard performance
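To make the tasks above a little more concrete, here is a minimal Python sketch of two of them: cleansing and normalizing raw records, and pre-computing a small aggregate table for reporting. The record layout, the field names (`store`, `product`, `amount`), and the helper names are all hypothetical and chosen only for illustration; a real data platform would typically do this work in an ETL tool or in the database itself.

```python
from collections import defaultdict

# Hypothetical raw sales records, as they might arrive in a staging area.
raw_sales = [
    {"store": " NYC ", "product": "widget", "amount": "19.99"},
    {"store": "nyc", "product": "Widget", "amount": "5.01"},
    {"store": "SF", "product": "gadget", "amount": "n/a"},  # unusable amount
    {"store": "sf", "product": "Gadget", "amount": "42.50"},
]


def cleanse(record):
    """Normalize text fields and parse the amount; return None for unusable rows."""
    try:
        amount = float(record["amount"])
    except ValueError:
        return None
    return {
        "store": record["store"].strip().upper(),
        "product": record["product"].strip().lower(),
        "amount": amount,
    }


def build_aggregate(records):
    """Pre-compute total sales per (store, product), a simple aggregate table."""
    totals = defaultdict(float)
    for rec in records:
        totals[(rec["store"], rec["product"])] += rec["amount"]
    # Round to cents so floating-point noise does not leak into reports.
    return {key: round(total, 2) for key, total in totals.items()}


clean_sales = [row for row in map(cleanse, raw_sales) if row is not None]
sales_by_store_product = build_aggregate(clean_sales)
```

The point of the aggregate is the same as in the last bullet: reports and dashboards read the small pre-computed table instead of scanning every detail row.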
Step 3: Build Analytic Models
This is the world of the data scientist, and there are two key interface points that need to be considered in order to weave the data scientist role into the fabric of the Analytics Lifecycle. First, there needs to be a tight collaboration between the data warehouse team and the data scientist around the following tasks:
- Leverage the corporation’s data fabric and processes to acquire existing data warehouse data (and acquire it very quickly)
- Share any existing ETL process for cleaning and aligning traditional internal data sources
- Leverage existing data fabric tools and capabilities to acquire external and third-party data
Second, the data warehouse team also needs to understand that the data science team is going to be acquiring, massaging, and integrating data from many new data sources into the analytics sandbox. And much of this new data may never find its way into the data warehouse.
There must be some very clearly articulated responsibilities, communications, and most importantly, expectations, about these data sources. There may be opportunities for the data warehouse team to leverage this data, but it is the responsibility of the data warehouse team to make that decision in a fashion that does not hinder the work that the data science team is doing with the data.
Step 4: Publish Analytic Insights
After the data scientist team has modeled, verified, and created these valuable business insights, it needs to collaborate with the BI team in implementing a production-type process to publish the resulting analytic insights back into the operational environments. These analytic insights (such as scores, probabilities, and recommendations) may find their way into existing BI reports and dashboards, but are also likely to find their way, alongside operational BI, into the organization’s customer-facing systems (call center, email, customer support), procurement, manufacturing, supply chain, and financial systems.
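One way to picture this publishing step is as a shared table that the data science team writes scores into and that operational systems read from. The sketch below uses Python's built-in `sqlite3` module with an in-memory database purely for illustration. The table name `customer_scores`, its columns, and both function names are assumptions rather than part of any particular product, and a production process would add scheduling, versioning, and access control.

```python
import sqlite3


def publish_scores(conn, scores):
    """Write (customer_id, churn_probability, recommendation) rows to a shared table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_scores ("
        "customer_id TEXT PRIMARY KEY, "
        "churn_probability REAL, "
        "recommendation TEXT)"
    )
    # INSERT OR REPLACE lets a re-run of the model overwrite older scores.
    conn.executemany(
        "INSERT OR REPLACE INTO customer_scores VALUES (?, ?, ?)", scores
    )
    conn.commit()


def lookup_recommendation(conn, customer_id):
    """What an operational system (e.g. a call center screen) might call."""
    return conn.execute(
        "SELECT churn_probability, recommendation "
        "FROM customer_scores WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()


conn = sqlite3.connect(":memory:")
publish_scores(
    conn,
    [
        ("c-001", 0.82, "offer retention discount"),
        ("c-002", 0.10, "no action"),
    ],
)
```

Here `lookup_recommendation(conn, "c-001")` returns the stored probability and recommendation, while an unknown customer yields `None`, which the calling system has to handle.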
Step 5: Measure Decision Effectiveness
The final step in the Analytics Lifecycle is to ensure that the analytic insights and recommendations are actually effective. We want to create the proper governance rules and guidelines, and properly instrument our business user systems to ensure that we can measure 1) when a recommendation or insight is acted upon by the business users and 2) the effectiveness of that recommendation or insight. This allows the organization to “close the loop” with respect to fine-tuning the analytic models, the Analytics Lifecycle process, and the organization’s decision-making effectiveness.
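The two measurements described above can be sketched in a few lines of Python, assuming the business systems log one event per recommendation. The `RecommendationEvent` fields and the metric names are hypothetical. Comparing outcomes between acted-upon and ignored recommendations is only a rough indicator, since the two groups may differ in other ways.

```python
from dataclasses import dataclass


@dataclass
class RecommendationEvent:
    recommendation_id: str
    acted_upon: bool        # did a business user follow the recommendation?
    outcome_positive: bool  # did the desired business outcome occur?


def success_rate(group):
    """Share of events with a positive outcome, or None for an empty group."""
    if not group:
        return None
    return sum(e.outcome_positive for e in group) / len(group)


def decision_effectiveness(events):
    """Summarize 1) how often recommendations are acted upon and 2) how well they work."""
    acted = [e for e in events if e.acted_upon]
    ignored = [e for e in events if not e.acted_upon]
    return {
        "adoption_rate": len(acted) / len(events) if events else None,
        "success_rate_when_acted": success_rate(acted),
        "success_rate_when_ignored": success_rate(ignored),
    }


events = [
    RecommendationEvent("r1", acted_upon=True, outcome_positive=True),
    RecommendationEvent("r2", acted_upon=True, outcome_positive=False),
    RecommendationEvent("r3", acted_upon=False, outcome_positive=False),
    RecommendationEvent("r4", acted_upon=True, outcome_positive=True),
]
metrics = decision_effectiveness(events)
```

Feeding metrics like these back to the data science team is one simple way to close the loop on model tuning.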
Many organizations have had data analysts on staff for many years or even decades, using tools like SAS and SPSS. But in many cases, these data analysts were buried in the bowels of the organization with only a passing interaction with the business community. We need a process, the Analytics Lifecycle, to ensure that the valuable customer, product, operations, and competitive insights that they are uncovering find their way into the business. And we need a process that leverages, not displaces, the careful and well-architected data acquisition, cleansing, enrichment, and data modeling processes that are already being done by the data warehouse and BI teams today. The Analytics Lifecycle is designed to build upon those best practices by integrating the data science team into the fabric of running the business and helping to move the organization towards a real-time, predictive enterprise.