Bill’s Most Excellent Data Scientist Adventure

By Bill Schmarzo, CTO, Dell EMC Services (aka “Dean of Big Data”), February 14, 2012

This past week I had a chance to attend EMC Education Services’ new class titled “Data Science and Big Data Analytics.” The course provides a hands-on practitioner’s approach to the techniques and tools required for Big Data Analytics. My primary interest in the course was to see if it was really possible to transition an old school BI and data warehouse guy (me) into this new world of Data Science. After attending the course, my assessment is that the projected worldwide shortage of data scientists can be filled with the army of BI and data warehouse professionals, but it’s going to take work, effort and courses like this to help fill that gap.

So first about the class.  This was the first GA release of the course and I have to say that the course was most excellent!  The material was well organized with a nice mix of classroom lectures and hands-on workshops.  The instructors were knowledgeable, well-prepared and very engaging.  The workshops were very well laid out, providing “guard rails” for each of the assignments that allowed the students to quickly move through the exercise while still providing opportunities for exploration with the tools, techniques and case studies.  We had about 20 students in the class from a variety of backgrounds and experiences, which added to the richness of the class and the classroom interactions. If I had a complaint, it would be that the course was so popular that the classroom felt a bit crowded at times. I also needed a bigger display to have the 4 or 5 tools I was learning open at the same time!  All in all, a four-star rating!!

Business Intelligence Versus Data Analytics

The graphic below is a pretty common way to think about the worlds of business intelligence (BI) and data science (predictive analytics). Let me first say that to think of these worlds in isolation from each other is a big mistake. Business intelligence is typically thought of as being retrospective, a rearview mirror view of the business, focusing on what happened (hindsight) and what is happening (insight). Predictive analytics is typically thought of as being forward thinking, a windshield view of the business, focusing on what is going to happen (foresight). However, many BI implementations do include time series analysis and what-if modeling in order to help the business make forward-looking decisions (e.g., what price should I charge, what customers should I target, how many clerks am I going to need).
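To make the rearview mirror versus windshield distinction concrete, here is a minimal Python sketch. It is not from the course: the monthly sales figures are invented, the function names are hypothetical, and the straight-line trend is only one very simple stand-in for a predictive model.

```python
from statistics import mean

# Hypothetical monthly unit sales (made-up numbers for illustration only).
monthly_sales = [100, 110, 120, 130, 140, 150]


def hindsight_total(sales):
    """BI-style question: what happened? Add up the history."""
    return sum(sales)


def foresight_next(sales):
    """Predictive-style question: what is likely to happen next?

    Fits an ordinary least-squares line through (month index, sales)
    and extrapolates it one month past the end of the history.
    """
    n = len(sales)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(sales)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, sales))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


print(hindsight_total(monthly_sales))  # 750
print(foresight_next(monthly_sales))   # 160.0
```

The first function only summarizes data that already exists, while the second uses the same data to estimate a value that has not been observed yet. A real forecast would also need to account for seasonality and uncertainty, which this sketch ignores.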

One of the biggest differences between the BI analyst and data scientist is the environment in which they work. BI specialists tend to work within a highly structured, data warehouse environment.  It takes a yeoman’s effort to add a new data source (often this effort is measured in months) or get the approval to keep more granular data and/or more history in the data warehouse.

However, the data scientist has historically created a separate “sandbox” in which to load whatever data they can get their hands on (both internal and external data sources). Once within this environment, the data scientist is free to do with it whatever they wish (e.g., data profiling, data transformations, creating new composite metrics, and model development, testing and refinement).
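As a rough illustration of that sandbox work, the Python sketch below profiles a field and then derives a composite metric. Everything in it is an assumption made for the example: the customer records, the field names, and the revenue-per-visit metric are invented and do not come from the course.

```python
# Hypothetical customer records as they might land in a sandbox (values invented).
records = [
    {"customer": "A", "revenue": 1200.0, "visits": 10},
    {"customer": "B", "revenue": 300.0, "visits": 6},
    {"customer": "C", "revenue": None, "visits": 4},  # missing value to be profiled
]


def profile(rows, field):
    """Data profiling: count missing values and report min/max of the rest."""
    present = [r[field] for r in rows if r[field] is not None]
    return {
        "missing": len(rows) - len(present),
        "min": min(present),
        "max": max(present),
    }


def add_revenue_per_visit(rows):
    """Transformation: derive a composite metric on a copy of each row.

    Rows with missing revenue or zero visits get None instead of a number.
    """
    enriched = []
    for r in rows:
        if r["revenue"] is None or r["visits"] == 0:
            rpv = None
        else:
            rpv = r["revenue"] / r["visits"]
        enriched.append({**r, "revenue_per_visit": rpv})
    return enriched


print(profile(records, "revenue"))
for row in add_revenue_per_visit(records):
    print(row["customer"], row["revenue_per_visit"])
```

The point is the freedom to iterate: in a sandbox, adding a new derived metric is a few lines of code rather than a change request against the data warehouse schema.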

Data Analytics Lifecycle

Let’s start with the Data Analytics Lifecycle (see chart below) to gain an understanding of how a data scientist works.

This chart outlines the data scientist’s discovery and analysis process and its key work steps. It also highlights the highly iterative nature of the data scientist’s work. Let’s take a look at the specific tasks and skills required for each of the Data Analytics Lifecycle steps and see how the typical BI analyst’s skills map to that step.

In summary, to make the transition from BI specialist to data scientist is going to require the following new skills and capabilities:

  • Deep dive into the multitude of statistical and predictive analytics models.  Without a doubt, you’re going to have to get out your college statistics and advanced statistics books and spend time learning how and when to apply the right analytic models given the business situation.
  • Learning new analytic tools like R, SAS and MADlib.  R, for example, is an open source product for which lots of tools (like RStudio) and training are available free online.
  • Learning more about Hadoop and related Hadoop products like HBase, Hive and Pig.  There is no doubt that Hadoop is here to stay, and there will be a multitude of opportunities to use Hadoop in the data preparation stage.  It’s the perfect environment for adding structure to unstructured data, performing advanced data transformations and enrichments, and profiling and cleansing your data coming from a multitude of data sources.

In my next blog I’m going to share some training and tool recommendations. In the meantime, give serious consideration to attending EMC’s new Data Science and Big Data Analytics training class (http://education.emc.com/guest/campaign/data_science.aspx). It’s well worth the time investment.

Finally, for those of you coming to Strata in Santa Clara at the end of the month, please stop by and see me. I’ll be presenting on Tuesday as part of the Jumpstart track and holding office hours on Thursday, or you can find me in EMC booth 201. I’d love to hear your thoughts and observations on this rapidly expanding world of data science.

About Bill Schmarzo


CTO, Dell EMC Services (aka “Dean of Big Data”)

Bill Schmarzo, author of “Big Data: Understanding How Data Powers Big Business” and “Big Data MBA: Driving Business Strategies with Data Science”, is responsible for setting strategy and defining the Big Data service offerings for Dell EMC’s Big Data Practice. As a CTO within Dell EMC’s 2,000+ person consulting organization, he works with organizations to identify where and how to start their big data journeys. He’s written white papers, is an avid blogger and is a frequent speaker on the use of Big Data and data science to power an organization’s key business initiatives. He is a University of San Francisco School of Management (SOM) Executive Fellow where he teaches the “Big Data MBA” course. Bill also just completed a research paper on “Determining The Economic Value of Data”. Onalytica recently ranked Bill as #4 Big Data Influencer worldwide.

Bill has over three decades of experience in data warehousing, BI and analytics. Bill authored the Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements. Bill serves on the City of San Jose’s Technology Innovation Board, and on the faculties of The Data Warehouse Institute and Strata.

Previously, Bill was vice president of Analytics at Yahoo where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of “actionable insights” through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing and sales of their industry-defining analytic applications.

Bill holds a Master of Business Administration from the University of Iowa and a Bachelor of Science degree in Mathematics, Computer Science and Business Administration from Coe College.

Comments on “Bill’s Most Excellent Data Scientist Adventure”

  1. Bill,

    Nice post.

    It is good to see that the old BI/DW gang will not be out of work due to the rapid evolution of the market…;-)

    I am very pleased that we are now talking about all data rather than just the traditional structured view of the world.

    All we need now is an all encompassing data management platform with an indexing engine, natural search and BI driven by a voice interface that selects the correct predictive and descriptive models based on the problem description and the data available and we will be set.

    See you soon.

    Best,
    John

  2. Hey John, thanks for the comments. Yea, us old BI / DWH dogs are going to have jobs for a long, long time!! I think that’s good news?

    I think one of the more interesting things about BI is the impending integration of more predictive capabilities, likely integrated right into the tool itself (instead of in a separate analytics environment). The tools should have the basic capabilities to help guide you with respect to the right analytic models you should be using given the business problem, and provide guidance on the significance of the findings and insights.

    Regarding voice activated, I’m still worried about the co-worker running around the office yelling “Delete All” as we work at our new voice-enabled devices. In fact, that’s something that I could see you doing, John!!

  3. Who is Ted if you are Bill?

    See you at Strata. SiliconANGLE.tv will be doing live broadcast for three days #theCUBE – u need to be on again.
