Bill’s Most Excellent Data Scientist Adventure
This past week I’ve had a chance to attend EMC Education Services’ new class titled “Data Science and Big Data Analytics.” The course provides a hands-on practitioner’s approach to the techniques and tools required for Big Data Analytics. My primary interest in the course was to see if it was really possible to transition an old school BI and data warehouse guy (me) into this new world of Data Science. After attending the course, I believe that the projected worldwide shortage of data scientists can be filled by the army of BI and data warehouse professionals, but it’s going to take work, effort and courses like this to help fill that gap.
So first about the class. This was the first GA release of the course and I have to say that the course was most excellent! The material was well organized with a nice mix of classroom lectures and hands-on workshops. The instructors were knowledgeable, well-prepared and very engaging. The workshops were very well laid out, providing “guard rails” for each of the assignments that allowed the students to quickly move through the exercise while still providing opportunities for exploration with the tools, techniques and case studies. We had about 20 students in the class from a variety of backgrounds and experiences, which added to the richness of the class and the classroom interactions. If I had a complaint, it would be that the course was so popular that the classroom felt a bit crowded at times. I also needed a bigger display to have the 4 or 5 tools I was learning open at the same time! All in all, a four-star rating!!
Business Intelligence Versus Data Analytics
The graphic below is a pretty common way to think about the worlds of business intelligence (BI) and data science (predictive analytics). Let me first say that to think of these worlds in isolation from each other is a big mistake. Business intelligence is typically thought of as being retrospective, a rearview mirror view of the business, focusing on what happened (hindsight) and what is happening (insight). Predictive analytics is typically thought of as being forward-thinking, a windshield view of the business, focusing on what is going to happen (foresight). However, many BI implementations do include time series analysis and what-if modeling in order to help the business make forward-looking decisions (e.g., what price should I charge, what customers should I target, how many clerks am I going to need).
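To make the rearview mirror versus windshield distinction a bit more concrete, here is a minimal sketch in Python. The monthly sales figures and the function names are hypothetical, and the straight-line trend is only a simple stand-in for the far richer set of predictive models a data scientist would actually consider.

```python
from statistics import mean


def hindsight_summary(sales):
    """BI-style retrospective view: summarize what already happened."""
    return {"total": sum(sales), "average": mean(sales), "latest": sales[-1]}


def fit_linear_trend(sales):
    """Least-squares line through the points (0, sales[0]), (1, sales[1]), ..."""
    xs = range(len(sales))
    x_bar, y_bar = mean(xs), mean(sales)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, sales))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept


def forecast(sales, periods_ahead=1):
    """Predictive-style view: extend the fitted trend past the last period."""
    slope, intercept = fit_linear_trend(sales)
    return intercept + slope * (len(sales) - 1 + periods_ahead)


# Hypothetical data, for illustration only.
monthly_sales = [100, 110, 120, 130]

print(hindsight_summary(monthly_sales))  # what happened
print(forecast(monthly_sales, 1))        # what the trend suggests for next month
```

The first function is the kind of answer a typical BI report gives; the second pair is the simplest possible step toward foresight, and it needs at least two periods of history to fit a line.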
One of the biggest differences between the BI analyst and data scientist is the environment in which they work. BI specialists tend to work within a highly structured, data warehouse environment. It takes a yeoman’s effort to add a new data source (often this effort is measured in months) or get the approval to keep more granular data and/or more history in the data warehouse.
However, the data scientist has historically created a separate “sandbox” in which to load whatever data they can get their hands on (from both internal and external data sources). Once the data is in this environment, the data scientist is free to do with it whatever they wish (e.g., data profiling, data transformations, creation of new composite metrics, and model development, testing and refinement).
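As a rough illustration of that sandbox work, the sketch below profiles a column and derives a composite metric using only the Python standard library. The customer records, the column names and the “revenue per visit” metric are all made up for the example; real sandbox work would more likely happen in R, SQL or a similar tool.

```python
def profile_column(rows, column):
    """Basic data profiling: row count, missing values, distinct values, range."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
    }


def add_revenue_per_visit(rows):
    """Derive a hypothetical composite metric, leaving gaps where inputs are unusable."""
    enriched = []
    for row in rows:
        revenue, visits = row.get("revenue"), row.get("visits")
        if revenue is None or not visits:
            metric = None  # missing revenue, or zero/missing visits
        else:
            metric = revenue / visits
        enriched.append({**row, "revenue_per_visit": metric})
    return enriched


# Hypothetical records loaded into the sandbox.
customers = [
    {"id": 1, "revenue": 200.0, "visits": 4},
    {"id": 2, "revenue": None, "visits": 2},
    {"id": 3, "revenue": 90.0, "visits": 0},
]

print(profile_column(customers, "revenue"))
print(add_revenue_per_visit(customers))
```

Profiling first is what reveals the missing revenue value and the zero-visit customer, which is why the derived metric has to handle both cases.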
Data Analytics Lifecycle
Let’s start with the Data Analytics Lifecycle (see chart below) to gain an understanding of how a data scientist works.
This chart outlines the data scientist’s discovery and analysis process and the key steps within it. It also highlights the highly iterative nature of the data scientist’s work. Let’s take a look at the specific tasks and skills required for each of the Data Analytics Lifecycle steps and see how the typical BI analyst’s skills map to that step.
In summary, making the transition from BI specialist to data scientist is going to require the following new skills and capabilities:
- Deep dive into the multitude of statistical and predictive analytics models. Without a doubt, you’re going to have to get out your college statistics and advanced statistics books and spend time learning how and when to apply the right analytic models given the business situation.
- Learning new analytic tools like R, SAS and MADlib. R, for example, is an open source product for which lots of tools (like RStudio) and much training are available free and online.
- Learning more about Hadoop and related Hadoop products like HBase, Hive and Pig. There is no doubt that Hadoop is here to stay, and there will be a multitude of opportunities to use Hadoop in the data preparation stage. It’s the perfect environment for adding structure to unstructured data, performing advanced data transformations and enrichments, and profiling and cleansing your data coming from a multitude of data sources.
In my next blog I’m going to share some training and tool recommendations. In the meantime, give serious consideration to attending EMC’s new Data Science and Big Data Analytics training class (http://education.emc.com/guest/campaign/data_science.aspx). It’s well worth the time investment.
Finally, for those of you coming to Strata in Santa Clara at the end of the month, please stop by and see me. I’ll be presenting on Tuesday as part of the Jumpstart track and holding office hours on Thursday, and you can also find me in EMC booth 201. I’d love to hear your thoughts and observations on this rapidly expanding world of data science.