What Librarians Can Teach Us About Managing Big Data
People who have worked in data governance for any length of time wear a certain look of despair, stemming mostly from a continual struggle to be understood and seen as relevant by key business leaders. On the one hand, executives boldly extol the importance of data and speak with deep respect about data as a corporate asset. On the other hand, when it comes time to take true ownership of the governance and stewardship of these assets, that executive support seems to evaporate. Businesses want the benefits of data and analytics, yet still rely on an outdated, IT-led Business Intelligence model to define and manage data, only to be disappointed by the results. When Data Governance leaders explain how they can help, business leaders often mistake the offer of help for an offer to take responsibility, leading to further disenchantment and frustration. Speakers at Data Governance conferences often remind me of Rodney Dangerfield: “I don’t get no respect.”
Ironically, big data is about to fix this problem.
We are experiencing a business revolution powered by analytics drawn from any data that business professionals can get their hands on: transactional data, social media data, device data from the internet of things, pictures, e-mails…anything. While data science gets a lot of attention, data is the fuel of data science, and rapid access to information inside and outside the corporate firewall is the key to finding analytic differentiation before your competitors do. What, then, becomes of the role of data governor? For effective analytics, it must shift to something more akin to a data librarian: the person who can get you the research material you need, whenever you need it. Thus begins the transformation of data governance from a perceived impediment to data into a driving force behind business innovation, supplying the Dewey Decimal System that will organize and catalog data.
The Data Librarian
Traditional Data Warehouses do not work unless there is a common vocabulary and a shared understanding of a problem, but consider how things work in academia. Every day, tenured professors and students pore over raw material looking for new insights into the past and new ways to explain culture, politics, and philosophy. Their sources of choice: archived photographs, primary documents found in a city hall, monastery, or excavation site, scrolls from a long-abandoned cave, or voice recordings from the Oval Office – in short, anything in any kind of format. And who can help them find what they are looking for? A skilled librarian who knows how to search effectively not only for books but for primary source material across the world, and who can understand, create, and navigate a catalog to accelerate a researcher’s efforts. Does this librarian promise a positive outcome? Of course not. The librarian provides access to source material, and the researcher takes it from there with varying degrees of success.
The Radical Shift to the Wikipedia Mindset
Today, we are all now researchers of varying degrees who want to know things whenever we want. What is the temperature outside so I can decide if I need a coat? For whom should I vote? What is the latest in the Syrian Refugee crisis? Our primary tool is, of course, the internet, where we use search engines like Google and Bing to find items that we are looking for. These search engines have become our data librarians, tagging and indexing content on the web so that we can access it quickly. Now for a provocative question:
Is all of this data right?
It is important to recognize the emerging reality of data quality: sometimes it matters, and sometimes it doesn’t. Businesses will continue to require regulatory and financial reporting that conforms to standards, principles, and laws. For this, conventional data governance and stewardship techniques will reign. However, consider for a moment what a small fraction of overall information needs these cases constitute. For some users, principally analytics and data science users, fast access to imperfect data is far more important than slow access to perfect data. Consider the words of George S. Patton: “A good plan violently executed now is better than a perfect plan executed next week.” It’s the same principle – speed trumps quality.
A Framework for Meeting Differing Data Needs
In the 1990s, when process improvement was a new, emerging trend, Michael Hammer adapted the project-management time-quality-cost constraint triangle into the model outlined in his book Faster, Cheaper, Better.
This model reinforced the difficulty of having a single process that satisfies all three criteria. If we apply the same framework to data access, we realize that effective data governance requires two different models, which look something like the following:
This model provides guidelines for “just enough” governance. Analytics users prefer the best available data now over cleansed data months from now, so Model A favors high breadth of management with very shallow depth and should be viewed as the first step of governance: as data is on-boarded, it is simply described and catalogued for future use. Models B and C require a higher quality threshold because they become part of the business decision-making fabric. Model B should be thought of as a means to harden and enrich data from Model A. Model C should be used for localized use cases that do not require extensive cross-business alignment, which is the major consumer of time. Ultimately, Model C works well for localized business needs, and the metadata captured in these processes becomes valuable as the data is leveraged across business units in Model A or Model B. By applying this framework, mature governance organizations can simultaneously meet local needs quickly, serve analytical needs in an agile fashion, and maintain the quality standards required for the most important reporting through a curation process that might take months.
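As a rough illustration of the tiered approach (a sketch of my own, not a design from the article – all class and field names here are hypothetical), the librarian-style workflow might look like this: data is registered quickly with minimal metadata on arrival (Model A), and only promoted to a governed tier once stewardship and quality requirements are met.

```python
from dataclasses import dataclass, field

# Illustrative two-tier catalog sketch. "raw" corresponds to lightly
# described Model A data; "governed" corresponds to hardened Model B data.

@dataclass
class CatalogEntry:
    name: str
    source: str
    tier: str = "raw"                       # "raw" or "governed"
    tags: list = field(default_factory=list)
    steward: str = ""                       # assigned only when governed

class DataCatalog:
    def __init__(self):
        self.entries = {}

    def register(self, name, source, tags=()):
        """Model A: describe and catalog data as it is on-boarded."""
        self.entries[name] = CatalogEntry(name, source, tags=list(tags))
        return self.entries[name]

    def promote(self, name, steward):
        """Model B: harden a raw entry by assigning stewardship."""
        entry = self.entries[name]
        entry.tier = "governed"
        entry.steward = steward
        return entry

    def search(self, tag):
        """Librarian-style lookup: find entries by tag, in any tier."""
        return [e for e in self.entries.values() if tag in e.tags]

catalog = DataCatalog()
catalog.register("web_clickstream", "s3://landing/clicks", tags=["behavioral"])
catalog.promote("web_clickstream", steward="Data Governance team")
print([e.name for e in catalog.search("behavioral")])  # ['web_clickstream']
```

The point of the sketch is that registration is cheap and immediate, while promotion carries the extra obligations – exactly the separation between fast access to imperfect data and slower curation of governed data.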
Data Governance of Big Data is not simply conventional stewardship applied to bigger volumes of data. Rather, it is a rethinking of data needs for different users and the establishment of a two-tiered system that allows some users access to raw data without sacrificing the quality required for key business functions. In the Big Data world, Data Governance professionals will need to think of themselves as data librarians, helping knowledge workers locate the best available data. They will then establish processes that allow those knowledge workers to identify the data most important to conform and cleanse in order to move it to the second tier. These are not competing objectives; they are complementary and essential in the age of information.