Big Data

What is “Just-Enough” Governance for the Data Lake?

By Rachel Haines | February 26, 2015

I’m a process geek. Specifically, I’m all about answering the question: how do organizations transform their cultures to become more data-centric and data-driven by implementing governance processes?

And while I’m all in on governance, I’m not a proponent of BIG governance. I don’t think the solution to our data access and use issues is to implement yet another bulky silo within the enterprise.

For some time I’ve been using the term “just-enough governance” to describe a Lean approach to information governance.

Just-enough governance is similar to the Lean Startup concept of building a Minimum Viable Product (MVP). Wikipedia defines MVP as:

“A minimum viable product has just those core features that allow the product to be deployed, and no more.”

From an enterprise perspective, just-enough governance means building only the processes and controls necessary to solve a particular business problem.

Data Lake Governance

As data lake implementations become more operationally mature, there are two major challenges which must be addressed: data ingestion and data consumption.

[Figure: the two major data lake challenges — data ingestion and data consumption]

Adding to the complexity of these challenges is the fact that not all data in the lake should be treated as equal. Governance should be fit for purpose. Some data should be conformed, predictable, very accurate, certified quality, and subsequently highly governed. Other data just needs to be made available with more agility, and requires less accuracy, quality, and conformity, and subsequently less governance.

Regardless of the level of governance, ingestion of data into the lake will require, at a minimum, a set of capabilities for:

  • The definition of the incoming data from a Business use perspective;
  • Documentation of the context, lineage, and frequency of the incoming data;
  • Security level classification (public, internal, sensitive, restricted) of the incoming data;
  • Documentation of creation, usage, privacy, regulatory, and encryption business rules which apply to the incoming data.
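The minimum ingestion capabilities above amount to capturing a structured metadata record for each incoming data set. A minimal sketch of what such a record might look like follows; all field names, classification tiers beyond the four listed in the text, and example values are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field
from enum import Enum

class Classification(Enum):
    """Security level classification, as listed in the ingestion capabilities."""
    PUBLIC = "public"
    INTERNAL = "internal"
    SENSITIVE = "sensitive"
    RESTRICTED = "restricted"

@dataclass
class IngestionRecord:
    """One on-boarding record per incoming data set (illustrative fields)."""
    name: str
    business_definition: str            # definition from a business-use perspective
    source_system: str                  # lineage: where the data originates
    ingest_frequency: str               # e.g. "daily", "hourly", "streaming"
    classification: Classification      # public / internal / sensitive / restricted
    business_rules: list[str] = field(default_factory=list)  # privacy, regulatory, encryption rules

# Hypothetical example of a record captured during on-boarding
record = IngestionRecord(
    name="customer_orders",
    business_definition="Orders placed through the web storefront",
    source_system="oms_prod",
    ingest_frequency="daily",
    classification=Classification.INTERNAL,
    business_rules=["mask card numbers before landing", "retain 7 years"],
)
```

Records like this become the raw material for the data catalog discussed below.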

Where data in the lake is to be more heavily governed, it is important that the ingestion policies and processes also include:

    • Identification of the data owner (sponsor) of the ingested data;
    • Identification of the data steward(s) charged with monitoring the health of the specific data items;
    • Continuous measurement of the data quality as it resides in the data lake. This includes:
      • Definition of the metrics and frequency which will be used to score the quality of the data;
      • Definition of the data scoring rubric for the data.
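The metrics, frequency, and scoring rubric above can be made concrete with a small sketch. The metric (completeness), weights, and grade thresholds here are illustrative assumptions, not the rubric any particular organization should adopt:

```python
def completeness(values):
    """Fraction of non-null values, expressed as a 0-100 score."""
    if not values:
        return 0.0
    return 100.0 * sum(v is not None for v in values) / len(values)

def score_quality(metric_scores, weights):
    """Weighted average of per-metric scores (each on a 0-100 scale)."""
    total = sum(weights.values())
    return sum(metric_scores[m] * w for m, w in weights.items()) / total

def grade(score):
    """Example rubric: map a numeric score to a certification tier."""
    if score >= 95:
        return "certified"
    if score >= 80:
        return "usable"
    return "raw"

# Hypothetical scoring run for one data element
scores = {"completeness": completeness([1, None, 3, 4]), "conformity": 90.0}
overall = score_quality(scores, {"completeness": 0.6, "conformity": 0.4})
```

A data steward would run a scoring pass like this at whatever frequency the quality policy sets, and write each result to the data catalog.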

Regarding the consumption of the data from the lake, policies and processes must be established to:

    • Publish and maintain a data catalog (containing all the metadata collected during ingestion and data quality monitoring) to all stakeholders;
    • Configure and manage access to data in the lake;
    • Monitor PII and regulatory compliance of usage of the data;
    • Log access requests for data in the lake.
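The last consumption requirement, logging access requests, can be as simple as appending timestamped entries to an auditable log. The entry schema and names below are illustrative assumptions:

```python
import datetime

def log_access_request(log, user, dataset, purpose, granted):
    """Append one auditable access-request entry (illustrative schema)."""
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "user": user,
        "dataset": dataset,
        "purpose": purpose,
        "granted": granted,
    }
    log.append(entry)
    return entry

# Hypothetical access request from an analyst
audit_log = []
log_access_request(audit_log, "analyst_jane", "customer_orders",
                   purpose="churn model", granted=True)
```

A log with this shape is what the compliance and audit processes described later would monitor and report on.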

“Just-enough Governance” for the Data Lake

Tactically, the number of policies and procedures necessary to kick-start support for the above operational capabilities boils down to a short list:

    • A comprehensive data on-boarding policy
      • This policy will define what metadata will be collected for each data element loaded to the data lake.
      • The on-boarding procedure will document data definition, context, lineage, classification, and business rules. Where the data element is to be highly governed, establishing ownership, decision rights, and accountability will also be addressed.
      • It is critical to the overall health and usability of the data lake that all metadata created as a result of data on-boarding be written to the data catalog.
    • An ongoing data quality policy
      • This policy will mandate that each data element has an appropriate level of ongoing data quality scoring and reporting, based on availability and use criteria.
      • The data quality procedure will document the metrics and the scoring rubric used to determine the level of quality for each data element. This procedure will guide data stewards when setting an appropriate frequency for on-going data profiling and reporting.
      • Data quality scores created as the result of periodic data profiling will be written to the data catalog.
    • Metadata management policy
      • This policy will stipulate how the metadata for data in the lake is to be stored and published to the data stakeholders.
      • Since metadata in the data catalog will be a significant resource for users of data in the lake, it is vital that the metadata management policy empower an editorial team to monitor policy compliance and keep the data catalog in sync with the actual data assets in the lake.
      • The metadata management procedures will document how, and by whom, metadata is to be maintained. The metadata management procedure may also be extended to provide procedures for:
        • Requesting new data sources for the data lake;
        • Requesting removal / archival of data in the data lake;
        • Modification of existing data for a data element (definition, ownership, classification, security requirements, etc.).
      • Note: On-boarding and data quality processes should be tightly linked to the metadata management policy and processes, in that information gathered during on-boarding or as a result of quality scoring must be stored in the data catalog.
    • Compliance and audit policy
      • This policy will identify the people accountable for compliance monitoring and reporting and lay out a cadence for ongoing, as well as quarterly / yearly, audits of data lake access and usage.
      • The compliance and audit procedure will identify tools and data to be used for compliance monitoring as well as provide guidance around specific reporting requirements related to regulatory and internal compliance and audit reporting.
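The note above about tight linkage is the key design point: on-boarding and quality scoring are separate processes, but they write to one shared catalog. A minimal sketch of that linkage, using a plain dict where a real implementation would use a metadata store or catalog service (all names are illustrative):

```python
# One shared catalog, written to by both governance processes
catalog = {}

def register_dataset(name, metadata):
    """On-boarding: create the catalog entry with ingestion metadata."""
    catalog[name] = {"metadata": metadata, "quality_history": []}

def record_quality_score(name, score):
    """Quality process: append a score to the same catalog entry."""
    catalog[name]["quality_history"].append(score)

# Hypothetical lifecycle: on-board once, then score periodically
register_dataset("customer_orders",
                 {"owner": "sales_ops", "classification": "internal"})
record_quality_score("customer_orders", 92.5)
```

Because both processes converge on one entry per data set, the catalog stays the single place stakeholders look for definition, ownership, and current quality.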

Once these basic policies and procedures are in place, and as new business needs arise, data lake governance should be expanded to include additional policies and procedures. An example mind map of potential data lake governance policies can be seen in the diagram below:
[Figure: example mind map of potential data lake governance policies]

Conclusion

“Just-enough governance,” as a basic principle, will help the enterprise identify what additional policies and procedures are necessary, and when is the right moment to commit resources to formalizing them.

“Just-enough governance” should help the enterprise avoid building a top-heavy, monolithic governance silo that is perceived as overhead and bureaucracy slowing down both Business and IT.

Adopting a “just-enough governance” strategy means:

    • Building only the policies and procedures that are needed to solve specific business problems;
    • Adopting a Lean build-measure-learn governance culture;
    • Making Trust, Transparency, and Discipline the hallmarks of data management across the enterprise.

“Just-enough governance” is one key tool on the transformational journey Bill Schmarzo calls the Big Data Business Model Maturity Index.
