June 22, 2016

Avoid These Five Big Data Governance Mistakes


If you’re embarking upon a big data project, then you’re likely running into one or more data management challenges. The decisions you make regarding how you enforce data governance and how you control data flows can make or break your project.

Here are five data governance mistakes you should avoid:

1. You Have No Data Governance Strategy

If you said to yourself, “Huh, what’s data governance?” then you’re likely making this mistake. Data governance refers to an overarching strategy that defines how organizations ensure the data they use is clean, accurate, usable, and secure.

As your organization embarks upon big data projects, you often solve one or more of these challenges in an ad-hoc manner. That approach may work for a while, but as you get big data successes under your belt and take on more complex projects, the lack of governance can come back to haunt you.

There are several components to a data governance strategy, including: defining processes that dictate how data is stored and protected; establishing standards and procedures that govern how authorized personnel access and use data; and putting controls and audits in place to ensure the rules are being followed.

Like most things in life and IT, data governance doesn’t work with a “set it and forget it” mentality. Start small with your data governance initiative and then grow it over time to meet the specific needs of your organization.

2. Relying Too Much on Unicorns

Many shops turn to their data scientists (i.e., unicorns) for all matters relating to big data. Like the poor miller who found he could turn straw into gold, corporate bosses expect their unicorns to magically turn raw data into actionable insight.

That approach likely won’t work for long. The truth is, if you’re lucky enough to have landed a unicorn, you’re paying them way too much to ask them to be “data janitors,” let alone be in charge of an entire data governance strategy.

Data governance is best led by a collection of data stakeholders from the IT department, line of business, and compliance. The Data Governance Institute also recommends hiring a Data Governance Officer (DGO).

3. Letting Schemas Run Wild

This mistake is often made in tandem with the implementation of a data lake. The forgiveness of HDFS enables you to throw just about any kind of data, with any kind of schema, into a Hadoop data lake and worry about sorting it out later.

This “schema on read” approach may work for some types of data, especially ones that change often and can’t be pigeonholed into preconceived schemas. But schema on read can only take you so far, and at some point, schemas must be enforced.

Hadoop brings a plethora of data processing engines like Spark, Pig, and good old MapReduce to help you give shape and form to data, that is, to make it usable. But schema on read, left unchecked, runs counter to core data governance principles, which require that you know what kind of data you’re storing and processing.
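To make the idea of enforcing a schema concrete, here is a minimal sketch in plain Python. It is not tied to Spark, Pig, or any Hadoop tool; the field names in EXPECTED_SCHEMA and the helper functions validate_record and partition_records are hypothetical, chosen only for illustration. The point is simply that each record is checked against a declared schema, and anything that does not conform is set aside for review rather than silently accepted.

```python
# Hypothetical schema for an illustrative "customer event" record.
# The field names and types are assumptions, not taken from any real system.
EXPECTED_SCHEMA = {
    "customer_id": int,
    "event_type": str,
    "amount": float,
}


def validate_record(record, schema):
    """Return a list of schema violations; an empty list means the record conforms."""
    problems = []
    # Check every declared field for presence and type.
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    # Flag fields the schema does not know about.
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


def partition_records(records, schema):
    """Split records into those that conform and those that need review."""
    accepted, rejected = [], []
    for record in records:
        problems = validate_record(record, schema)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected
```

In practice the same check would more likely live inside whatever processing engine reads the data lake, but the principle is the same: the schema is written down, and violations are visible.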

4. Storing Everything Forever

One of the important facets of a good data governance strategy is data retirement. At some point, every piece of data must enter that great recycling box in the sky. But all too often, organizations decide they’re never going to throw away another piece of data again.

If your organization follows this “keep everything” mandate, good luck. You’ll likely need lots of extra cycles just to keep the rotting trash heaps in order. Consider this statistic from Veritas’ Data Genomics Index 2016 survey, which found that 40 to 60 percent of the data an average organization stores these days is redundant, obsolete, or trivial (ROT).

Organizations spend millions of dollars a year storing data they’ll never use. This is not just a failure of good business sense—it’s a failure of data governance.
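A retirement policy does not have to start out complicated. The sketch below, in plain Python, flags datasets that have not been accessed within a retention window. Everything in it is an assumption made for illustration: the Dataset class, the one-year default window, and the legal_hold flag that exempts data which must be kept for compliance reasons.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Dataset:
    """Hypothetical catalog entry for a stored dataset."""
    name: str
    last_accessed: datetime
    legal_hold: bool = False  # assumed flag: data that must be kept regardless of age


def due_for_retirement(datasets, now, retention=timedelta(days=365)):
    """Return names of datasets not accessed within the retention window.

    Datasets under legal hold are never returned. The one-year default
    window is an illustrative assumption, not a recommendation.
    """
    cutoff = now - retention
    return [
        d.name
        for d in datasets
        if d.last_accessed < cutoff and not d.legal_hold
    ]
```

A real policy would also weigh regulatory retention periods and business value, but even a simple age-based report like this one makes the ROT visible.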

5. Not Using Power Tools

So there’s a lot that goes into having an effective data governance strategy. You need the right people in place to implement it, you need a good policy that lays out the priorities and general strategy, and you need good processes that help you implement data governance on a day-to-day basis.

But there’s also a case to be made for getting the right products in play. No one tool will solve every data governance challenge for you. But the big data ecosystem is delivering an increasingly compelling collection of tools that can help automate big chunks of it.

For example, tools such as Apache Atlas (incubating), which is the open source data governance framework that came out of Hortonworks’ Data Governance Initiative, are helping to enforce data controls in the Hadoop environment. Data quality tools are also helping to solve a particular aspect of the data governance challenge.

At the recent Leverage Big Data ’16 event, Asif Alam, the Global Business Director for the Technology Sector at Thomson Reuters, acknowledged that data governance was a big and growing challenge, but added that tools were making things better. “Problems we’re solving now were impossible to solve three years ago,” Alam said.

Related Items:

The Growing Menace of Data Hoarding

Data Science Operationalization in the Spotlight at Leverage Big Data ’16

Why Self-Service Prep Is a Killer App for Big Data