Beware the Dangers of Dark Data
The amount of data we’re generating is doubling roughly every 18 months, a pace eerily similar to the one Moore’s Law describes for the growth of processing power. But much of that new data will remain invisible to those who would use it. The situation around this dark data threatens to derail big data initiatives before they can get off the ground.
Gartner defines dark data as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes.” By some estimates, including one by Tamr, up to 90 percent of the data stored in a typical organization is dark data.
The dark data problem is largely one of organizational structure, and has parallels in the master data management (MDM) issues that are threatening data lakes, the ongoing situation with maintaining multiple data silos, and Hadoop’s Wild West approach to data governance.
In a perfect world, data would be indexed and categorized in a rational manner that is self-evident to anybody with an interest in accessing that data. The problem is, few of us think in exactly the same way, and so we squirrel away our data using schemas that are trapped in our heads. So while one salesman may store his prospects by date, another may store his alphabetically, while a third one tracks them by the likelihood he’ll close the deal.
“It’s almost completely subjective,” says Greg Milliken, the vice president of marketing for content management software provider M-Files. “Companies are just drowning in data and things are getting lost.”
Chasing Dark Data
When data gets lost, opportunities are missed and work is redone, which hurts profits and keeps the company from being as competitive as it could be. The problem is, the CEO may not even be aware that much of the data in his company is dark, effectively cut off from the rest of the organization.
So how did we get into this predicament? Milliken explains:
“In any business, what we do is we sort of create these relationships between those various important information objects, which incidentally are often managed by other system of record–a CRM system for customers or an ERP for vendors and accounting process,” he says. “When you marry those two things, you create this ability to discover data. Data goes dark often because there might be relationships that establish relevance that might not match up in what we think of as a typical search.”
ERP and CRM applications rose to prominence in the 1990s largely because they unified the data, documents, processes, and people needed to get a job done. These ERP and CRM systems mostly worked with structured data, which relational databases like DB2 kept neat and tidy for us. While these systems of record are still important, businesses today are looking for ways to bring semi-structured and unstructured data into the loop, which is not something CRM and ERP systems were designed to handle.
To solve the problem, companies need a way to connect the structured data kept in the ERP and CRM systems with the unstructured data (i.e., “big data”) they’re increasingly trying to leverage for a competitive advantage. That’s easier said than done, Milliken says. “That data is fundamental…to establishing the relevance of the unstructured content,” he says. “The complete picture of the document or data problem is not clear without this.”
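One way to picture that connection is a small Python sketch that links free-text documents to structured customer records by looking for the customer's name in the text. Everything here is hypothetical: the `CUSTOMERS` and `DOCUMENTS` data, the `link_documents` helper, and the naive name-matching rule are illustrative assumptions, not how M-Files or any CRM product actually works.

```python
from collections import defaultdict

# Structured side: rows as they might come from a CRM (hypothetical data).
CUSTOMERS = {101: "Acme Corp", 102: "Globex"}

# Unstructured side: free-text documents (hypothetical data).
DOCUMENTS = {
    "proposal.txt": "Revised pricing proposal for Acme Corp, Q3 renewal.",
    "notes.txt": "Call with Globex about support; Acme Corp was mentioned.",
    "memo.txt": "Office move scheduled for next month.",
}


def link_documents(customers, documents):
    """Map each customer ID to the documents that mention the customer by name."""
    links = defaultdict(list)
    for doc_name, text in documents.items():
        lowered = text.lower()
        for customer_id, name in customers.items():
            # Naive substring match; a real system would need something sturdier.
            if name.lower() in lowered:
                links[customer_id].append(doc_name)
    return dict(links)


links = link_documents(CUSTOMERS, DOCUMENTS)

# Documents with no link to any structured record are candidates for going dark.
linked = {doc for docs in links.values() for doc in docs}
unlinked = sorted(set(DOCUMENTS) - linked)
```

In this toy data set, `memo.txt` ends up in `unlinked`: nothing in the system of record points to it, so it would only surface through a lucky keyword search.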
Big Data Tech’s Dark Side
Companies are increasingly adopting technologies like Hadoop and NoSQL databases, which allow them to collect any type of data and hammer it into the shape, or schema, that they need when they access the data.
But alas, they’re finding it difficult to keep their data straight, which is one reason why Hortonworks helped found the Data Governance Initiative earlier this year. “You know that drawer in your kitchen?” Hortonworks product manager Tim Hall told Datanami earlier this year. “The junk drawer. We don’t want Hadoop to turn into that.”
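The "junk drawer" risk follows from how schema-on-read works, which a short Python sketch can illustrate. The records, field names, and the `read_with_schema` helper below are invented for illustration and do not use any Hadoop or NoSQL API; the point is only that raw data is stored as-is and each reader imposes its own shape when it accesses the data.

```python
import json

# Raw events stored as-is; no schema was enforced at write time (hypothetical data).
RAW_LINES = [
    '{"customer": "Acme", "amount": "120.50", "ts": "2015-03-01"}',
    '{"customer": "Globex", "amount": 75, "note": "phone order"}',
    '{"cust_name": "Initech", "amount": "19.99"}',
]


def read_with_schema(lines):
    """Project each raw record onto the shape this particular reader needs."""
    for line in lines:
        record = json.loads(line)
        yield {
            # Tolerate two spellings of the same field.
            "customer": record.get("customer") or record.get("cust_name"),
            # Coerce amounts that arrived as either strings or numbers.
            "amount": float(record.get("amount", 0)),
        }


rows = list(read_with_schema(RAW_LINES))
total = sum(row["amount"] for row in rows)
```

Each reader has to know about quirks such as `cust_name` versus `customer`. When that knowledge lives only in one person's code or head, the remaining records are effectively dark to everyone else.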
Search engines such as Solr and Elasticsearch definitely have a role to play in rooting out dark data. Ever since “Google” became a verb, we’ve relied on the power of search engines to organize ourselves—our emails, our pictures, and our spreadsheets. But even search engines have their limits.
“You begin to lose context in pure search,” Milliken says. “[The solution] is about search, but it’s also about having some structure to the data. You just want to be focused on data that’s most important to you. [You want] a system that’s able to group information and more readily reveal what’s more relevant to you, and cut out the things that are not relevant to you.”
We’d all be better off with such an approach, but according to Milliken, it requires having the discipline to properly tag items or mark them up with metadata that appropriately categorizes the content and enables it to match up with how people want to access it. Software such as M-Files’ content management system can keep data in the light by creating some structure around it, including leveraging text analytics where appropriate to automatically tag some types of content. But at the end of the day, it comes down to getting buy-in from the people.
While it may sound intimidating, it really isn’t very hard, Milliken says. “It ends up being extremely intuitive. It’s way harder to figure out where to put stuff in some structure based on some rules or processes that you have to learn. If it’s more objective, like proposals related to that customer, that becomes very intuitive. We think at least.”
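The metadata-driven retrieval Milliken describes, such as "proposals related to that customer," might look something like the following Python sketch. It is a toy under stated assumptions: the `Item` class, the `TAG_RULES` keyword table standing in for real text analytics, and the `auto_tag` and `find` helpers are all hypothetical and are not M-Files' API.

```python
from dataclasses import dataclass, field

# Hypothetical keyword rules standing in for real text analytics.
TAG_RULES = {
    "proposal": {"proposal", "quote", "pricing"},
    "contract": {"contract", "agreement"},
}


@dataclass
class Item:
    name: str
    text: str
    metadata: dict = field(default_factory=dict)


def auto_tag(item):
    """Set metadata['type'] to every tag whose keywords appear in the text."""
    words = set(item.text.lower().split())
    item.metadata["type"] = sorted(
        tag for tag, keywords in TAG_RULES.items() if words & keywords
    )
    return item


def find(items, **criteria):
    """Return names of items whose metadata satisfies every criterion."""
    def matches(item):
        for key, wanted in criteria.items():
            value = item.metadata.get(key)
            # List-valued metadata matches if it contains the wanted value.
            if isinstance(value, list):
                if wanted not in value:
                    return False
            elif value != wanted:
                return False
        return True

    return [item.name for item in items if matches(item)]


items = [
    auto_tag(Item("a.docx", "Pricing proposal for renewal", {"customer": "Acme"})),
    auto_tag(Item("b.pdf", "Signed service agreement", {"customer": "Acme"})),
    auto_tag(Item("c.txt", "Lunch menu for Friday", {})),
]

acme_proposals = find(items, customer="Acme", type="proposal")
```

The lookup asks what an item is and whom it relates to, not which folder someone chose to put it in, which is the objective framing Milliken contrasts with learned filing rules.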
That approach makes a lot of sense. There’s no silver bullet to the dark data problem, but with enough forethought and perseverance, the data can be brought to the light—and kept there forever.