Unstructured Data Analytics Shouldn’t Be Such a Mess
The problem with searching for a needle in a haystack is that the process, by nature, is inefficient. So why has it become a popular analogy for analytics efforts within the enterprise? Because today’s analytics attempts – particularly for unstructured human data – are typically a mess.
Today’s analytics are often ad hoc and rely on incomplete or skewed sample sets. They commonly focus on only one narrowly defined data type. So for each glimmering “needle” of insight, it seems that heaps of data are scattered and cast aside, often at the cost of subsequent business efficiency.
When it comes to business content such as files, email, social media, IMs, calendar entries, images, and more, firms are failing to extract meaningful insight despite the potential wealth of information contained within.
There’s a reason for this: the data is poorly managed to begin with. Businesses are treating analytics as a separate business function from data governance, when it’s actually fundamentally dependent on it. Analysis occurs downstream, so by neglecting initial information governance infrastructure and practices, the enterprise is essentially sampling tiny random buckets of data from a whitewater river of information.
Many firms struggle even to understand what sort of unstructured content they have, let alone manage it effectively. History is partially to blame, to be sure; most attempts at managing unstructured content were hastily prompted by waves of regulatory and legal reform that demanded immediate action. A reactive response was triggered, and many of those initial “band-aid” information management fixes remain in place today. Simply scratching beneath the surface often reveals a tangled mess of siloed data platforms, such as enterprise content management systems (ECMs) and legacy systems, along with duplicated copies and even entire missing categories of data.
It’s no wonder that businesses are having trouble getting analytics value from this data.
For unstructured data, that trouble makes sense: the technology required to process large volumes of diverse human-generated content is nascent compared to traditional BI and ERP systems that mine more mature and structured forms of data. Add to that the general state of mismanagement of most unstructured content, and you have an environment in which data never reaches its potential or, worse, results in erroneous business decisions.
We don’t necessarily need to get rid of the extra data “hay” that these stacks are composed of; that would entail getting rid of potentially valuable content. We just need to completely rethink how the data itself is managed. No more data “haystacks” means no more disparate data sources, no more ad hoc sampling attempts, and no more dirty or duplicated data. Furthermore, consolidation vastly reduces the compliance risk associated with data mismanagement; with consolidated control, policies for management and eventual disposal can be implemented centrally and securely.
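As a rough illustration of what "no more duplicated data" can look like once content from several silos is visible in one place, the Python sketch below groups items by a hash of their content and reports any that exist in more than one location. Everything in it is an assumption made for illustration: the source names ("ecm", "email_archive", "file_share"), the file names, and the sample contents are hypothetical, and a real governance platform would also have to handle near-duplicates, metadata, and retention policy.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies byte-identical content."""
    return hashlib.sha256(data).hexdigest()


def group_duplicates(items):
    """Group (source, name, content) records by content fingerprint.

    Returns only the fingerprints that appear in more than one location,
    mapped to the list of (source, name) pairs that hold that content.
    """
    groups = defaultdict(list)
    for source, name, content in items:
        groups[content_fingerprint(content)].append((source, name))
    return {fp: locs for fp, locs in groups.items() if len(locs) > 1}


# Hypothetical records pulled from three separate silos.
sample_items = [
    ("ecm", "q3_report.docx", b"Q3 revenue summary"),
    ("email_archive", "attachment_0412.docx", b"Q3 revenue summary"),
    ("file_share", "notes.txt", b"Meeting notes"),
]

duplicates = group_duplicates(sample_items)
for fingerprint, locations in duplicates.items():
    # Show a shortened fingerprint and every place the same content lives.
    print(fingerprint[:12], locations)
```

The point of the sketch is only that duplicate detection becomes a simple grouping step when all content is reachable from one environment; when each silo is inspected separately, the same copy in the ECM and in the email archive is never compared.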
The lesson here is that data governance is the necessary foundation of all successful data analysis. The statistics axiom of “garbage in, garbage out” is used ad nauseam for a reason: it’s accurate and timelessly relevant.
So, at the risk of abandoning our original metaphor, we’re trying to build a data lake, pooling all available resources into a single environment where they can be managed and analyzed in real time. Forward-leaning businesses are already making strides to achieve this, and they’re not doing it with flashy analytics tools – those can come later, once the foundation is built. In a cohesive governance environment, analytics can be brought TO the data, rather than data being cumbersomely sampled and brought TO the stand-alone tools.
If large organizations want better analytics, they need to start with better information governance practices. After all, they probably should have been better all along.
About the author: Kon Leong is CEO and Founder of ZL Technologies. For two decades, he has been immersed in large-scale information technologies to solve big data issues for enterprises. His focus for the last 14-plus years has been on massively scalable archiving technology to solve records management and eDiscovery challenges for the government and private sectors. He speaks frequently at records management and eDiscovery conferences on cutting-edge trends and solutions. A serial entrepreneur, Mr. Leong earned a BS degree from Loyola (Concordia U) and an MBA from Wharton (U of Penn).