June 14, 2023

Getting the Upper Hand on the Unstructured Data Problem


Unstructured data accounts for the vast majority of data stored in the world today, and it’s growing at a geometric rate. Organizations today may have petabytes of the stuff spread around various object stores and file systems in the cloud and on-prem. While many want to get value out of it with AI and advanced analytics, the simple act of keeping it costs money and increases security and privacy risks. So what’s an unstructured-data hoarder to do?

Krishna Subramanian, the president, COO, and co-founder at unstructured data management software vendor Komprise, recently shared some insights into the unique problems posed by unstructured data management, as well as how her company is addressing those needs with the latest release of Komprise’s software.

Eighty-five to 90% of the world’s data is unstructured, according to Subramanian. It includes words and pictures, and many things in between, such as PDFs and emails, but also some very big data sources, like genomics, X-rays, digital pathology, and log data from autonomous vehicles.

“When we say unstructured data, what we mean by that is any data that’s not sitting in a database, which is pretty much 85% to 90% of all data today,” Subramanian said. “So it’s data that’s generally stored as files or as objects in the cloud.”

In 2022, IDC said 175 ZB will be created by 2025 (Image courtesy IDC)

The problem with unstructured data is that it keeps on growing. Today’s distributed file systems and cloud object stores have practically unlimited storage capacities. It’s so easy to spin up another data lake, and so that’s the approach taken by many organizations. But they never seem to delete data or drain the data lakes, and so the data just keeps growing.

“You have to understand that unstructured data is growing massively. Very quickly it’s gone from 10 terabytes looking like a big number to now we have customers that are 100 petabytes-plus and they’re already thinking exabytes,” Subramanian told Datanami.

“Most companies have many, many storage silos in different data centers where this data is sitting, and quite often, they just don’t even know how much data they have,” she continued. “Users are generating data, applications are generating data, and IT is usually just tasked with storing and protecting that data. So IT doesn’t often know why are people creating this data, how fast is it growing, and what data is actually hot and what’s cold.”

‘No Good Tools’

Komprise is the third startup for Subramanian and her co-founders, CEO Kumar Goswami and CTO Michael Peercy, with their last startup being acquired by Citrix Systems. Before founding Komprise in 2014, the trio often discussed the unstructured data management problem with previous customers.

Unstructured data is pretty much everything that isn’t stored in a database (Andrea-Danti/Shutterstock)

“[The customers said] ‘We’re having this problem. We’re drowning in unstructured data. We know how to manage databases, but this data is a beast,’” Subramanian said. “‘We don’t really know how to manage it. There are no good tools.’”

The storage aspect of the unstructured data management problem has been solved, thanks to object stores and distributed file systems. What customers still needed was software that could look across all of their data silos and create a unified view of the data.

“What we really need is a software solution that can look at data no matter where it’s stored, can tell us how much data we have, can tell us what’s hot, what’s cold, how much it’s costing us, who’s using it, and then it can move data from one place to another,” Subramanian said. “So that’s what we need. And that’s why we created Komprise. We needed a data management software service which does exactly that.”

Global View of Unstructured Data

Komprise’s tools provide a variety of capabilities for unstructured data management. According to Subramanian, there are four main benefits that Komprise’s software delivers to customers.

First is visibility into all of a customer’s unstructured data. While individual data storage providers may provide a view into their particular silo, Komprise delivers a global index that tracks metadata, such as file name, directory name, file owner, date created, date modified, where it’s located, and how long it’s been around, across multiple data silos.

“When you point Komprise at your different storage environments, what Komprise does is it quickly indexes all the data,” Subramanian said. “So anything you point us at, we not only give you analytics on how much you have and you know how much it’s costing you, but in the background we actually create a full index of all the data.”
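To make the idea of a cross-silo metadata index concrete, here is a minimal Python sketch. It is not Komprise’s implementation or API; it assumes each “silo” is simply a local directory, and the names `FileRecord`, `build_index`, and `summarize` are hypothetical, chosen for illustration only.

```python
import os
from dataclasses import dataclass


@dataclass
class FileRecord:
    """One index entry: which silo a file lives in, plus basic metadata."""
    silo: str
    path: str
    size_bytes: int
    modified: float   # last-modified time, seconds since the epoch
    accessed: float   # last-access time, seconds since the epoch


def build_index(silos):
    """Walk every silo and collect metadata only (file contents are never read).

    `silos` maps a hypothetical silo name to a root directory.
    """
    records = []
    for silo, root in silos.items():
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full_path = os.path.join(dirpath, name)
                try:
                    st = os.stat(full_path)
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                records.append(
                    FileRecord(silo, full_path, st.st_size, st.st_mtime, st.st_atime)
                )
    return records


def summarize(records):
    """Return {silo: (file_count, total_bytes)} for a quick per-silo view."""
    totals = {}
    for rec in records:
        count, size = totals.get(rec.silo, (0, 0))
        totals[rec.silo] = (count + 1, size + rec.size_bytes)
    return totals
```

Calling `summarize(build_index({"nas": "/mnt/nas", "lake": "/mnt/lake"}))` (both paths are placeholders) would give a per-silo file count and byte total, which is the kind of “how much do we have, and where” answer the article describes.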

By tracking the age of data and how often it’s used, Komprise can help identify data that’s no longer providing value and empower users to cull it. The company claims customers can save 80% of the cost of unstructured data storage by moving data to cheaper storage.
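The arithmetic behind that kind of claim is simple to sketch. The Python below splits data into hot and cold by last-access age and compares storage cost before and after moving the cold portion to a cheaper tier. All figures (data sizes, the 365-day cutoff, and the per-terabyte prices) are made-up illustrations, not Komprise or vendor numbers, and the function names are hypothetical.

```python
def split_by_age(files, cutoff_days):
    """Sum terabytes into hot and cold buckets.

    `files` is an iterable of (size_tb, days_since_last_access) pairs.
    """
    hot = sum(size for size, age in files if age <= cutoff_days)
    cold = sum(size for size, age in files if age > cutoff_days)
    return hot, cold


def tiering_savings(hot_tb, cold_tb, primary_cost, archive_cost):
    """Return (cost_before, cost_after, fraction_saved) per billing period.

    Costs are per terabyte; cold data moves from the primary to the archive tier.
    """
    before = (hot_tb + cold_tb) * primary_cost
    after = hot_tb * primary_cost + cold_tb * archive_cost
    saved = (before - after) / before if before else 0.0
    return before, after, saved


# Illustrative data set: 1,000 TB in total, 800 TB of it untouched for over a year.
files = [(120.0, 10), (80.0, 200), (500.0, 400), (300.0, 900)]
hot_tb, cold_tb = split_by_age(files, cutoff_days=365)

# Assumed prices: 20 per TB on primary storage, 2 per TB on an archive tier.
before, after, saved = tiering_savings(hot_tb, cold_tb, primary_cost=20.0, archive_cost=2.0)
print(f"before={before:.0f} after={after:.0f} saved={saved:.0%}")
# prints: before=20000 after=5600 saved=72%
```

The savings fraction depends entirely on how much of the data is cold and on the price gap between tiers, so real results will vary with both.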

Secondly, Komprise enables users to search all their data using that global index. Users can search by typing in their own queries or via an API. An autonomous car company could use this to identify specific images stored across their data silos.

“You can search it and say ‘I want to find all pictures I took of this model car, when it was near a stop sign,’ and then Komprise will show you all the pictures that you took of that car, even if some of that might be in a data center in Malaysia, some might be in your cloud, some might be in a different data center,” Subramanian said.
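A query like the one Subramanian describes can be pictured as a filter over index entries that carry descriptive tags. The sketch below is an assumption-laden illustration, not Komprise’s query language or API: the `IndexedObject` type, the `search` function, the silo names, and the tag keys `car_model` and `scene` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedObject:
    """A hypothetical index entry: location plus free-form tags."""
    silo: str
    path: str
    tags: dict = field(default_factory=dict)


def search(index, **wanted):
    """Return entries whose tags match every requested key/value pair."""
    return [
        obj for obj in index
        if all(obj.tags.get(key) == value for key, value in wanted.items())
    ]


# Entries spread across three made-up silos.
index = [
    IndexedObject("datacenter-malaysia", "/captures/0001.jpg",
                  {"car_model": "model-a", "scene": "stop_sign"}),
    IndexedObject("cloud-bucket", "captures/0002.jpg",
                  {"car_model": "model-a", "scene": "highway"}),
    IndexedObject("datacenter-us", "/captures/0003.jpg",
                  {"car_model": "model-a", "scene": "stop_sign"}),
    IndexedObject("cloud-bucket", "captures/0004.jpg",
                  {"car_model": "model-b", "scene": "stop_sign"}),
]

hits = search(index, car_model="model-a", scene="stop_sign")
for hit in hits:
    print(hit.silo, hit.path)
```

The point of the sketch is that the query runs against the index rather than against each storage system, so results from different locations come back in a single list.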

Thirdly, Komprise enables users to create data movement policies, which are automatically executed by the software. Think mainframe job scheduler, but for unstructured data in the cloud.

“You can create a policy saying ‘Anything that is over a year old, just transparently move it to the cloud,’” Subramanian said. “But we’ll add a local link so it looks like the file is still here even though it’s sitting in the cloud. We do that kind of tiering and data migration where we could make a copy of it into Databricks if you wanted a copy.”
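As a rough illustration of an age-based tiering policy, the Python sketch below moves files that have not been modified for a year into an archive directory and leaves a symbolic link at the original path. This is only an analogy: a symlink stands in for the “local link” mentioned in the quote, a local directory stands in for cloud storage, and the function name `tier_old_files` is hypothetical. It also assumes the archive directory sits outside the source tree and that the platform allows creating symlinks.

```python
import os
import shutil
import time

ONE_YEAR_SECONDS = 365 * 24 * 3600


def tier_old_files(source_root, archive_root, max_age_seconds=ONE_YEAR_SECONDS, now=None):
    """Move files older than `max_age_seconds` and leave a symlink behind.

    Returns the list of moved paths, relative to `source_root`.
    """
    now = time.time() if now is None else now
    archive_root = os.path.abspath(archive_root)  # keep link targets absolute
    moved = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            if os.path.islink(src):
                continue  # already tiered on an earlier run
            if now - os.stat(src).st_mtime <= max_age_seconds:
                continue  # still recent enough to stay on primary storage
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(archive_root, rel)
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)
            os.symlink(dst, src)  # the original path keeps working
            moved.append(rel)
    return moved
```

Running `tier_old_files("/data/projects", "/mnt/archive")` (placeholder paths) on a schedule would approximate the “anything over a year old” rule; applications that open the original path follow the link and read the archived copy.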

Fourth, Komprise creates tags for all the data and data movement policies and results, and keeps track of those tags for later use.

In May, Komprise updated its software with several new capabilities, including a new share-based access control mechanism that leverages Active Directory or LDAP to enable groups of users to gain access to Komprise workflows.

This will lower the barrier of entry for the people who need access to data, which is typically the business users or the researchers, not the IT department, Subramanian said. However, this approach gives IT what it wants and needs, which is the ability to enforce access and keep the data secure, she said.

Komprise also launched a new user interface that gives business users or researchers the ability to directly explore and access files, as opposed to writing a query or running a search. “They just want to click down and just find what they want, and just pick it,” Subramanian said. “So it’s a slew of those kinds of features to improve the collaboration between users and IT.”

The Campbell, California, company appears to be gaining traction. In February it announced that it doubled revenues in 2022 for the third consecutive year, including 100 customers moving to Microsoft Azure. Another customer is the drug manufacturer Pfizer, which used Komprise to migrate 2 PB of unstructured data to Amazon S3 in 2020, saving 75% on the cost of cold storage.

As the world’s organizations generate the 175 to 200 zettabytes of data that IDC projects for 2025, they will need more tools for managing unstructured data. Komprise provides one such tool.
