Okera Bolsters Access Control for Unstructured Data
Some of the most interesting data that’s analyzed by organizations is unstructured data, including images, videos, and text files. But the lack of structure of that data poses regulatory challenges to organizations, which face potential legal jeopardy if consumer data rights are violated. Okera today released new data access control software that it claims can alleviate much of the regulatory burden hanging over big data analytic activities.
Okera emerged from stealth about a year ago with a fine-grained data access control system that was designed to solve a vexing big data problem: How do you give analysts, data scientists, and other stakeholders access to multiple disparate data sources, such as HDFS, S3, and relational databases, while at the same time complying with strict new data regulations like GDPR and CCPA?
The approach taken with the Okera Active Data Access Platform (ODAP) is to move the data access abstraction up a level. Instead of defining and enforcing access control in each individual file system, streaming data platform, or database, the company lets customers configure that access in ODAP, which then federates control down to the individual data stores.
The software did that by essentially viewing data as a series of tables, including data stored in file systems like HDFS and S3. Through that table construct, ODAP could enforce column-level access control, granting different user groups different levels of access to the same file.
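As a rough illustration of that idea (not Okera's implementation), the sketch below treats a CSV file as a table and returns only the columns a group is entitled to see. The policy dictionary, group names, column names, and the `read_as_table` helper are all hypothetical.

```python
import csv
import io

# Hypothetical policy: the columns each user group may read from the same file.
COLUMN_POLICY = {
    "analysts": {"user_id", "country"},
    "data_scientists": {"user_id", "country", "email"},
}

def read_as_table(csv_text, group):
    """Parse CSV text as a table and keep only the columns the group may see."""
    allowed = COLUMN_POLICY.get(group, set())  # unknown groups see no columns
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{k: v for k, v in row.items() if k in allowed} for row in reader]

# Made-up sample data standing in for a file in HDFS or S3.
SAMPLE = "user_id,country,email\n1,DE,a@example.com\n2,US,b@example.com\n"

for group in ("analysts", "data_scientists"):
    print(group, read_as_table(SAMPLE, group))
```

Both groups read the same underlying file, but only the second group ever sees the `email` column.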
The first release of ODAP primarily targeted structured data, says Amandeep Khurana, Okera co-founder and CEO of the San Francisco, California, company. With today’s update, the company has added support for managing access to unstructured data, while also delivering more secure access to data stored on S3.
By providing file-level access control to unstructured data, ODAP can streamline access to data for a greater number of users and use cases while adhering to strict regulations, Khurana says.
“Let’s say you have a CSV in HDFS, but you don’t really know the structure of the CSV,” Khurana says. “You start with HDFS Access Control Lists, then you can move it into Sentry ACLs. So now actually you have to maintain two types of access control lists, or two kinds of policies, for the same data set, because you didn’t know the structure. You can start to see how complicated this can get at scale, very, very quickly.”
With the new software, Okera provides a single pane of glass to manage access to those CSV files and the various access paths that analysts and data scientists will use to get them. “So instead of managing two different kinds of policies, you manage only one kind of policy on one single system,” he says.
Okera also bolstered its support for Amazon’s S3 file system, which is increasingly the data storage repository of choice for organizations building huge data lakes in AWS. Amazon lets users manage access to S3 buckets using Identity and Access Management (IAM) configuration files. However, managing those policies is “massively painful,” Khurana says.
“You actually have to go into S3 configuration bucket and write JSON, which is super complicated,” he says. “When you want to change those policies, you have to change that JSON. This becomes a data engineering nightmare for people. And you also lose all visibility. You don’t know what happened — who gave what access, who got what access.”
And AWS also enforces a limit on how many IAM policies you can have for any given S3 bucket. “So if you have more data sets that you need to manage than your limit,” Khurana continues, “you’re just [shoot] out of luck.”
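To give a sense of the hand-written JSON Khurana describes, the sketch below builds a small S3 bucket policy document in Python. The bucket name, account ID, role names, and prefixes are invented for illustration; only the general policy layout (`Version`, `Statement`, `Effect`, `Principal`, `Action`, `Resource`) follows the standard AWS policy format.

```python
import json

def bucket_policy(bucket, grants):
    """Build an S3 bucket policy granting read access to one prefix per role.

    grants: list of (role_arn, prefix) pairs. All values used here are made up.
    """
    statements = []
    for i, (role_arn, prefix) in enumerate(grants):
        statements.append({
            "Sid": f"ReadPrefix{i}",
            "Effect": "Allow",
            "Principal": {"AWS": role_arn},
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
        })
    return {"Version": "2012-10-17", "Statement": statements}

# Hypothetical roles and dataset prefixes.
GRANTS = [
    ("arn:aws:iam::123456789012:role/analyst", "sales"),
    ("arn:aws:iam::123456789012:role/data-scientist", "clickstream"),
]

policy = bucket_policy("example-data-lake", GRANTS)
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Every additional dataset or role adds another statement to the same document, which is one way to see how such policies become hard to maintain and eventually run into size limits.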
Organizations are really struggling to manage their big data sets in accordance with emerging regulations, Khurana says. For each data use case, GDPR requires organizations to collect consent from individual users, with hefty fines for each violation. That has forced organizations to get creative with their data management.
One way they do that is by breaking up large datasets into lots of smaller files, each with its own fine-grained access controls. But administrators tasked with managing all those individual data sets soon hit the ceiling of what they can manage. Other organizations take a similar approach, duplicating source datasets and applying different access controls to each copy to adhere to specific regulatory requirements. But when data sets start getting into the tens or hundreds of terabytes, that soon becomes prohibitively expensive.
Okera thinks that it has struck upon the right approach, by essentially empowering file and object systems with database-like qualities for fine-grained access control, but without giving up the richness, scalability, and diversity of data that file and object systems bring.
“Machine learning workloads don’t run on databases, but you still need the same kind of controls over them,” Khurana says. “And you don’t want a separate system for machine learning workload access management, BI access management, and data lake access management. So it’s the unification that becomes very, very important.”
In addition to access control, ODAP provides real-time tokenization, redaction, and row-level filtering. The software supports a range of tools, from Amazon EMR and SageMaker, to Hadoop tools like Hive, Presto, and Spark, as well as BI tools like Tableau, Birst, and Qlik.
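For readers unfamiliar with those three terms, here is a minimal sketch of what tokenization, redaction, and row-level filtering can look like when applied as data is read. It is a generic illustration, not ODAP's mechanism; the key, field names, and `apply_policy` helper are assumptions.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key"  # hypothetical key; a real system would manage this securely

def tokenize(value):
    """Replace a value with a stable, non-reversible token (keyed HMAC-SHA256)."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def redact(value):
    """Mask a value entirely while preserving its length."""
    return "*" * len(value)

def apply_policy(rows, allowed_country):
    """Drop rows outside the caller's country, then tokenize and redact fields."""
    result = []
    for row in rows:
        if row["country"] != allowed_country:
            continue  # row-level filtering
        result.append({
            "user_id": tokenize(row["user_id"]),  # tokenization
            "country": row["country"],
            "email": redact(row["email"]),        # redaction
        })
    return result

# Made-up records for illustration.
ROWS = [
    {"user_id": "1", "country": "DE", "email": "a@example.com"},
    {"user_id": "2", "country": "US", "email": "b@example.com"},
]

print(apply_policy(ROWS, "DE"))
```

Because the token is deterministic for a given key, the same user maps to the same token across queries, so joins and counts still work without exposing the raw identifier.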