February 1, 2022

Big Data Analytics: Top Three Data Security Mistakes

Nong Li


Properly securing data lakes and complying with privacy regulations are common Fortune 500 board-level concerns and for good reason. Simply put, businesses run on data. However, most organizations are still struggling to use data responsibly at the scale and velocity required to innovate today. I spoke recently with a C-suite technologist in the financial services industry who flat out said, “data lakes scare me.”

Data owners must keep pace with the influx of personally identifiable information (PII) and confidential data stored within murky data lakes, along with the proliferation of privacy regulations that govern them. Data access control for regulatory compliance has become extraordinarily fine-grained: moderately privileged data consumers might see only the last four digits of a Social Security number, or a randomized string of digits in place of a real phone number, while most users are denied access entirely and a few privileged users work with full-fidelity data in the clear. Organizations also need to properly execute “right to be forgotten” Data Subject Access Requests (DSARs) under the General Data Protection Regulation (GDPR).
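To make the fine-grained rules above concrete, here is a minimal sketch of role-based masking. The function names, role labels, and masking formats are invented for illustration; real platforms enforce these policies dynamically at query time rather than in application code.

```python
import random

def mask_ssn(ssn: str, role: str) -> str:
    """Return the view of an SSN appropriate to the caller's role (roles are illustrative)."""
    if role == "privileged":            # full-fidelity data in the clear
        return ssn
    if role == "analyst":               # show only the last four digits
        return "***-**-" + ssn[-4:]
    raise PermissionError("access denied")  # everyone else sees nothing

def tokenize_phone(phone: str, role: str) -> str:
    """Replace a real phone number with a randomized string of digits for non-privileged roles."""
    if role == "privileged":
        return phone
    if role == "analyst":
        return "".join(random.choice("0123456789") for _ in range(10))
    raise PermissionError("access denied")
```

The same record thus yields three different answers depending on who asks, which is exactly what makes manual approaches to this problem so hard to manage at scale.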

Fortunately, big data governance technologies are maturing, thanks to hard lessons learned by some of the world’s largest financial institutions and globally recognized brands. These companies handle PII and other sensitive data on a daunting and awe-inspiring scale.

Here in part one, we discuss lessons learned from the three most common “tried and failed” approaches to implementing fine-grained access control. In part two, we will analyze where large enterprises find that “sweet spot” where big data can be used responsibly.

“Secure Copies” Are Not What You Think They Are

When data engineers make “secure copies,” they retain two or more curated versions of a dataset: one with sensitive data in full fidelity for privileged users, plus one or more additional copies with values redacted (tokenized, masked, filtered, etc.) for specific personas and use cases. This method is common but extremely difficult to manage, and it becomes exponentially more error-prone as you scale. Even with cheap cloud storage, it is also expensive.

Consider this: a globally recognized brand that tracks personal information to help consumers reach their fitness goals saved millions of dollars in cloud data storage fees by enforcing data authorization dynamically on a single-source-of-truth dataset.

Making “secure copies” increases the size of your attack surface

Companies in the “secure copies” phase also struggle with constantly evolving security and regulatory requirements. As data access requirements change, engineers have to redo work, and, all too often, teams with limited resources simply abandon older copies. Your data attack surface has now multiplied, as have your storage fees.

Instead of consistency, businesses end up with an administrative nightmare. Frustrated data scientists and analysts can’t get data in a timely manner, and the risk of data breaches and regulatory non-compliance rises. Provisioning and managing two or more variations of a large dataset is expensive, time-consuming, risky, and ultimately unmanageable.

Striving for Compliance with Database “Views”

Defining policies as “views” is an unfamiliar approach to business leaders, which is a problem in itself. Behind dashboards and reports, one or more logical views are defined on top of database tables or other logical views, filtering data to meet data privacy and security requirements. In this context, views are an improvement over secure copies in that there is only one version of the data to manage (and pay to store).
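A minimal sketch of what such a logical view does, using an invented customers table. In a real database this would be a CREATE VIEW statement; here the same idea is shown as a saved query that projects a subset of columns and masks the sensitive one.

```python
# Invented base table -- in practice this lives in the database, not in code.
customers = [
    {"name": "Ada", "ssn": "123-45-6789", "phone": "5550001111"},
    {"name": "Bob", "ssn": "987-65-4321", "phone": "5552223333"},
]

def customers_analyst_view(rows):
    """Logical view for analysts: drop the phone column and mask the SSN,
    roughly CREATE VIEW customers_analyst AS SELECT name, mask(ssn) ..."""
    return [{"name": r["name"], "ssn": "***-**-" + r["ssn"][-4:]} for r in rows]
```

Because the view is just a query over the one base table, there is a single copy of the data to store and protect, which is the improvement over secure copies described above.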

However, when using logical database views for data security, businesses and auditors are challenged to understand how policies are defined, so enforcing and demonstrating compliance is difficult. A 451 Research survey report, Voice of the Enterprise: AI and Machine Learning Infrastructure, cited regulatory reporting or documentation as the number one regulatory challenge. But is your compliance team going to ask your data team to document every database view? It’s unlikely.

Strive to minimize the number of views into data

The most common problem data teams face is known as “view explosion.” As with secure copies, a separate view must be defined for each combination of who gets to see the data and in what format, so views multiply to the point of unmanageability.
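The arithmetic behind view explosion is multiplicative, not additive. The numbers below are hypothetical, but they show how quickly per-combination views pile up:

```python
# Back-of-the-envelope illustration of "view explosion": if every
# combination of table, persona, and masking format needs its own view,
# the totals multiply. All counts here are hypothetical.
tables, personas, mask_formats = 200, 5, 3
views_needed = tables * personas * mask_formats
print(views_needed)  # 3,000 view definitions to write, test, and audit
```

Each of those definitions must also be re-reviewed whenever a regulation or access requirement changes, which is where the administrative burden compounds.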

Views have a place in database management, but they definitely were not designed for data security and privacy use cases. As many people work from home at least part of the time, personal computers, tablets, and smartphones should be denied access to sensitive data. There are also situations where you don’t want data accessed outside of specific work hours. Database views are not robust enough to pick up real-time context such as device, time of day, location, etc.
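The kind of real-time context a static database view cannot evaluate can be sketched as a request-time check. This is illustrative only, not any product's API; the device classes and business-hours window are assumptions:

```python
from datetime import time

# Assumed set of managed (corporate-controlled) device classes.
MANAGED_DEVICES = {"corporate-laptop", "corporate-vdi"}

def context_allows(device: str, now: time) -> bool:
    """Allow sensitive-data access only from a managed device during
    business hours (08:00-18:00 here; the window is an assumption).
    A view is evaluated inside the database and never sees this context."""
    return device in MANAGED_DEVICES and time(8, 0) <= now <= time(18, 0)
```

A dynamic authorization layer evaluates context like this on every request; a view, defined once inside the database, cannot.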

Another problem with database views is you have to implement the same set of (exploding) policies across all your analytics tools. This fragmented approach is inherently inconsistent from the start, and the cost/benefit analysis simply does not compute.

Build vs Buy: Extending Apache Ranger (Open Source)

Apache Ranger is a well-known open source solution for fine-grained access control. By Internet standards, it’s old, and comes from the fading Hadoop world — where data was big, but the number of data lake consumers was small.

Ranger paved the way for modern data access control, but it is not suitable for today’s cloud-first and hybrid enterprises. Any data team with the initiative to extend Ranger should be applauded for their ambition. But unless universal, dynamic data authorization is one of your organization’s core competencies, it’s almost guaranteed to fail.

Ranger’s policy enforcement approach is tightly bound to the individual Hadoop systems for which it was designed. Building enterprise-grade software necessary to enforce policies on modern cloud data platforms like Snowflake, Amazon Redshift, or Azure Synapse is complex. The Ranger approach is limited by the data platform, weakening the ability to define and enforce rich, complete data access policies.

Ranger’s support for attribute-based access control (ABAC) is an underdeveloped “checkbox” that doesn’t scale to meet real-world challenges. Finally, Ranger requires you to define policies for each data platform, resulting in hundreds of redundant policies whereas more modern solutions might require only a dozen. Feeding policies with near real-time data attributes is fundamental to scaling data access control, which we’ll cover in more depth in part two.
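To show why attribute-driven policies scale where per-platform rules do not, here is a hedged sketch of an ABAC decision. The attribute names are invented; the point is that one policy, expressed over user and resource attributes, replaces many rules written per user, per dataset, per platform:

```python
def abac_allows(user: dict, resource: dict) -> bool:
    """Single ABAC policy (attribute names are illustrative): grant access
    when the user's department matches the data's domain and their
    clearance level covers the resource's sensitivity."""
    return (user["department"] == resource["domain"]
            and user["clearance"] >= resource["sensitivity"])
```

Because the decision reads attributes at request time, changing one attribute (say, revoking a user’s clearance) changes access everywhere at once, with no need to rewrite hundreds of platform-specific policies.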


Two common themes run through these three mistakes: fragmentation and complexity. Technology silos create barriers for the business, and complexity is the enemy of security. Modern, scalable data authorization is needed to help people access, anonymize, and even remove sensitive and personal data.

In the next part of our series, we’ll describe how enterprises have found the “sweet spot” where big data can be used responsibly.

About the author: Nong Li is the co-founder and CTO of Okera. Prior to co-founding Okera in 2016, he led performance engineering for Spark core and SparkSQL at Databricks. Before Databricks, he served as the tech lead for the Impala project at Cloudera. Nong is also one of the original authors of the Apache Parquet project. He has a bachelor’s in computer science from Brown University.

Related Items:

Can Apple Right its Privacy and Security Cart?

Security, Privacy, and Governance at the Data Crossroads in ‘22

ML for Security Is Dead. Long Live ML for Security