July 5, 2019

Data Privacy and Smart Streaming Discovery: Getting Past the Furor

Rohit Mahajan


There’s a lot of furor worldwide over data privacy regulations, whether they are already in place, about to take effect, or still being debated. And just because there are rules like the General Data Protection Regulation (GDPR), it doesn’t mean everyone is following them or is compliant.

So, while there may be bewilderment about the rules and what they can and should accomplish, one point should be made very clear: the rules will ultimately be sorted out, and companies should begin their efforts to comply with the spirit of the law, as opposed to its mere letter. Doing that will require them to deploy advanced data discovery efforts to figure out exactly what information they have under their control and where it resides.

But it really goes far beyond that. Data privacy is at the heart of all of these rules, whether it’s GDPR, California’s forthcoming Consumer Privacy Act, the Washington Privacy Act or others. They all focus on Personally Identifiable Information (PII), which the U.S. government says “refers to information that can be used to distinguish or trace an individual’s identity, either alone or when combined with other personal or identifying information that is linked or linkable to a specific individual.” Your name, address, Social Security number, phone number, bank account details: all of it is PII.


It’s in that context that companies must realize that PII flowing into their systems immediately becomes part of their privacy efforts, and that they are accountable for tracking it regardless of whether the data is at rest or in motion. The instant they have PII, they are responsible for making sure it’s not accessed or used in a manner that violates the law, whether civil or criminal.

The sheer amount of data, meanwhile, continues to escalate: IDC’s annual report on the growth in data estimates that worldwide data will grow to 175 zettabytes (ZB) by 2025. The data is coming from virtually every source imaginable, but Internet of Things (IoT) devices are poised to drive the largest share of that growth; the IDC study estimates that 90ZB of data, or more than half of the total, will be created on IoT devices by 2025.

IDC’s analysts go on to estimate that by 2025, “each connected person will have at least one data interaction every 18 seconds. Many of these interactions are because of the billions of IoT devices connected across the globe.” That data, in motion, will be flowing across and into corporate data lakes and warehouses literally every second of every day, which presents companies with both a challenge and an opportunity.

The challenge, as mentioned above, is ensuring that data — much of it PII — is accounted for from the moment it enters the corporate ecosystem. However, companies that can manage that data in all its forms will have the opportunity to gain and maintain the confidence of their customers and prospects, who see these firms as trustworthy guardians. This has the potential to deliver positive benefits in terms of reputation, cost, streamlining processes and more.


That is why many companies and their Chief Data Officers are already reviewing the potential for “Smart Streaming Discovery” in their organizations. They know that GDPR and other emerging laws and regulations will become increasingly consistent and specific in their compliance requirements. Artificial (or augmented) intelligence and machine learning will be required to track data throughout petabytes of information located across multiple sites. And the ability to detect the data, whether it’s at rest or in motion, will become paramount. Put another way, you can’t search what you don’t know you have.

Leveraging advanced technology and machine learning algorithms, Smart Streaming Discovery enables organizations to think and act in real time. It analyzes data streams as they arrive and extracts insights from them, discovering data “in motion” rather than only stored data “at rest.” From the point of data ingestion, organizations should be able to quickly and automatically detect PII and other sensitive streaming data in structured, semi-structured and some unstructured formats. Using deep learning techniques to automatically tag data as sensitive and flag it before it lands in data stores lets organizations manage PII and sensitive data proactively. That frees Subject Matter Experts (SMEs) to focus on remediation activities and moves the organization toward automated data governance.
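
To make the idea of tagging data “in motion” concrete, here is a minimal Python sketch of flagging records before they reach a data store. It is an illustration only, not how any particular product works: simple regular expressions stand in for the deep learning classifier described above, and every name in it (PII_PATTERNS, detect_pii, tag_stream, the _pii and _sensitive fields) is hypothetical.

```python
import re
from typing import Dict, Iterable, Iterator, List

# Hypothetical rule set. Plain regexes stand in for a trained model here;
# a real deployment would need far broader and more robust detection.
PII_PATTERNS: Dict[str, re.Pattern] = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def detect_pii(value: str) -> List[str]:
    """Return the sorted names of all PII patterns found in one field value."""
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(value))


def tag_stream(records: Iterable[dict]) -> Iterator[dict]:
    """Tag each record as it flows past, before anything is written to storage."""
    for record in records:
        findings = {}
        for field, value in record.items():
            if isinstance(value, str):
                kinds = detect_pii(value)
                if kinds:
                    findings[field] = kinds
        # Yield a tagged copy; the incoming record is left unmodified.
        yield {**record, "_pii": findings, "_sensitive": bool(findings)}


if __name__ == "__main__":
    # Made-up sample events standing in for a live feed.
    events = [
        {"id": 1, "note": "call 555-867-5309"},
        {"id": 2, "note": "hello"},
    ]
    for tagged in tag_stream(events):
        print(tagged)
```

Because tag_stream is a generator, it processes one record at a time and never needs the whole data set in memory, which is the property that matters for data in motion. A downstream step could then route anything marked _sensitive to masking or review before it lands in a lake or warehouse.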

While the requirements of the privacy laws are not yet standardized, they will be. They must be, for consumers to gain a uniform level of trust. As such, the technology that is making compliance a reality must look ahead to what’s next. It is already no longer good enough to think only of discovering, managing and understanding data at rest. Smart Streaming Discovery will be needed to help everyone address this issue, and to deliver solid business benefits in the process.

About the author: Rohit Mahajan is the CTO/CPO of Io-Tahoe, a provider of machine learning-powered data discovery solutions. Rohit is an ex-Wall Street executive turned entrepreneur who is passionate about developing disruptive technology for data discovery using machine learning. He is an experienced technologist with a proven track record of implementing global solutions at financial institutions for DevOps, testing, security and data center transformation. In his 20-year technology career, Rohit has held a number of senior roles at Dun and Bradstreet, Morgan Stanley, and Deutsche Bank.

Related Items:

Global DataSphere to Hit 175 Zettabytes by 2025, IDC Says

ML Powers Discovery In GE’s 500 PB Lake

The Wild West and Last Frontier of Big Data