Roadmap to Distributed Data Stewardship
Given the vast amount of data that large enterprises have, it’s virtually impossible for a single department to know what and how much data exists organization-wide, let alone understand and manage that data. This means that reliably and rapidly managing access to that data for thousands of employees would be quite a strenuous undertaking. Why, then, are organizations still trying to manage access to their data through their centralized IT teams?
In order to provide comprehensive data access for users whilst ensuring proper governance and regulatory compliance, enterprises must shift toward a distributed data stewardship model. This is an organizational framework for delegating data management responsibilities throughout the enterprise.
The goal of a distributed data stewardship model is to allow teams that are closest to the data to manage access and permissions while eliminating the bottleneck that currently exists with centralized IT. However, given how much data infrastructure, management and governance have evolved over the years, this is not going to be an easy or quick task, and will need to take place in stages.
The Evolving Database
Corporate databases came into existence as a means of storing application data without having to embed storage into each application. They soon evolved into a storage solution that anyone in the company could use to access and analyze any dataset of value: transactions, customer records, sales metrics, employee records, etc.
When business intelligence (BI) tools first came on the scene in the 1990s, enterprises generally standardized on a single BI tool and only one or two databases; data volumes were low and the degree of control was high. Access management was reasonably centralized at the BI and/or database level, generally leveraging table-level, role-based access control (RBAC) policies.
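That table-level RBAC model fits in a few lines of code. Here is a minimal illustrative sketch, not any particular product's implementation; all role, user, and table names are hypothetical:

```python
# Minimal sketch of table-level, role-based access control (RBAC)
# as early BI/database deployments applied it. Roles are granted
# whole tables, and users inherit access through their roles.

ROLE_GRANTS = {
    "analyst": {"sales_metrics", "customer_records"},
    "hr_admin": {"employee_records"},
}

USER_ROLES = {
    "alice": {"analyst"},
    "bob": {"analyst", "hr_admin"},
}

def can_read(user: str, table: str) -> bool:
    """A user may read a table if any of their roles is granted it."""
    return any(table in ROLE_GRANTS.get(role, set())
               for role in USER_ROLES.get(user, set()))
```

The simplicity is the point: with one database and a handful of roles, a central team could reason about every grant. It is the explosion of users, tables, and regulations described below that breaks this model.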
Then came the decade of big data and the heyday of Hadoop, during which all the traditional ideas about what a database was and how it should be managed were set aside. “Just pour as much data as possible into the data lake and let everyone access it” became the new reigning philosophy.
However, the nature of big data meant that the data became far more disparate and distributed, and enterprises faced an explosion in the types of storage. This led to a fixation on the technology response to the “Three Vs” of big data: soaring volume, variety and velocity. For many reasons, data security became something of an afterthought.
Coming Full Circle
Organizations today face increasing business pressures – a competitive requirement to become more efficient and data-driven, balanced against the risk of brand damage from a massive data breach and the billion-dollar fines handed out to enforce an ever-evolving list of data privacy regulations. This has led to a desire to reestablish the feel of the well-governed environment of databases past.
This control must include granular permissions that can be easily granted or revoked, but without sacrificing the agility, productivity, and innovation that big data offers. The challenge is that we went from a world in which a few people needed access to limited amounts of mostly unregulated data to one in which perhaps tens of thousands of employees – and customers and partners – need access to massive amounts of highly regulated data.
Most organizations still want to keep access management centralized in IT. Only IT, they argue, has the tools and know-how for data management and auditing (because historically, they were the only ones doing it).
However, with so much data to manage, IT has become an access bottleneck that frustrates data consumers. And since IT lacks the insight and context into the data that the lines of business (LOBs) actually have, the overall quality of data governance also suffers.
Despite craving centralized control, many organizations have less control than ever over who has access to what kinds of data – and the threat of regulatory violations continues to increase. To not only solve the problem, but to also truly scale access control throughout the organization, a distributed data stewardship model is required.
A Roadmap for Distributed Data Stewardship
In an ideal distributed data stewardship model, the LOBs that are closest to the data are responsible for “ownership” of that data and hands-on access management, while governance standards are still consistently enforced across the organization with full visibility. However, getting to this point will require changes both in technology and organizational mindset, and will likely take years to achieve.
- Phase 1: Centralized IT, centralized governance, distributed data access management to the LOB for everyday user access
- The centralized IT team, typically led by a CDO (Chief Data Officer) or CIO (Chief Information Officer), will be responsible for activating the necessary technology changes to enable LOB control of access management. A centralized governance team will continue to be responsible for making organization-wide governance decisions and providing the top-down governance controls. IT will need to work with this team to actually enforce these governance standards and provide compliance auditing. Consequently, any technology changes here should be evaluated on how well they connect IT, the distributed data teams, and the centralized governance team.
- Phase 2: Centralized IT, decentralized governance, distributed data access management
- Decentralized governance teams who are specialists in their areas of regulation (HR, finance, etc.) work with the data stewards to manage governance for their LOBs. This will require new data access management technology that enables domain-based access federation across the organization.
- Phase 3: Decentralized IT, decentralized governance, distributed data access management
- A longer-term vision is the distribution of everything. Many companies are already unintentionally moving toward decentralized IT through the adoption of shadow IT. However, if enterprises embrace the benefits of decentralized governance and data access management, the needs of the LOBs can be met in a fully transparent way, with any necessary support from centralized IT. In this phase, consistent “centralized” visibility remains critical, replacing the need for centralized governance teams. Trusted automated access management technologies will be a key enabler at this stage.
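The domain-based access federation that Phases 2 and 3 depend on can be sketched abstractly: a central governance baseline sets organization-wide defaults, while each domain's stewards own the decisions within their domain. This is a hypothetical illustration, not a description of any specific product; the domains and data classes are invented:

```python
# Sketch of federated access policy resolution: a central baseline
# applies everywhere by default, and each domain's stewards may
# define their own policy for data classes within their domain.

CENTRAL_BASELINE = {"pii": False, "aggregates": True}  # org-wide defaults

DOMAIN_POLICIES = {
    "finance": {"aggregates": True, "pii": False},
    "hr": {"pii": True},  # HR stewards permit PII access within HR
}

def allowed(domain: str, data_class: str) -> bool:
    """The domain's policy decides where defined; otherwise fall back
    to the central baseline (deny if neither says anything)."""
    policy = DOMAIN_POLICIES.get(domain, {})
    return policy.get(data_class, CENTRAL_BASELINE.get(data_class, False))
```

The key property is that every decision remains centrally visible and auditable even though it is made locally, which is what preserves consistent governance without a central bottleneck.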
At a high level, what’s most important – and how organizations can measure their success as they move through these phases – is that centralized IT becomes less of a bottleneck and the LOBs are able to meet the data access needs of their teams more quickly, without compromising governance and compliance requirements.
Eventually, as more organizations reach a mature state of distributed data stewardship internally, they will be better able to share data externally (and therefore monetize it) in a well-governed and compliant way – for example, by sharing data with customers, partners and other organizations through “data marketplaces.”
What We’ll See In 2021
Over the next year, we’ll enter Phase 1, with CDOs and CIOs distributing some control over access management to the LOBs. Getting there will require a culture shift, and CDOs and CIOs must sell this internally to peers who fear giving up control and authority while still being held responsible if something goes wrong. Over time, they will see how distributing control actually increases agility and shortens time-to-value on their digital transformation initiatives.
Key distributed data stewardship capabilities we’re likely to see in support of this include:
- The ability to manage the data catalog and access permissions by business domain, so that data stewards can have full ownership and responsibility of the entire data lifecycle for their domain within the larger organization. This will likely require abstractions over metadata to enable data management by business domain, as opposed to just the technical structure of the source.
- Standardized policy templates based on corporate-wide data access policies set by the CISO and compliance officers for a variety of data access management tasks, such as persona-based access (e.g. what tasks should a data engineer be able to perform inside a domain versus a data analyst in the same domain). Easy access to these templates by the LOBs will allow distributed data stewardship to scale without requiring the distributed teams to understand all the underlying requirements. Changes to the centralized templates will be propagated across the organization as necessary.
- Real-time reporting insights into data access activity whether by domain or across the entire organization. This could include: who has access to sensitive data like social security numbers, how they are using them, who has access but doesn’t need it, etc. With the decentralization of control, this visibility becomes critically important for governance, and can also be used to power alerts.
- In the longer term, data lineage insights across the entire data lifecycle will become critical; not just in terms of the “hops” data has taken through the data pipeline, but also visualizing how data is traveling through an organization between domains, and clearly understanding when data entered the organization, how it was shared, and how it was removed.
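Two of the capabilities above, persona-based policy templates and access-activity reporting, can be sketched together. This is a simplified illustration under assumed data shapes; the personas, domains, datasets, and event records are all hypothetical:

```python
# Sketch of (1) centrally defined persona-based policy templates that
# domains stamp out locally, and (2) a simple access-activity report
# flagging users who hold access to sensitive data but never use it.

PERSONA_TEMPLATES = {
    "data_engineer": {"read", "write", "create_table"},
    "data_analyst": {"read"},
}

def instantiate(domain: str, persona: str) -> dict:
    """Stamp a central template out for one domain, so changes to the
    template can later be propagated to every domain that uses it."""
    return {"domain": domain, "actions": set(PERSONA_TEMPLATES[persona])}

def unused_sensitive_access(grants: dict, events: list) -> set:
    """Users granted access to sensitive datasets with no recorded reads
    - candidates for revocation under least-privilege review."""
    active = {(e["user"], e["dataset"]) for e in events}
    return {user for user, datasets in grants.items()
            for ds in datasets if (user, ds) not in active}
```

For example, if both carol and dan are granted access to a table of social security numbers but only dan has actually queried it, the report surfaces carol as holding access she doesn’t need.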
Distributed data stewardship will have a transformative effect on businesses: it will drive more effective data governance through data stewards who better understand the data, and increase competitive advantage through greater productivity and agility for data teams. 2021 may just be the start, but it will be the start of a very exciting and necessary journey.
About the authors: Lars George is the Principal Solutions Architect at Okera. Previously, Lars was the EMEA chief architect at Cloudera, acting as a liaison between the Cloudera professional services team and customers as well as partners in and around Europe, building the next data-driven solutions, and a co-founding partner of OpenCore, a Hadoop and emerging data technologies advisory firm. He has been involved with Hadoop and HBase since 2007 and became a full HBase committer in 2009. He’s the author of HBase: The Definitive Guide from O’Reilly and the co-author of Architecting Modern Data Platforms.
Roslyn Coutinho is a Senior Product Manager at Okera, where she has spent the last four years working with large Fortune 500 customers in understanding and solving their challenges in scaling secure access to their data. Coming from a design background, she’s motivated by designing an amazing user experience to simplify the complex challenge of policy management at scale.