April 4, 2022

The Modernization of Data Engineering at Capital One


Like many enterprises, Capital One Financial Corp. is in the process of democratizing employee access to data to improve profitability, lower risk, and increase customer satisfaction. But lowering the barriers to data access in a decentralized environment brings substantial challenges, particularly as it relates to data visibility, governance, and control. As Capital One modernized its data systems, it looked to one emerging best practice in particular to help it on its journey.

From its headquarters in McLean, Virginia, Capital One Financial Corp. runs a diversified financial organization spanning consumer credit cards, retail banking, savings accounts, and auto loans. The 28-year-old company, which is known as one of the most technology-focused banks in the U.S., had 2021 revenues of $30.4 billion and is a component of the S&P 500.

Capital One has been at the leading edge when it comes to modernizing its IT landscape. The company is a big user of the public cloud, with Amazon Web Services serving as its primary provider for cloud compute and storage. It also has found success with Snowflake, which hosts a company-wide data warehouse that contains 45 petabytes of data.

While the company has standardized on the cloud, that doesn’t mean that all of its data processes are centralized. In fact, as the volumes of data, data use cases, and data users have gone up, the tools and techniques the company uses to enable data access have become more decentralized, according to Capital One’s SVP Enterprise Data Platforms, Biba Helou.

“We’re modernizing pretty much all of our systems, whether they be transactional or operational–we’ve been working on that for quite a bit and that we made a lot of strides there,” Helou says. “But it’s the leveraging data that comes out of a lot of those systems” that has made a big difference.

Capital One created a single data marketplace to serve 7,000 internal analysts (ramcreations/Shutterstock)

On Snowflake alone, Capital One currently has 7,000 data analysts running millions of queries per day for more than 400 use cases across multiple lines of business, Helou says. The use cases range from training chatbots based on customer interaction data to analyzing log data to ensure the reliability of its data systems, and pretty much everything in between.

It’s Helou’s job to make sure that data analysts have access to the data they need, and that it’s accurate, secured, and well-governed. To achieve this goal, for the last several years Capital One has been building what amounts to an internal data marketplace that enables users to access data they’re cleared to use. It’s one part data catalog, one part automated data pipeline development tool, and one part data governance/quality product, held together with microservices and a healthy dose of data mesh.

“That’s probably our key core effort that we put a lot of time and effort into, is creating a place where all of our users, regardless of persona, can go and find out what’s at their disposal,” Helou says. “Think about it as being able to go to an Amazon and look up what you can shop for. You can go to one place and take a look and see what data is available to you and then what format it’s available to you in, and then have that one-stop shop, to be able to request access for it and to be able to pick it up and use it.”

This is the way that Capital One distributes most of its data to analysts, data scientists, and other internal stakeholders. Whether it’s credit card data from transactional systems, data from internal systems, or data from the company’s core financial system, Capital One’s data consumers can be confident that it will be in this marketplace. “It runs the gamut,” Helou says. “It’s pretty much almost everything.”

Capital One built this data marketplace itself, which enabled it to customize it to its exact needs. During its development, the company explicitly borrowed elements from the data mesh concept to enable disparate teams to manage data themselves, thereby eliminating the need for a top-down approach.

According to Helou, the data mesh approach has been important to the data marketplace’s success.

From chatbot training to log analysis, Capital One’s analysts can find the data they need in the data marketplace (sdecoret/Shutterstock)

“We don’t have a central owner of data that is going to curate all data for you and then vend it out,” she tells Datanami. “So the concept of a data mesh, of the domain owners own their data and own their data products” has been important, she says.

At the same time, processes were put in place to ensure that the quality of data being distributed through the marketplace was up to standards, Helou says.

“For us, it’s all about transparency,” she says. “The shopper for data has this place to find the data that they need. They have the ability to see what kind of data quality rules are being applied to it. They have the ability to know where it’s being produced from, and then they have the option to select the datasets that they’re using based on their needs.”

The data marketplace also provides mechanisms to help the data owners enforce their team’s individual data access rules, which is a core tenet of the data mesh approach.

“It’s all about that place to go look for the data,” Helou says. “It will also tell you about the data so you can get determination on whether it’s good data or the right data for the right purpose. And then all business lines have their own governance steps that are built around their own reporting or their own products that they’re creating… So we are not taking over the individual line of business and creators of product. We’re not taking over their governance processes and their ability to check that they’re producing.”
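The article does not describe how the marketplace is implemented, so the following is only a minimal Python sketch of the ideas Helou outlines: each data product is owned by a domain, its quality rules and source are visible to the "shopper," and the access decision is made by a rule the owning domain supplies rather than by a central team. Every name here (`DataProduct`, `Marketplace`, `request_access`, the example roles and dataset) is hypothetical and chosen for illustration; none of it reflects Capital One's actual system.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DataProduct:
    """A catalog entry published by the domain that owns the data (hypothetical)."""
    name: str
    domain: str                           # owning line of business
    source_system: str                    # where the data is produced
    fmt: str                              # format the data is available in
    access_rule: Callable[[dict], bool]   # rule supplied by the owning domain
    quality_rules: list[str] = field(default_factory=list)


class Marketplace:
    """A single place to discover data products and request access (hypothetical)."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}

    def publish(self, product: DataProduct) -> None:
        # Domain owners publish their own products; nothing is curated centrally.
        self._products[product.name] = product

    def search(self, keyword: str) -> list[DataProduct]:
        # Discovery: match the keyword against product and domain names.
        kw = keyword.lower()
        return [p for p in self._products.values()
                if kw in p.name.lower() or kw in p.domain.lower()]

    def describe(self, name: str) -> dict:
        # Transparency: expose the source, format, and quality rules to the shopper.
        p = self._products[name]
        return {"domain": p.domain, "source_system": p.source_system,
                "format": p.fmt, "quality_rules": list(p.quality_rules)}

    def request_access(self, name: str, user: dict) -> bool:
        # The marketplace only delegates; the owning domain's rule decides.
        return self._products[name].access_rule(user)
```

A usage example under the same assumptions: a card domain publishes a dataset with one quality rule and grants access only to users holding a `card-analyst` role.

```python
market = Marketplace()
market.publish(DataProduct(
    name="card_transactions",
    domain="card",
    source_system="transactional",
    fmt="parquet",
    access_rule=lambda user: "card-analyst" in user.get("roles", []),
    quality_rules=["amount is not null"],
))

print(market.describe("card_transactions"))
print(market.request_access("card_transactions", {"roles": ["card-analyst"]}))
```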

It wasn’t that long ago that simply getting access to data required a lot of things to go right for Capital One. Just knowing where the data was physically located–or knowing somebody who did–was a significant barrier to data enablement, according to Helou.

Capital One is one of the more tech-forward banks (DCStockPhotography/Shutterstock)

“If you look back to the old days, it was either we were creating a curated set of data in your traditional warehousing solutions, just like everybody else had, and you had to know it was there and you went and used it. So everything was centralized,” she says. “The other way was shoulder tapping. We had some cataloging ability, but it wasn’t as broad. It didn’t allow for as much discovery from our data users.”

The data marketplace concept has allowed Capital One’s data consumers to keep up not only with the growth of data volumes and the proliferation of data silos–which is a considerable task to begin with–but also with the wide diversity of data use cases, ranging from basic SQL analytics and streaming analytics to machine learning and AI, and the complex technologies they bring.

“That’s why I call it a data ecosystem,” Helou says, citing uses for frameworks like Apache Spark, Apache Kafka, and more. “You’ve got different performance characteristics that you have to worry about. You have different formats that you have to worry about. So we are building a system that is able to leverage different tech stacks to use that data fit for purpose, essentially.”

Without the data marketplace, who knows where Capital One’s internal users would be now? It’s hard to think that the company would go back to the early days of data warehousing, which couldn’t keep up with the big data of 2012, let alone 2022. As Helou sees it, the data marketplace has been instrumental to Capital One’s data success.

“We have definitely been able to unleash the potential of data,” she says. “Data is growing at a rapid pace everywhere and we are able to process so much more information now and give access to much more information to our users.”

Nothing ever stands still in data, however, and so the work continues for Helou and her team. As more data comes in, latency and performance become bigger issues. “People just want more, faster, more, faster,” she says. “So that’s what we’re continuing to try to work with.”
