Hiding Hadoop: How Data Fabrics Mask Complexity and Empower Users
“The early bird gets the worm,” the old saying goes. What they don’t tell you is sometimes the early bird gets eaten. In the big data world, the rapid evolution of technology is providing maneuvering room for second-movers, and one place where this advantage is becoming evident is in the emerging area of data virtualization and data fabrics.
Consider the story of Vizient, the Irving, Texas-based group purchasing organization (GPO) created in 2015 through the consolidation of VHA Inc., University HealthSystem Consortium, and Novation. In addition to helping members negotiate better deals on the back of a combined $100 billion in purchasing power, the company also provides an array of analytical services to its clients, which include nine of the top 10 hospitals in the United States (as defined by U.S. News and World Report).
The job of navigating Vizient through the complex intersection of big data technology and business services falls to Chuck DeVries, the company’s vice president of enterprise architecture and strategic technology.
In a recent interview with Datanami, DeVries explained how data virtualization and data fabrics have become critical to simplifying Vizient’s big data strategy for end users. It starts with the IT.
“We’ve got a lot of everything,” DeVries says. “Microsoft stack, Linux stack. We’ve got Oracle and SQL Server. We’ve got big data stuff, Hadoop. If there’s a provider of analytics, chances are we’ve got some of those.”
While Vizient has its share of enterprise data warehouses with names like Teradata, Vertica, and Exadata on the side of them, the company has only recently begun exploring “modern” data architectures, such as distributed, scale-out systems running Hadoop. The company’s next-gen data architecture is based on a Hadoop distribution from Hortonworks.
“We’re a little bit lucky in that we were a little bit of a laggard to the big data or Hadoop space,” DeVries says. “We didn’t fall trap to the ‘Fill everything into the soup bucket’ in and then expect Hadoop to magically spit out the data that we want. We’ve been able to be a little bit more deliberate about putting in what matters and then starting with some of those clean things.”
Vizient has only been using Hadoop for about two years, but it’s already recorded some successes with those early Hadoop projects, and users are asking for more (which is a good problem to have). DeVries credits that early success to having solid processes and procedures around the big data strategy.
“Primarily it’s a process, and it’s an understanding of the people who are involved,” he says. “What we try to do whenever we’re putting anything into the [Hadoop] repository is we know what the uses are going to be, or at least we know what the immediate uses are going to be. And then we track that with ownership as…so we can maintain it and clean it up, if you will.”
Vizient uses its Hortonworks Data Platform (HDP) cluster as a landing zone for structured and semi-structured data as it flows into the company, through its various analytical databases, and out to power users equipped with Tableau and Excel. Early on, the company realized the importance of getting metadata right, and it uses open source tools in HDP, such as Apache Atlas to track data lineage and Apache Ranger to enforce access policies.
Keeping a strong, consistent view of data as it flows across the different systems is critical for Vizient. In this respect, the company is a strong believer in the power of data federation, and an adopter of the data fabric concept. It uses data virtualization software from Denodo to connect the dots between these disparate data sources, and software from Paxata to ensure that the data is clean and consistent.
“With the advent of systems like Hadoop and the advent of things like SOA…it becomes a lot easier to be able to federate that but still come up with a consistent view,” DeVries says. “We know we have different users that have particular skill sets. So I’ve got one group that’s absolutely stellar in supply and another that’s absolutely stellar in clinical. We want to be able to expose data without having to mash everybody together into a single view.”
One Version of the Truth
With so many different systems, businesses, and users involved, providing a single consistent view of the data is absolutely critical, DeVries says.
“We want to make sure we’re always giving the same answer to our customers,” he says. “One of the worst things that could possibly happen is somebody calls and talks to one person in the organization and they get one answer about their opportunity for savings, and they call somebody else and get a different answer.”
Vizient relies on Denodo to attach to the different information stores, pull the relevant data out of them, and present the data in a consistent and unified format that the company’s stakeholders can rely on, DeVries says. “You can either let people fight over which data source is authoritative or you help them guide to the right one,” he says.
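The idea of pulling from different stores and presenting one consistent format can be illustrated with a minimal Python sketch. This is not Denodo's API or Vizient's implementation; the two in-memory "stores," their column names, and the `SavingsRecord` type are all hypothetical stand-ins chosen only to show how per-source adapters can normalize differing schemas into a single view.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SavingsRecord:
    """The unified shape every consumer sees, whatever the source."""
    member_id: str
    savings_usd: float


# Hypothetical stand-ins for two stores with different schemas and units.
WAREHOUSE_ROWS = [{"MEMBER_ID": "H-001", "SAVINGS_USD": 1200.0}]
HADOOP_ROWS = [{"member": "H-002", "savings_cents": 45050}]


def from_warehouse(row: dict) -> SavingsRecord:
    # The warehouse already stores dollars; only the field names differ.
    return SavingsRecord(row["MEMBER_ID"], float(row["SAVINGS_USD"]))


def from_hadoop(row: dict) -> SavingsRecord:
    # The Hadoop-side rows store cents, so convert to dollars here.
    return SavingsRecord(row["member"], row["savings_cents"] / 100)


def unified_savings_view() -> List[SavingsRecord]:
    """Federate both sources into one consistently shaped, sorted list."""
    records = [from_warehouse(r) for r in WAREHOUSE_ROWS]
    records += [from_hadoop(r) for r in HADOOP_ROWS]
    return sorted(records, key=lambda r: r.member_id)
```

Because every consumer reads from `unified_savings_view()` rather than from a source directly, two people asking the same question get the same answer.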
Most of Vizient’s users are non-technical folks, at least as far as big data is concerned. They’re experts in using Tableau or Excel to present the information they need in the format that’s important to them.
“We use Denodo to effectively build data REST-based data services, and then we expose those out,” DeVries says. “The beauty of the fabric is that not everybody needs to know all that [Hadoop] stuff. They don’t necessarily need to know about Hive or Pig or any of the access pieces. We’re able to abstract that out to be able to say, here’s the data source that you want.”
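To make the abstraction concrete, here is a rough Python sketch of a data service that hides the backend behind a named dataset. Everything in it is an assumption for illustration: the `/datasets/...` path scheme, the `DATASETS` catalog, the sample rows, and the use of an in-memory SQLite database as a stand-in for Hive or a warehouse. It does not reflect how Denodo actually publishes REST services.

```python
import json
import sqlite3


def build_backend() -> sqlite3.Connection:
    """Create an in-memory database standing in for the real backend."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staffing (unit TEXT, nurses INTEGER)")
    conn.executemany(
        "INSERT INTO staffing VALUES (?, ?)", [("ICU", 12), ("ER", 9)]
    )
    return conn


# Hypothetical catalog: logical dataset name -> backend query.
# Consumers only ever see the name on the left.
DATASETS = {
    "nurse_staffing": "SELECT unit, nurses FROM staffing ORDER BY unit",
}


def handle_request(conn: sqlite3.Connection, path: str) -> str:
    """Resolve a path like /datasets/<name> to a JSON response body."""
    name = path.rstrip("/").split("/")[-1]
    query = DATASETS.get(name)
    if query is None:
        return json.dumps({"error": f"unknown dataset: {name}"})
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    rows = [dict(zip(columns, row)) for row in cursor.fetchall()]
    return json.dumps({"dataset": name, "rows": rows})
```

A Tableau or Excel user would ask for `nurse_staffing` and receive rows. The query language, table layout, and storage engine behind that name could change without the consumer noticing.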
Success Through Simplification
Vizient hasn’t completely avoided all pitfalls with its data fabric rollout, and DeVries has learned some lessons. “It’s never as simple as it sounds,” he says. “A lot of tools coming out now are really, really powerful and do a lot of things for you, but it’s not always the right thing.”
When the queries to Vizient’s Salesforce database started to pile up, the company quickly realized it needed to modify default configuration settings associated with Denodo’s polling of the cloud data store. But these are tuning issues that are to be expected when adopting new architectures.
By doing the hard part of mapping out a clear data access strategy up front and then adhering to that strategy, Vizient has been able to stand on the shoulders of giants and leverage big data without the risks faced by those who came before. By hiding Hadoop behind the curtain, Vizient’s data fabric empowers users to concentrate on their jobs.
“We have all kinds of machine learning magic happening behind the scenes,” DeVries says. “But when you’re dealing with a consultant who’s working with a member, they really know their particular domain, like nurse staffing. They’re never going to be a Hadoop or Hive expert, and we don’t want them to be. You want them to be able to tap the data that they need and to be able to move that forward.”
“That doesn’t mean that we’ve never fallen down into that swamp approach,” he adds. “But we’ve been able to keep it a little cleaner.”