Celebrating Data Independence
Every company wants the independence to do what it wishes with its data. That’s one of the first assumptions underlying this whole big data movement. But depending on where and how a business stores its data (such as in proprietary formats, whether on-prem or in the cloud), users may inadvertently limit their data freedom going forward.
Enticed by cheap and abundant storage and the flexibility to scale compute resources as needed, customers are moving exabytes of data from on-prem systems to object storage systems in the cloud. Hadoop, as the preeminent on-premises big data storage system, stands to lose a good chunk of mindshare and market share in the process, while the three major cloud platforms (AWS, Microsoft Azure, and Google Cloud) are capitalizing on the trend and growing very fast.
In addition to diminishing Hadoop’s influence, this great migration of data to the cloud is helping to shake up the analytics market too. Instead of writing their software to run on Hadoop, vendors are now forced to be much more agnostic about where the data lives. That means supporting not just Hadoop, but multiple cloud and hybrid deployments that include cloud and on-premises systems.
“Every Fortune 2000 company has 10 different vendors that have data that they want to access holistically, but they can’t because of the vendors,” says Chris Lynch, the CEO of analytics software vendor AtScale. “You can’t have big data unless you have all the data. That’s the most important asset that any company has.”
A typical bank might house retail data on one vendor’s system and loan origination data on another system, Lynch says. “And that data isn’t easily aggregated to analyze. How archaic is that? And only because it runs on two different vendors’ systems,” he says.
Integrating Data Silos
Two years ago, AtScale was almost entirely focused on speeding analytic queries for data residing in Hadoop. When the great cloud migration started to pick up steam, Lynch pivoted the company to support cloud data warehouses, which is paying dividends for AtScale.
“I know that being a universal semantic layer is the key to our independence and our continued growth,” Lynch says. “Nobody wants vendor lock-in. It’s part of the reason they’re interested in the cloud and this new world order. They don’t want to go from Teradata to Amazon. They want to go from Teradata to Amazon to BigQuery to Microsoft and Snowflake, and have that flexibility. That’s the power of AtScale. We can be friend or foe.”
Hadoop is getting displaced now by object storage in the cloud, but don’t be surprised if the cloud-based object stores are displaced at some point in the near future, according to Haoyuan “HY” Li, the founder and CTO of Alluxio.
“Every five to 10 years there will be another wave of innovation from the storage side that will produce further ease of use than the previous generation. That’s the status quo of the data ecosystem,” Li says. “The reason for that five-to-10 year cycle is due to three dimensions. There is architectural innovation. There is hardware innovation. And there’s workload advancement. That drives the five-to-10 year cycle, and that’s the foundation of my PhD thesis at Berkeley as well.”
Li developed Tachyon as part of that PhD work at UC Berkeley’s AMPLab. Now called Alluxio, the software serves as a neutral data orchestration layer that allows customers like Barclays, Samsung, and Wells Fargo to connect their compute infrastructure to any storage backend.
“All the storage vendors, their goal is to kill the data silos. Our position is to embrace them. That’s a fundamental difference,” Li says. “Alluxio is an implementation of this data orchestration system, which virtualizes data from different storage or different storage deployments, abstracts them, and presents a global namespace and an industry standard API to all the data-driven applications.”
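To make the idea of a global namespace concrete, here is a minimal Python sketch of a layer that mounts several storage backends under one path tree. It is an illustration only and not Alluxio's API or implementation: the names `StorageBackend`, `InMemoryBackend`, and `UnifiedNamespace`, the mount prefixes, and the sample files are all hypothetical, and the in-memory backends stand in for real systems such as an on-prem cluster or a cloud object store.

```python
from typing import Dict, Protocol


class StorageBackend(Protocol):
    """Hypothetical minimal interface every mounted store must offer."""

    def read(self, path: str) -> bytes: ...


class InMemoryBackend:
    """Stand-in for a real store; keeps objects in a dict for the demo."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self._objects = dict(objects)

    def read(self, path: str) -> bytes:
        return self._objects[path]


class UnifiedNamespace:
    """Routes one logical path tree to whichever backend is mounted there."""

    def __init__(self) -> None:
        self._mounts: Dict[str, StorageBackend] = {}

    def mount(self, prefix: str, backend: StorageBackend) -> None:
        # Normalize so "/cloud" and "/cloud/" name the same mount point.
        self._mounts[prefix.rstrip("/") + "/"] = backend

    def read(self, path: str) -> bytes:
        # Check longer prefixes first so nested mounts take precedence.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if path.startswith(prefix):
                return self._mounts[prefix].read(path[len(prefix):])
        raise FileNotFoundError(path)


ns = UnifiedNamespace()
ns.mount("/onprem", InMemoryBackend({"loans/2019.csv": b"id,amount\n1,250000\n"}))
ns.mount("/cloud", InMemoryBackend({"retail/2019.csv": b"id,total\n7,42\n"}))

# The application uses one path scheme and never names the underlying silo.
loans = ns.read("/onprem/loans/2019.csv")
retail = ns.read("/cloud/retail/2019.csv")
```

In this sketch the silos stay where they are. Only the routing table knows which backend serves which path, which is roughly the "embrace them" position Li describes.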
Open Data Formats
We’ve clearly moved past the old days, when companies would routinely spend millions of dollars building on-premises data warehouses on proprietary systems using proprietary data formats. But even in the new world order, customers can get locked in with clouds, says Justin Borgman, CEO of Starburst Data, the commercial entity behind the Presto SQL query engine.
“Data lock-in is vendor lock-in. It’s the worst form of vendor lock-in,” Borgman says. “If your data is locked into a certain format, your ability to move to another vendor or even have other tools read from that data is severely hindered.”
Unlike some cloud query offerings, the Presto query engine works with data stored in a variety of formats, including ORC, Parquet, Avro, JSON, XML, and even CSV files. Borgman considers that a strength, especially when companies are expanding from data warehousing to other data workloads.
“You can actually have data scientists training a machine learning model working from the same data sets as your BI analysts who are working with Tableau, Looker, or pick your favorite BI tool,” he says. “It just gives customers a lot of optionality, a lot of freedom, a lot of flexibility that you don’t get in that traditional proprietary data warehousing world.”
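As a rough illustration of why open formats keep data available to many tools, the Python sketch below runs the same aggregation over one data set stored two ways, as CSV and as JSON Lines, using only the standard library. It is not how Presto works internally. The reader registry, the function names, the `amount` column, and the sample records are hypothetical, and JSON Lines is assumed here as a simple stand-in for the JSON data the article mentions.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterator, List

Row = Dict[str, str]


def read_csv(text: str) -> Iterator[Row]:
    """Yield one dict per CSV record, keyed by the header row."""
    yield from csv.DictReader(io.StringIO(text))


def read_json_lines(text: str) -> Iterator[Row]:
    """Yield one dict per non-empty line, with values coerced to strings."""
    for line in text.splitlines():
        if line.strip():
            yield {key: str(value) for key, value in json.loads(line).items()}


# Hypothetical registry: supporting another open format means adding a reader.
READERS: Dict[str, Callable[[str], Iterator[Row]]] = {
    "csv": read_csv,
    "jsonl": read_json_lines,
}


def scan(fmt: str, text: str) -> List[Row]:
    """Load rows from any registered format into one common shape."""
    return list(READERS[fmt](text))


def total_amount(rows: List[Row]) -> float:
    """The 'query': it sees rows only, never the storage format."""
    return sum(float(row["amount"]) for row in rows)


csv_data = "id,amount\n1,10.5\n2,4.5\n"
jsonl_data = '{"id": 1, "amount": 10.5}\n{"id": 2, "amount": 4.5}\n'

csv_total = total_amount(scan("csv", csv_data))
jsonl_total = total_amount(scan("jsonl", jsonl_data))
```

Because the query logic depends only on the row shape, a BI report and a model-training job could share the same files in this sketch, and neither would need to change if the files were rewritten in another documented format.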
While Hadoop’s influence has waned, the platform leaves behind a rich legacy of openness that will influence data architectures into the future, Borgman says. “I think they’re the lasting pieces [of Hadoop], both the data lake concept and these open data formats,” he says.
There’s nothing stopping customers from moving their data from Hadoop into the cloud. ORC, Parquet, and other open data formats can live in S3 or ADLS Gen2 object stores just as easily as they can live in on-prem Hadoop clusters. Plenty of customers, of course, are taking advantage of this freedom to move their data.
“The idea of a data lake is still brilliant and totally relevant and I think will last for decades. But what that data is composed of is pretty radically changing,” Borgman says. “It’s good for the customer in the sense that Hadoop didn’t have a lot of lock-in. But it’s hard for those [Hadoop vendors] that people are just moving that data right into the cloud and accessing it there.”
But companies that are turning off their Hadoop clusters and moving data into the cloud should give some serious thought to how they’re storing the data. The cloud looks like a great deal now, but as we’ve seen, things can quickly change.
“AWS is locking you in. Azure is locking you in,” Borgman says. “As long as your data is in an open data format, you have an out. Even though it may have a lot of gravity, you can move it all from AWS to Azure and again create there. The open file formats give you flexibility.”