Return of the Living Data
When Google published a paper on its proprietary BigQuery engine about nine years ago, the open source community reproduced the technology as best it could, just as it had with MapReduce and the Google File System, whose open source reproductions eventually became Hadoop. But the data format that emerged from that effort, Apache Parquet, was not compatible with BigQuery’s native ColumnIO format. It was a dead end, until now.
Today, Google announced beta support in BigQuery for federated queries of Parquet and Optimized Row Columnar (ORC), the two most popular columnar data formats used in Hadoop data lakes. It also announced support for Hive-partitioned tables, which are typically backed by Parquet files. The news means Google Cloud customers suddenly have more freedom in how they store and analyze data.
Prior to this move, BigQuery users who wanted to analyze columnar data stored in Hadoop had to move the data into BigQuery’s native format, ColumnIO. The engine also supported federated queries of data stored in the row-oriented Avro format, as well as CSV and JSON data (not to mention Google stores like Cloud Bigtable, Google Sheets, and Cloud SQL). But those formats are not as optimized for large-scale analytics as Parquet or ORC, so customers faced a dilemma. (For more on the differences among these data formats, see “Big Data File Formats Demystified.”)
Customers who moved their data into BigQuery’s managed storage and used the ColumnIO (now Capacitor) format benefited in tangible ways. But moving petabytes of data brings its own set of challenges. And while Google makes the migration relatively painless by waiving some of the costs of moving data into BigQuery, many customers nevertheless resisted doing it.
Now that BigQuery supports federated queries of Parquet, ORC, and Hive partition tables, Google Cloud customers can leave their data in Google Cloud Storage and query it remotely (i.e. in a federated manner) using BigQuery.
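For readers who want a concrete picture of what "leaving the data in Cloud Storage" looks like, the sketch below builds the general shape of the DDL BigQuery uses to define an external table over Parquet files. The dataset, table, and bucket names are hypothetical, and the exact syntax may vary by release, so treat this as an illustration rather than a reference:

```python
def external_parquet_ddl(table: str, uris: list[str]) -> str:
    """Build a CREATE EXTERNAL TABLE statement for Parquet files in GCS.

    Illustrative sketch of BigQuery's external-table DDL; the table and
    bucket names passed in are hypothetical.
    """
    uri_list = ", ".join(f"'{u}'" for u in uris)
    return (
        f"CREATE EXTERNAL TABLE {table}\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = [{uri_list}]\n"
        ")"
    )

# Once the external table is defined, BigQuery can query the files in
# place, e.g.:  SELECT COUNT(*) FROM my_dataset.events_ext
print(external_parquet_ddl("my_dataset.events_ext",
                           ["gs://my-bucket/events/*.parquet"]))
```

The same OPTIONS clause accepts `format = 'ORC'`, and Hive-partitioned layouts add a `hive_partition_uri_prefix` option in BigQuery's DDL.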
“We’ve invested over the years in becoming a more open data warehouse,” said Tino Tereshko, Google’s product manager for BigQuery. “This is opening up old storage to non BigQuery processing technologies like Hadoop and Spark, and it’s opening up our query engine to be able to reach out into these.”
Keeping the data in Parquet and ORC formats also reduces the chances of certain dark forces being conjured up from the data afterlife, Tereshko writes in a blog post called “Keep Parquet and ORC from the data graveyard with new BigQuery features.” “You don’t have to move any data, and you can be sure of the integrity of the data you’re querying—no evil twin copies lurking about,” he wrote.
A couple of hundred Google Cloud customers participated in the alpha of the new federated query feature. One of these customers, the streaming music company Pandora, is migrating petabytes of data from an on-prem Hadoop cluster into Google’s managed Hadoop service, Dataproc. Pandora uses Spark and Hive, but it also wanted to use BigQuery. That posed a challenge, until now.
“The support for Parquet and other external data source formats will give us the ability to choose the best underlying storage option for each use case, while still surfacing all our data within a centralized, BigQuery-based data lake optimized for analytics and insights,” Pandora product manager Greg Kurzhals says in the blog.
Another early user is Cardinal Health, which is also migrating from on-prem Hadoop clusters to Dataproc. The company didn’t want to give up all the work it has done with ORC, a format that was most popular among Hortonworks customers.
“We also wanted to leverage cloud-native options like BigQuery but without necessarily rewriting our entire ingestion pipeline,” Cardinal Health’s senior enterprise architect Ken Flannery says in the blog. “We needed a quick and cost-effective way to allow our users the flexibility of using different compute options (BigQuery or Hive) without necessarily sacrificing performance or data integrity. Adding ORC federation support to BigQuery was exactly what we needed and was timed perfectly for our migration.”
While query federation for Parquet and ORC gives BigQuery customers more options, they likely will not see the same level of performance as they would if they moved the data into Capacitor, the next generation of BigQuery’s native store.
“BigQuery Storage allows us to do things. We take control of the data,” Tereshko tells Datanami. “It allows us to do things that we can’t otherwise do with Parquet files. We can reprocess the data. We can move it really close physically to compute. We can collect additional statistics on the data. We can do a lot of things that you really can’t do with the files that you bring and give to us. With that, BigQuery’s native storage will have more functionality and it will probably have better performance.”
In any event, there are likely many customers who value the freedom to choose their data storage over the additional performance they get. And while Google isn’t likely to open source Capacitor anytime soon, it’s a good thing when major cloud providers like Google embrace giving customers more freedom.
It’s also fun to watch the reunification of Parquet and ColumnIO, two technologies that have similar origins but for years have been incompatible with each other.
“It’s interesting to see that dynamic come full circle, now that we’re supporting Parquet as a first class citizen in BigQuery, even though it was based on top of BigQuery’s original format,” Tereshko said.