Google Enters Data Catalog Business, Updates BigQuery
Google today rolled out a data catalog that will eventually give customers visibility into all of their data assets in the Google Cloud and beyond. It also bolstered BigQuery with support for materialized views and a new method for exporting SQL-based machine learning models in the TensorFlow format.
As the big data wave continues to crash into enterprises, data catalogs have become must-have tools for making sense of the digital sprawl. In addition to giving data analysts and data scientists visibility into a wide variety of available data, they can also provide governance and security controls to help prevent sensitive data from being accessed.
Given Google’s standing as the foremost search authority, it shouldn’t be surprising that Google Cloud’s new Data Catalog has a search engine at its core. The new offering, which is generally available, uses Google’s search technology to surface available data residing in BigQuery and Pub/Sub, the first data repositories that Google is supporting with the catalog.
But as Sudhir Hasbe, Google Cloud’s Director of Product Management for Smart Analytics, explains, the core search capability in Data Catalog is built around a security design that prevents unauthorized users from viewing sensitive data. That’s a key capability that helps customers abide by GDPR, CCPA, and other new data regulations.
“Organizations need to know what data assets they have and who has access to what data,” Hasbe tells Datanami. “That’s becoming a very common theme over the last couple of years.”
The new data catalog features crawlers that automatically ingest metadata from Google Cloud assets, starting with BigQuery and Pub/Sub, and extending to Google Cloud’s object storage system later this year. That metadata is used to surface data to the user interface through a search engine. But not every user has access to every piece of data, and those restrictions are based on tags that the administrator applies to the data, Hasbe says.
“The most important thing in that search technology is it’s secure by design from the beginning,” he says. “If there are data assets you want to secure, you say nobody can even look for them. You apply that. That’s what Data Catalog allows you to do.”
For example, if the customer is storing sensitive data, like passport numbers or Social Security numbers, the data administrator can apply a tag to those pieces of data that prevents unauthorized access. Data Catalog can automatically discover many sensitive data types too, Hasbe says.
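The tag-based restriction Hasbe describes can be illustrated with a small sketch. This is not the Data Catalog API; the Entry class, the search function, and the "pii" tag are hypothetical names chosen for the example, and the rule that a user must hold every tag on an entry is an assumption made to keep the example simple.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    """Metadata for one data asset (hypothetical shape, not Data Catalog's schema)."""
    name: str
    description: str
    tags: frozenset = field(default_factory=frozenset)


def search(entries, query, user_clearances):
    """Return names of entries matching `query` that the user may see.

    Assumed rule: an entry is visible only if every tag on it is among the
    user's clearances, so a restricted asset never shows up in results.
    """
    needle = query.lower()
    results = []
    for entry in entries:
        if not entry.tags <= user_clearances:
            continue  # hidden: the user lacks at least one required tag
        if needle in entry.name.lower() or needle in entry.description.lower():
            results.append(entry.name)
    return results


catalog = [
    Entry("sales.orders", "customer order history"),
    Entry("hr.passports", "customer passport numbers", frozenset({"pii"})),
]

analyst = frozenset()                  # no special clearances
privacy_officer = frozenset({"pii"})   # cleared for tagged sensitive data

print(search(catalog, "customer", analyst))
print(search(catalog, "customer", privacy_officer))
```

In this sketch the filter runs before the text match, which mirrors the idea that a secured asset cannot even be looked for, rather than being found and then blocked.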
Data Catalog is currently limited to BigQuery and Pub/Sub, but Google plans to support its object store and other Google Cloud assets later this year. The company also has plans to expand to on-prem data sets, and even other clouds. It will be leaning on partners and customers for this, Hasbe says.
“One of the things we have in the product is an API that anybody can use to publish other metadata assets,” he says. “For example, customers can take data from Teradata or something on-prem and publish it into the catalog themselves. And over time we will expand it to different assets on-prem, or other cloud stores.”
Google is working with its partner Collibra to expand its catalog outside of the Google realm. “Collibra is a good partner,” Hasbe says. “The plan is to have a two-way sync between the two systems so they can go ahead and get access to GCP and GCP would get access to what is already available on-prem.”
BigQuery is the subject of two other announcements Google made today, including the launch of a beta for materialized views.
With materialized views enabled on BigQuery columns, customers get fast access to the latest pre-aggregated data, which they often display in dashboards. This eliminates the need to run full table scans to get the latest results as new data arrives.
When customers activate materialized views, BigQuery will continually run queries as fresh data streams in, via Pub/Sub or other mechanisms. Because the materialized views are continually pre-aggregated and stored on disk, they help reduce consumption of processing resources while delivering faster access to query results.
“You’re basically spending a really low price on the storage side to get really high performance compared to a query that would run on all the data,” Hasbe says.
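The storage-for-speed tradeoff behind a materialized view can be shown with a toy example. The sketch below is an illustration of the general idea only, not a description of how BigQuery refreshes its views; the MaterializedSum class and its method names are hypothetical. It keeps a running per-key total that is updated as each row arrives, so a read is a single lookup instead of a scan over every row.

```python
from collections import defaultdict


class MaterializedSum:
    """Toy pre-aggregated view: a running SUM(amount) grouped by key."""

    def __init__(self):
        self.base_rows = []               # stands in for the base table
        self.totals = defaultdict(float)  # the stored, pre-aggregated result

    def insert(self, key, amount):
        """Append a row and update the stored aggregate incrementally."""
        self.base_rows.append((key, amount))
        self.totals[key] += amount

    def query_view(self, key):
        """Read the stored aggregate: one lookup, no scan."""
        return self.totals.get(key, 0.0)

    def query_full_scan(self, key):
        """Recompute from every base row, as a query without the view would."""
        return sum(amount for k, amount in self.base_rows if k == key)


view = MaterializedSum()
for key, amount in [("east", 10.0), ("west", 5.0), ("east", 2.5)]:
    view.insert(key, amount)

# Both paths agree; only the amount of work differs.
print(view.query_view("east"), view.query_full_scan("east"))
```

The extra dictionary is the "low price on the storage side" in miniature: a small amount of stored state replaces repeated work over the full data set.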
In other BigQuery news, Google is working to lower the barrier to working with machine learning. It’s doing this by enabling data analysts to create models using nothing but SQL, and then to output the results as a TensorFlow model that engineers can deploy into production.
“What we’re doing is democratizing machine learning by enabling analysts to go ahead and build these models without being an expert data scientist, and then having the capability to export these models and put into production so that you can scale to the Web-scale needs of that particular organization by simplifying that code,” Hasbe says.
Customers can deploy the TensorFlow model using the Google Cloud Machine Learning Engine or an open source equivalent, like Kubeflow. “You can pick any machine learning platform service that would support TensorFlow as a format,” Hasbe says.
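For readers who want a sense of what the SQL-only workflow looks like, here is a minimal sketch that builds the two statements involved: one to train a model and one to export it. The CREATE MODEL and EXPORT MODEL forms reflect BigQuery ML syntax as best understood at the time of writing and should be checked against Google's documentation; the helper function names, the model and table names, and the bucket path are all hypothetical. The sketch only assembles the SQL text and does not submit it to BigQuery.

```python
def kmeans_model_sql(model, source_table, num_clusters):
    """Build a CREATE MODEL statement for a k-means model (names are hypothetical)."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = 'kmeans', num_clusters = {int(num_clusters)}) AS\n"
        f"SELECT * FROM `{source_table}`"
    )


def export_model_sql(model, gcs_uri):
    """Build an EXPORT MODEL statement that writes the model to Cloud Storage."""
    return f"EXPORT MODEL `{model}`\nOPTIONS (URI = '{gcs_uri}')"


# Hypothetical dataset, table, and bucket names, for illustration only.
train_sql = kmeans_model_sql("shop.customer_segments", "shop.customer_features", 4)
export_sql = export_model_sql("shop.customer_segments", "gs://example-bucket/models/segments")

print(train_sql)
print(export_sql)
```

The exported files could then be handed to whichever TensorFlow-compatible serving platform the engineering team prefers, as Hasbe notes.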
Google also added two new machine learning algorithms to the mix. The first one is K-means clustering, which is useful for customer segmentation (it was actually added last year). The second one is matrix factorization, which is useful for product recommendations.
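To make the customer segmentation use case concrete, here is a minimal sketch of K-means (Lloyd's algorithm) in plain Python. It is not BigQuery's implementation. The sample data is invented, with each point standing for a hypothetical customer described by two made-up features, and the choice to seed the centroids with the first k points is a simplification that keeps the result deterministic.

```python
def kmeans(points, k, max_iters=100):
    """Lloyd's algorithm on a list of equal-length numeric tuples.

    Uses the first k points as initial centroids (a simplification; real
    implementations use smarter initialization). Returns (labels, centroids).
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Invented data: (visits per month, average basket size) for six customers.
customers = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
labels, centroids = kmeans(customers, k=2)
print(labels)
print(centroids)
```

On this data the algorithm separates the low-activity customers from the high-activity ones, which is the kind of grouping a marketing team would then act on.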