Databricks, Partners Open a Unified ‘Lakehouse’
Coalescing around an open source storage layer, Databricks is pitching a new data management framework billed as combining the best attributes of data lakes and warehouses into what the company dubs a “lakehouse.”
The new data domicile is promoted as a way of applying business intelligence and machine learning tools across all enterprise data. The company and its lakehouse partners have also assembled a “data ingestion network” that allows users to load siloed data into Delta Lake, a storage layer released by Databricks to the open source community last year.
Among the applications that can be integrated into the lakehouse are Google Analytics, Salesforce and SAP, along with Cassandra, Kafka, Oracle, MySQL and MongoDB databases. Those, along with mainframe and file data, would be available in one location for BI and machine learning use cases.
As it aims to develop an enterprise AI platform, high-flying Databricks and its network partners are attempting to fuse traditional structured data with unstructured volumes while combining BI and machine learning use cases. Siloed data lakes and warehouses result in “slow processing and partial results that are too delayed or too incomplete to be effectively utilized,” Ali Ghodsi, Databricks’ co-founder and CEO, said this week in introducing the lakehouse framework.
Lakehouse “aspires to combine the reliability of data warehouses with the scale of data lakes to support every kind of use case,” Ghodsi added. “In order for this architecture to work well, it needs to be easy for every type of data to be pulled in.”
Along with enterprise analytics applications and databases, data can also be pulled into Delta Lake from cloud file storage services like Amazon Web Services S3, Google Cloud Storage or Microsoft Azure Data Lake Storage. Databricks said other integrations would be available soon from Informatica, Segment and Stitch.
The lakehouse partner network includes Fivetran, Infoworks, Qlik, StreamSets and Syncsort. Qlik said Monday (Feb. 24) it is deploying its data integration platform with Delta Lake, enabling users to automate streaming data to the cloud from mainframes, data warehouses or databases, then apply cloud-based analytics tools.
The unified storage layer would allow users to run machine learning along with traditional business intelligence workloads on a single lakehouse, added George Fraser, CEO of network partner Fivetran.
In donating Delta Lake code last year, Databricks noted the open source project targets shortcomings in data lakes as structured and big data are combined. Among them are poor data quality, unreliable reads and writes, and degraded performance as data lakes fill up.
The lakehouse framework is therefore promoted as combining the reliability of data warehouses with the scaling capability of data lakes to support emerging machine learning use cases.
To that end, Delta Lake includes ACID transactions on reads and writes along with schema management, data versioning and “time travel,” a reference to the ability to view older versions of a table or directory as new file versions are created.