February 24, 2020

StreamSets Expands Databricks Partnership, Extending Ingestion Capabilities for Delta Lake

SAN FRANCISCO, Feb. 24, 2020 — StreamSets, provider of the industry’s first DataOps platform, today announced an expansion of its partnership with Databricks by participating in Databricks’ newly launched Data Ingestion Network. As part of the expanded partnership, StreamSets is offering additional functionality with a new connector for Delta Lake, an open source project that provides reliable data lakes at scale. With it, users can configure their pipelines to write data from any source moving in batch or streaming mode directly into Delta Lake. Now, data teams can deliver all of their data in a shorter time frame, driving BI, analytics and ML.

Today, companies require systems that support diverse data applications such as real-time monitoring, machine learning and data science, and that can process unstructured data like text, images, video and audio. A decade ago, data lakes replaced data warehouses as the best repositories for this raw data; however, they neither support transactions nor enforce data quality. In addition, they lack consistency, making it almost impossible to mix appends with reads or batch jobs with streaming jobs.

Leveraging the best of data warehouses and data lakes, lakehouses remedy the above limitations, but friction in ingesting fresh data remains. With this partnership, Databricks users will now be able to capitalize on the new lakehouse paradigm without the friction previously encountered. They can easily connect to StreamSets Cloud and leverage out-of-the-box connectors to load batch, change data capture (CDC) or streaming data from any source (such as cloud applications, relational data, on-premises data lakes and warehouses) into Delta Lake. With StreamSets, data engineers can easily build and operate data pipelines for modern and legacy data sources to migrate to a lakehouse and continuously refresh it with relevant data.

Specifically, the new StreamSets connector for Delta Lake offers several key benefits for even greater operational control over the full life cycle of data:

  • Faster migration to the cloud with fewer data engineering resources
  • Drag-and-drop interface to simplify data movement from multiple disparate sources
  • Improved management of operations and performance for lakehouses
  • Change-data-capture capability from several data sources into Delta Lake
  • Built-in Kubernetes containerization and native cloud scaling

Combined with Delta Lake, which provides ACID transactions, the connector also makes it possible to unify batch and streaming data to support the timeliness of transactional operations.
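
To make the change-data-capture idea concrete, the following is a minimal conceptual sketch in plain Python. It is not the StreamSets connector or the Delta Lake API. The event shape `(operation, key, row)` and the function name `apply_changes` are assumptions made for illustration. The sketch shows a batch of captured changes being applied all-or-nothing to a keyed table, which is the kind of guarantee a transactional table format provides.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

# Hypothetical change-event shape (an assumption for this sketch):
# (operation, primary_key, row), where row is None for deletes.
ChangeEvent = Tuple[str, int, Optional[Dict[str, Any]]]


def apply_changes(
    table: Dict[int, Dict[str, Any]],
    events: Iterable[ChangeEvent],
) -> Dict[int, Dict[str, Any]]:
    """Return a new table with every event applied.

    The input table is never modified, so a failure partway through a
    batch leaves the original state untouched (all-or-nothing).
    """
    staged = dict(table)  # work on a copy of the key -> row mapping
    for op, key, row in events:
        if op in ("insert", "update"):
            if row is None:
                raise ValueError(f"{op} for key {key} requires a row")
            staged[key] = row  # upsert: replace or add the row
        elif op == "delete":
            staged.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op}")
    return staged


# Usage: one captured batch containing an update, a delete and an insert.
target = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
batch = [
    ("update", 1, {"name": "Ada L."}),
    ("delete", 2, None),
    ("insert", 3, {"name": "Cy"}),
]
target = apply_changes(target, batch)
```

A real pipeline would also have to handle ordering, schema changes and concurrent writers, which this sketch leaves out.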

“Databricks Ingest brings an opportunity for organizations to build a central lakehouse without worrying about repetitive data movement,” said Michael Hoff, senior vice president of Business Development and Partners at Databricks. “With StreamSets’ expanded support for Delta Lake, small and midsize companies now have an easy way to ingest data from their cloud-based service into Delta Lake so they can maximize their analytics efforts with fresh data in their lakehouse.”

“This connector is another step forward in our alliance with Databricks to deliver more data, faster, to drive traditional BI and machine learning initiatives — which is critical to the survival and success of today’s organizations,” said Jobi George, general manager of Cloud Business at StreamSets. “We’re excited to continue our work with Databricks to drive innovation in the industry.”

The connector is currently available for Databricks customers.

To learn more, save a spot in Databricks’ upcoming webinar, “Accelerate building lakehouses for Business Intelligence and Machine Learning.”

About DataOps

Analytics has modernized in our always-on, always-changing world. How you deliver data to drive analytics has to modernize, too. DataOps is a set of practices and technologies that operationalizes data management and integration to ensure resilience and agility despite ceaseless change. It combines the DevOps principles of continuous delivery with the ability to tame data drift (unexpected and undocumented changes to data). By embedding these principles, DataOps makes it possible to deliver the continuous data needed to drive modern analytics and digital transformation.
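
As a rough illustration of what detecting data drift can mean in practice, the sketch below compares an incoming record against an expected schema and reports fields that were added, removed or changed type. It is a simplified assumption-based example, not how the StreamSets platform is implemented. The function name `detect_drift` and the schema representation (field name mapped to a Python type) are hypothetical.

```python
from typing import Any, Dict, List


def detect_drift(
    expected: Dict[str, type],
    record: Dict[str, Any],
) -> Dict[str, List[str]]:
    """Report how a record deviates from the expected schema."""
    # Fields present in the record but unknown to the schema.
    added = sorted(k for k in record if k not in expected)
    # Fields the schema expects but the record no longer carries.
    missing = sorted(k for k in expected if k not in record)
    # Fields still present whose value is no longer of the expected type.
    retyped = sorted(
        k for k, t in expected.items()
        if k in record and not isinstance(record[k], t)
    )
    return {"added": added, "missing": missing, "retyped": retyped}


# Usage: the upstream source started sending "id" as a string and
# added an undocumented "currency" field.
schema = {"id": int, "amount": float}
report = detect_drift(schema, {"id": "42", "amount": 9.5, "currency": "USD"})
```

A pipeline could use such a report to route drifted records for review instead of failing outright.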

About StreamSets

StreamSets built the industry’s first multi-cloud DataOps platform for modern data integration, helping enterprises to continuously flow big, streaming and traditional data to their data science and data analytics applications. The platform uniquely handles data drift, those frequent and unexpected changes to upstream data that break pipelines and damage data integrity. The StreamSets DataOps Platform allows for execution of any-to-any pipelines, ETL processing and machine learning with a cloud-native operations portal for the continuous automation and monitoring of complex multi-pipeline topologies.

Source: StreamSets