September 5, 2019

StreamSets Eases Spark-ETL Pipeline Development

Alex Woodie

Apache Spark gives developers a powerful tool for creating data pipelines for ETL workflows, but the framework is complex and can be difficult to troubleshoot. StreamSets is aiming to simplify Spark pipeline development with Transformer, the latest addition to its DataOps platform. The company also unveiled the beta of a new cloud offering.

StreamSets, which is hosting its annual user conference this week in San Francisco, is making a name for itself in the big data world with its DataOps platform. The suite’s main focus is to simplify the task of creating and managing the myriad data pipelines that organizations are building to move data to where they need it, with all the requisite security, governance, and automation features that users demand.

The new Transformer software unveiled today sits in the Data Plane portion of Streamsets DataOps platform, which is where the data pipelines are created and managed. The other portion of DataOps is Control Plane, which is basically a configurable GUI management console.

After creating a new data pipeline in its drag-and-drop GUI, Transformer instantiates the pipeline as a native Spark job that can execute in batch, micro-batch, or streaming modes (or switch among them; there’s no difference for the developer).

With Transformer, StreamSets aims to ease the ETL burden, which is considerable. “Our initial goal is to ease the burden of common ETL sets-based patterns,” the company tells Datanami. “ETL and related activities have to be done to produce results in the downstream analytics, but often new technology like Apache Spark is not easily adopted across every organization.”

The StreamSets DataOps platform

The explosion of big data is changing the design patterns at organizations, StreamSets says. Gone are the days when the application design would dictate the type of data. It’s flipped around, and today organizations are looking to tailor their applications based on the available data. The whole DataOps platform – and Transformer specifically – simplify the creation of the pipelines that move data to the applications.

“Data pipelining is a necessary evil at any data-driven organization today, regardless of appetite,” StreamSets says. “The days where all insightful data lived within the walls of the EDW are far gone. Static connections to enterprise systems aren’t flexible enough to work with modern platforms.”

The new offering will leverage the power of Spark without exposing users to some of the burdensome intricacies of the distributed, in-memory framework, including monitoring Spark jobs and error handling. “In essence, StreamSets Transformer brings the power of Apache Spark to businesses, while eliminating its complexity and guesswork,” said StreamSets CTO Arvind Prabhakar.

Transformer works with other components of StreamSets Data Plane offerings, including Data Collector, which offers over a hundred connectors for source and destinations data repositories. Data Collector Edge, Dataflow Sensors, and Dataflow Observers tackle IoT, data drift, and pipeline monitoring, respectively; the whole DataPlane suite runs on Kubernetes.

ETL is a main focus, but it’s not the only use case for Transformer. The new offering will also support SparkSQL for utilizing the SQL processing capabilities of Spark. StreamSets says it contains custom Scala, Tensorflow and Pyspark processors, which allow users to design machine learning workloads “out of the box.” More machine learning and complex event processing functionality will be delivered later this year, the company says.

Meanwhile, the San Francisco company also announced the launch of its new cloud offering, StreamSets Cloud. While customers have already run one or more parts of the DataOps suite on the cloud, this is the first time that StreamSets has gotten into the software as a service (SaaS) business.

StreamSets is targeting a “cloud first” type of user with StreamSets Cloud.

“The existing StreamSets DataOps Platform (including StreamSets Data Collector, Control Hub, and Transformer among other products) is designed for enterprises that have a wide variety of data integration use cases and design patterns,” the company says. “On the other hand, StreamSets Cloud is a cloud-native SaaS that is optimized for the needs of cloud-first users who are redefining how data platforms are built and consumed.”

In the cloud, StreamSets users will get a point-and-click data pipeline building experience, without the need to install and maintain execution engines, the company says. The offering will also be tailored to common cloud use cases, such as ingesting data into cloud data warehouse and data lakes. And it will take full advantage of the scalability of cloud hosting environments, the company says.

The cloud is fast becoming where the majority of StreamSets customers are moving data to or from, the company says.

“More and more of our customers are adopting cloud data platforms such as Databricks, Snowflake and Azure, and we have seen that increase steadily in the last few years,” it says. “In particular, the adoption of cloud data warehouses and data lakes is taking off, and many of our customers are migrating from on-premises warehouses and lakes to cloud, or utilizing both for different use cases and synchronizing data across their hybrid environment.”

The beta for StreamSets Cloud will open in the coming weeks. Interested participants can pre-register here.

Related Items:

Can We Stop Doing ETL Yet?

StreamSets Balances Streaming Data Demands for Security, Access

Keeping on Top of Data Drift

Applications: Enterprise Analytics

Technologies: Cloud, Frameworks, Middleware

Sectors: Financial Services, Healthcare, Manufacturing, Retail

Vendors: Databricks, Microsoft, Snowflake, StreamSets

Tags: apache spark, data pipeline, dataops, ETL, StreamSets DataOps

Only registered users may comment. Register using the form below.

Check off newsletters you would like to receive*
- HPCwire
- EnterpriseTech
- Datanami
- Technology Conferences & Events
- Advanced Computing Job Bank
- Technology Product Showcase
Email*
Name*
First Last
Organization*
Job Function*
Industry*
Country*
City*
State*
Province*
- Please check here to receive valuable email offers from Datanami on behalf of our select partners.

StreamSets Eases Spark-ETL Pipeline Development

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 24, 2024

April 23, 2024

April 22, 2024

Sponsored Partner Content

Get your Data AI Ready – Celebrate One Year of Deep Dish Data Virtual Series!

Supercharge Your Data Lake with Spark 3.3

Learn How to Build a Custom Chatbot Using a RAG Workflow in Minutes [Hands-on Demo]

Overcome ETL Bottlenecks with Metadata-driven Integration for the AI Era [Free Guide]

Gartner® Hype Cycle™ for Analytics and Business Intelligence 2023

The Art of Mastering Data Quality for AI and Analytics

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Top 6 Strategies for Reducing Data Warehouse Costs

Building an Operational Data Warehouse for Real-time Analytics

Sponsored Multimedia

The Power of DataOps: Bring Automation to Life
No Comments

Tactical Steps for Cloud Migration
No Comments

Immuta Data Access Platform
No Comments

Data Mesh: Fact or Fiction?
No Comments

Contributors

Featured Events

AI & Big Data Expo North America 2024

AI Hardware & Edge AI Summit Europe

AI Hardware & Edge AI Summit 2024

CDAO Government 2024

StreamSets Eases Spark-ETL Pipeline Development

Join the discussion Cancel reply

Only registered users may comment. Register using the form below.

April 24, 2024

April 23, 2024

April 22, 2024

Most Read Features

Most Read News In Brief

Most Read This Just In

Sponsored Partner Content

Leading Solution Providers

Tabor Network

Sponsored Whitepapers

Sponsored Multimedia

Contributors

Featured Events

Share

Copy short link