March 15, 2021

Informatica Accelerates DataOps with Spark, GPUs

Informatica today announced that customers can see up to a 5x performance boost for ETL and data management workloads when they run them under its new cloud-based data integration engine that’s powered by Apache Spark and Nvidia GPUs.

Informatica’s new offering, called Cloud Data Integration, is a hosted service designed to enable users to execute a slew of data operations (DataOps) and management tasks, including collecting data, building ETL pipelines, cleaning data, and preparing it for downstream analysis and machine learning tasks.

The offering, which runs as a serverless cloud service, utilizes Apache Spark as the underlying computational engine. Informatica also uses Nvidia’s RAPIDS Accelerator, which enables the Spark code to run atop Nvidia GPUs.
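Informatica has not described how its hosted service wires Spark to the GPUs, so the following is only a rough sketch of how the open-source RAPIDS Accelerator is commonly switched on in a self-managed Spark cluster. The helper names (rapids_spark_conf, to_submit_args) and the default values are hypothetical; the four property keys are standard Spark and RAPIDS Accelerator settings. The sketch assumes the RAPIDS Accelerator jar is already on the cluster's classpath.

```python
# Illustrative sketch only: this is NOT Informatica's configuration.
# It builds the Spark properties typically used to enable the
# open-source RAPIDS Accelerator plugin on a GPU-equipped cluster.

def rapids_spark_conf(gpus_per_executor=1, tasks_per_gpu=4):
    """Return Spark properties that enable GPU execution (hypothetical helper)."""
    if gpus_per_executor < 1 or tasks_per_gpu < 1:
        raise ValueError("gpus_per_executor and tasks_per_gpu must be >= 1")
    return {
        # Loads the RAPIDS Accelerator as a Spark plugin.
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        # Lets supported SQL/DataFrame operators run on the GPU.
        "spark.rapids.sql.enabled": "true",
        # Number of GPUs each executor requests from the cluster manager.
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        # Fraction of a GPU each task claims, so several tasks share one GPU.
        "spark.task.resource.gpu.amount": str(1 / tasks_per_gpu),
    }


def to_submit_args(conf):
    """Flatten a property dict into spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print the arguments one would append to a spark-submit command line.
    print(" ".join(to_submit_args(rapids_spark_conf())))
```

The point of the sketch is that GPU acceleration is enabled through configuration rather than by rewriting the Spark job, which fits the article's description of Spark code running atop GPUs.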

The combination of the Spark code and the GPUs resulted in a significant speedup as well as cost savings. According to Informatica, Cloud Data Integration runs 5 times faster than similar offerings, with 72% lower total cost of ownership (TCO).

No sophisticated Spark skills are needed to use the new service, Informatica says. Users can work in a “simple drag-and-drop GUI-based development experience” that converts “simple mappings to sophisticated Spark code that can execute on GPUs at scale,” the Silicon Valley firm says in a press release.

It’s all about data democratization, which is “the holy grail of digital transformation initiatives,” according to Jitesh Ghai, Informatica’s chief product officer. “You can’t leverage the power of data and gain valuable insights if you are restricted in your data access,” Ghai says in a press release. “Our collaboration with NVIDIA is valuable to us in bringing enterprise-scale data democratization and narrowing the gap between the data-haves and the data-have-nots within the enterprise.”

Informatica says its Cloud Data Integration offering supports more than 3,000 metadata-aware connectors for an array of file types, including JSON, XML, logs, and clickstream data. The offering supports ETL and ELT workloads, and features more than 100 prebuilt function templates for common data mappings and transformations.

Cloud Data Integration runs in elastic Kubernetes clusters on AWS, Azure, and Google Cloud. It supports real-time change data capture (CDC) functions, enabling it to extract data from production databases running on Windows, Linux, Unix, and IBM i systems. It also offers pushdown optimization, which converts workloads to optimized SQL code for popular cloud data warehouses. For more info, see

Related Items:

Informatica Likes Its Chances in the Cloud

Running Sideline to Sideline with Big Data

Can We Stop Doing ETL Yet?