November 30, 2022

AWS Seeks an End to ETL


Extract, transform, and load. It’s a simple and ubiquitous thing in IT. And yet everybody seems to hate it. The latest company to pile on to ETL is AWS, which declared an effort to end ETL yesterday at re:Invent.

Adam Selipsky, the CEO of Amazon’s web services division, discussed the everlasting blight that is ETL during his re:Invent keynote Tuesday morning.

“Combining data from different data sources and different types of tools brings up a phrase that strikes dread into the hearts of even the sturdiest of data engineering teams. That’s right, I’m talking about ETL,” Selipsky said. “Just a few weeks ago, we got an email from a customer discussing ETL, and he literally used this phrase: ‘Thankless, unsustainable black hole.’”

Despite the pain and suffering that ETL has brought upon the computing world, few alternatives have appeared. Many companies have swapped the "transform" and "load" steps, performing transformation inside the cloud data lake, data warehouse, or data lakehouse, which has given rise to ELT. But it hasn't changed the fundamental problem with ETL.

"The manual effort, complexity and undifferentiated heavy lifting involved in building ETL pipelines is painful," Selipsky continued. "It requires writing … custom code. Then you have to deploy and manage the infrastructure to make sure the pipeline scales. Still, it can be days before the data is ready. And all the while, you've got eager analysts pinging you again and again to see if their data is available. And when something changes, you get to do it all over again."

AWS customer quote on the sheer joy and happiness of managing ETL data pipelines

Sound familiar? If ETL isn’t the bane of data engineering, it’s hard to say what is.

There have been various approaches to deal with ETL over the years. One of the most popular techniques is to just leave the data where it is and push the queries over the wire. AWS has enabled this federated query approach with its analytics and machine learning tools.

“We’ve integrated SageMaker with Redshift and Aurora to enable anyone with SQL skills to operate machine learning models to make predictions, also without having to move data around,” Selipsky said. “These integrations eliminate the need to move data around for some important use cases.

“But what if we could do more? What if we could eliminate ETL entirely?” the CEO said. “This is our vision, what we’re calling a zero-ETL future. And in this future, data integration is no longer a manual effort.”

To that end, AWS unveiled two new solutions that it claims help eliminate the need for ETL with Redshift, the company's flagship big data analytics database.

The first zero-ETL solution is a new integration between Aurora and Redshift. According to AWS, once transactional data is written to Aurora, it is continuously replicated to Redshift, where it is available within seconds. Aurora, of course, is AWS's relational database offering that's compatible with PostgreSQL and MySQL.

"This integration brings together transactional data with analytics capabilities, eliminating all the work building and managing custom data pipelines between Aurora and Redshift," Selipsky said. "It's incredibly easy. You just choose the Aurora tables containing the data that you want to get into Redshift. It appears in seconds. If it comes into Aurora, seconds later the data is seamlessly made available inside Redshift."

The new feature also gives customers the ability to move data from multiple Aurora databases into a single Redshift instance, AWS said. The new data integration is serverless, the company said, and scales up and down automatically based on data volume.
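As a rough illustration of how lightweight this setup is compared to a hand-built pipeline, here is a sketch of creating such an integration with the AWS CLI. The command shape follows the RDS CreateIntegration API; every name and ARN below is a placeholder assumption, and the exact flags and availability may differ by region and release.

```shell
# Sketch: wire an Aurora cluster to a Redshift target as a zero-ETL
# integration. All identifiers here are hypothetical placeholders.
aws rds create-integration \
    --integration-name orders-to-analytics \
    --source-arn arn:aws:rds:us-east-1:123456789012:cluster:example-aurora-cluster \
    --target-arn arn:aws:redshift-serverless:us-east-1:123456789012:namespace/example-namespace

# Inspect the integration and its replication status
aws rds describe-integrations
```

Once the integration is active, replication is managed by the service; there is no pipeline code or infrastructure for the customer to scale, which is the contrast Selipsky is drawing with traditional ETL.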

The second new capability in AWS's "zero-ETL" future involves a new integration between Redshift and Apache Spark, the popular big data processing framework that is used in Amazon EMR, AWS Glue, and Amazon SageMaker.

Customers often want to analyze data in Redshift using these Spark-based services, but previously that required either manually moving the data, building an ETL pipeline, or obtaining and implementing certified data connectors that could facilitate that data movement.

With the new Redshift integration for Spark unveiled by AWS this week, there is no longer a need to obtain those third-party connectors. Instead, the integration is built into the products.

“Now it’s incredibly easy to run Apache Spark applications on Redshift data from AWS analytics services,” Selipsky said. “You can do a simple Spark job on Jupyter notebooks in AWS services like EMR, Glue, and SageMaker to connect to Redshift to run read/write queries against Redshift tables. No more need to move any data. No need to build or manage any connections.”
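To make the shape of such a Spark job concrete, here is a minimal PySpark sketch of reading a Redshift table, using the option names of the community spark-redshift connector on which Amazon's Redshift integration for Apache Spark is based. The cluster endpoint, S3 staging bucket, IAM role, and table name are all placeholder assumptions, not values from the article.

```python
# Sketch: read a Redshift table into a Spark DataFrame and aggregate it.
# All endpoints, ARNs, buckets, and table names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-read-sketch").getOrCreate()

orders = (
    spark.read
    .format("io.github.spark_redshift_community.spark.redshift")
    # JDBC endpoint of the (hypothetical) Redshift cluster
    .option("url", "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev")
    .option("dbtable", "public.orders")              # table to read
    .option("tempdir", "s3://example-bucket/tmp/")   # staging area for UNLOAD
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/example-redshift-role")
    .load()
)

# Run an aggregation in Spark over data that lives in Redshift
orders.groupBy("order_status").count().show()
```

The point of the managed integration is that this kind of read/write path works out of the box from EMR, Glue, or SageMaker notebooks, without the customer sourcing and installing a third-party connector themselves.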

Both of these Redshift integrations, for Aurora and Spark, make it easier to generate insights without having to build ETL pipelines or manually move data, Selipsky said. “These are two more steps forward towards our zero-ETL vision,” he said. “We’re going to keep on innovating here and finding new ways to make it easier for you to access and analyze data across all of your data stores.”

Related Items:

AWS Unleashes the DataZone

AWS Introduces a Flurry of New EC2 Instances at re:Invent

Can We Stop Doing ETL Yet?