AWS Plots Zero-ETL Connections to Azure and Google
At the recent re:Invent show, AWS unveiled new zero-ETL connections that will eliminate the need for customers to build and maintain data pipelines between various AWS data services, including Redshift, Aurora, DynamoDB, and OpenSearch. In the future, zero-ETL connections could also be available between AWS services and those running on Microsoft Azure and Google Cloud, an AWS executive says.
ETL (extract, transform, and load) is a fundamental process that’s part of most data analytics projects in the world. ETL exists because companies typically run operational systems and analytical systems on different infrastructure, with different types of databases that are optimized for online transaction processing (OLTP) or online analytical processing (OLAP).
For decades, data engineers have built ETL pipelines that extract the data from the operational database (often a row-oriented database), transform it into a format usable for analytics, and then load it into the analytical warehouse (such as a column-oriented database). ETL pipelines must be built for each operational system that will be contributing data to the analytical project, which can be as few as a handful or as many as 100. Sometimes the order is changed and the transformation (typically the hardest step) is done once the data has been loaded into the target analytical database, in which case it’s called ELT.
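As a rough illustration of the three steps described above, here is a minimal sketch in Python. It assumes an in-memory SQLite table standing in for the operational database and a plain dictionary of lists standing in for a column-oriented warehouse. The `orders` table, its fields, and the function names are all hypothetical and are not taken from any AWS service.

```python
import sqlite3


def extract(conn):
    """Extract: read raw rows from the operational (row-oriented) store."""
    query = "SELECT id, customer, amount_cents FROM orders ORDER BY id"
    return conn.execute(query).fetchall()


def transform(rows):
    """Transform: pivot rows into columns, normalize names, convert cents to dollars."""
    columns = {"id": [], "customer": [], "amount_usd": []}
    for order_id, customer, amount_cents in rows:
        columns["id"].append(order_id)
        columns["customer"].append(customer.strip().lower())
        columns["amount_usd"].append(amount_cents / 100)
    return columns


def load(columns, warehouse):
    """Load: append each column's values to the analytical store."""
    for name, values in columns.items():
        warehouse.setdefault(name, []).extend(values)


def run_pipeline():
    """Build a toy source database, run extract -> transform -> load, return the warehouse."""
    source = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE orders (id INTEGER, customer TEXT, amount_cents INTEGER)"
    )
    source.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, " Acme ", 1250), (2, "Globex", 990)],
    )
    warehouse = {}
    load(transform(extract(source)), warehouse)
    source.close()
    return warehouse
```

Swapping the last two steps, so that raw rows are loaded first and reshaped inside the warehouse, would turn this toy pipeline into the ELT variant.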
There are numerous problems with ETL (and ELT) that make it the bane of many data engineers’ existence. For starters, data pipelines are often brittle. Anytime an application developer makes a change to a field or adds a field to the upstream or downstream database, a data engineer must go in and change the ETL pipeline to account for it. Data can also drift by itself over time, due to the changing nature of the business, and there are many other ways ETL can break.
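The brittleness described above can be sketched in a few lines of Python. This is an illustrative example only: the field names and the `EXPECTED_FIELDS` tuple are hypothetical, and the point is simply that a transform step written against a fixed schema fails as soon as an upstream field is renamed.

```python
# Fields the (hypothetical) pipeline was written against.
EXPECTED_FIELDS = ("id", "customer", "amount_cents")


def schema_drift(row):
    """Return (missing, unexpected) field names relative to the expected schema."""
    missing = sorted(set(EXPECTED_FIELDS) - set(row))
    unexpected = sorted(set(row) - set(EXPECTED_FIELDS))
    return missing, unexpected


def transform_row(row):
    """Transform one source row; raise if the upstream schema no longer matches."""
    missing, _ = schema_drift(row)
    if missing:
        raise KeyError(f"upstream schema changed; missing fields: {missing}")
    return {
        "id": row["id"],
        "customer": row["customer"],
        "amount_usd": row["amount_cents"] / 100,
    }
```

For instance, if an application developer renames `customer` to `customer_name`, `transform_row` raises until a data engineer updates the pipeline, which is the kind of maintenance burden the paragraph describes.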
Despite the vitriol aimed at ETL, the IT world has largely been stuck with it. While the technology for moving data has improved with systems like Apache Kafka, the underlying nature of ETL-based data pipelines has not. Companies that have been at it for decades, like Informatica, IBM, Oracle, and Talend, today have newer competitors like Matillion, Fivetran, Stitch, and Airbyte. There are numerous other ETL vendors touting their slew of connectors, and there’s even reverse ETL.
AWS, which also makes and sells ETL tools like AWS Glue, touts itself as a customer-focused company. Its executives undoubtedly heard the grumbling and the groaning of customers about large analytics and AI jobs being delayed or perhaps even canceled due to brittle ETL pipelines not being able to deliver the data.
The solution AWS came up with was to get rid of the ETL middleman entirely. The company unveiled its zero-ETL strategy just over a year ago, at re:Invent 2022. The idea was to eliminate the need for customers to build dedicated data pipelines by essentially hardwiring connections between its services.
Its first zero-ETL connection linked data in the MySQL version of Amazon Aurora to Amazon Redshift, its column-oriented data warehouse. That was followed quickly by a zero-ETL connection between Redshift and Apache Spark, the popular big data processing framework that is used in Amazon EMR, AWS Glue, and Amazon SageMaker.
AWS followed that up with four more zero-ETL connections unveiled at re:Invent 2023. These include connections between Redshift and the Postgres version of Aurora, between Redshift and Amazon DynamoDB, and between Redshift and the Amazon Relational Database Service (Amazon RDS) for MySQL. The fourth zero-ETL connection is between DynamoDB and Amazon OpenSearch Service, the fork of Elasticsearch offered by AWS.
According to Ganapathy Krishnamoorthy, AWS’s vice president of data lakes and analytics, zero-ETL has the potential to deliver on the unfulfilled promises regarding the democratization of data, which data analytics providers have been making for years and largely failing to deliver for just as long.
“Why is it taking this long? I would say that there is a lot more emphasis on actually making the data accessible today compared to what it was before,” he said. “I think it’s a question of actually prioritizing that’s the thing. Adam [Selipsky, AWS CEO] went up there and said ‘Hey we want to envision a zero-ETL future,’ and aligned the investment to make that happen. It requires you to actually say, hey, we’re going to envision a world where that is not required.”
Krishnamoorthy, who goes by G2, is under no illusions that companies will store all of their data in AWS databases or AWS file systems. He understands that data will exist in silos, in other applications, on the edge, on premises, and even on competing clouds. But that won’t prevent AWS from continuing to invest in its zero-ETL goals, he says.
“Our goal is to actually enable customers to reach and manage their data where it exists,” Krishnamoorthy told Datanami in an interview at re:Invent. “We’re very proud of our services. But we understand that some data is actually going to be on premises, some data is going to be Azure or Google. And that’s okay. We will make zero ETL work for that, too.”
AWS already has data hooks that extend outside of its data centers. It has partnerships with SaaS vendors like Salesforce to enable customers to query data as it sits in the Salesforce applications. It also has a federated query capability that already exists for Google Analytics, he pointed out. So it’s not a stretch to see the AWS zero-ETL extending further into other clouds, he said.
“So, I as a user, can specify ‘Hey, I need this Google Analytics data accessible for my analytics,’ and then the machinery kicks in and makes sure that you don’t have to write the ETL. The same thing for data that exists in BigQuery,” Krishnamoorthy says. “This journey that we’re actually on, that helps you get easy access from your favorite tool. It could be Athena, it could be QuickSight, for all of your data [which] is actually something that we are deeply committed to. And we are actually supplying the best solution today and we are looking to improve on that.”
The actual mechanism that would enable this level of zero-ETL integration isn’t clear. Krishnamoorthy says it could be connectors or it could be some more direct connection, such as change data capture (CDC) hooked directly into the change log of a database, or some other approach. Whatever the mechanism turns out to be, the important thing, he said, is that users don’t have to worry about it.
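To make the change data capture idea concrete, here is a generic sketch in Python of replaying a database change log into a downstream copy. It is not a description of how AWS implements zero-ETL, which the article says is unclear. The event layout (`lsn`, `op`, `key`, `row`) and the function names are assumptions made purely for illustration.

```python
def apply_change(replica, event):
    """Apply one change-log event to the replica, a dict keyed by primary key."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        # Merge the changed fields into any existing copy of the row.
        replica[key] = {**replica.get(key, {}), **event["row"]}
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


def replicate(change_log, replica, last_applied=0):
    """Replay events newer than last_applied; return the new high-water mark."""
    for event in change_log:
        if event["lsn"] <= last_applied:
            continue  # already applied, so replaying the log is safe
        apply_change(replica, event)
        last_applied = event["lsn"]
    return last_applied
```

Because the consumer tracks the last applied log sequence number, it can resume after an interruption without reprocessing earlier changes, and it only ever reads what the source database has already recorded in its log.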
“It actually comes down to data,” he said. “If you think about it, you really need to have friction-free access with the right governance on all of your data in your enterprise systems. That’s the difference. You have powerful tools that are coming in in terms of query understanding, in terms of query translation. But it all goes down to actually access to the data. This is why zero-ETL is such a foundation. It actually reduces the amount of pain that is involved in bringing all the data accessible to all of your tools.”