Follow Datanami:
March 31, 2021

ETL vs ELT – What is the Difference?

Kalyan Arangam

ELT? Is that a typing mistake? That’s a typical reaction from many data professionals familiar with the acronym ‘ETL’ when they first encounter the subtly-different acronym ‘ELT’.

What we’ve traditionally referred to as ETL tools (for extract, load, and transform) run in on-premises data warehouses located on special-purpose, high-end hardware in an organization’s data center.

Since 2013, however, cloud data warehouses have provided nearly infinitely scalable compute power. These resources let businesses forgo pre-load transformations and replicate raw data into their data warehouses – Extracting and Loading first. Then they can Transform data as needed, in the cloud, using SQL. With modern tools, ETL, in effect, becomes ELT. However, many people still use “ETL” to refer to either kind of data tool.

So no, “ELT” is not a typo. It’s those same familiar activities of Extract, Transform and Load—but carried out in a different, and more efficient, order. The new order leverages modern cloud data warehouses – SnowflakeAmazon RedshiftGoogle BigQuery, and Azure Synapse –  as well as the lakehouse technology used in Delta Lake on Databricks. If you want to use more of your data to make better, faster business decisions, ELT in the cloud is the way to go.

Let’s take a look at why.

What is ETL? (Extract, Transform and Load)

First, let’s remind ourselves about exactly what ETL is, and why we need it. ETL is the traditional method for Extracting data from numerous source platforms, Transforming the data on your ETL server and then Loading transformed data into a data warehouse, where it’s ready for analytics and visualization.

The ‘Extract’ part can be relatively simple, if it’s a single data source such as an ERP database. But if it’s an ERP database plus a number of lines of business systems or external third-party data sources, it gets more complicated quickly.  Whether you have one source or many, you need the  right connectors to extract the data from your source..

The “Transform” step can also vary in complexity. . It can be a straightforward denormalization of the normalized data contained in an ERP or line of business database. Or, it can be a more involved task, such as converting units of measure to a common basis, or harmonizing customer names.

Finally, there’s the ‘Load’ task: we have the data, we’ve transformed it, and now it goes into the data warehouse.

Whether ETL is straightforward or more complicated, it’s  characterized by a lot of I/O activity, a lot of string processing and variable transformation, and a lot of data parsing.

In other words, it’s very compute-intensive.

What is ELT? (Extract, Load and Transform)

ELT involves the same activities as ETL, but in this case, the ‘E’ and ‘L’ portions of ELT are done in one move straight to the target data platform. ‘Extract’ works  same with ELT or ETL. ‘Load’ is the same, too, except in ELT you load data into your target before you transform it. But the ‘Transform’ activity is very different. Instead of transforming your data in your ETL engine/server, you use the power of your cloud data warehouse to process the raw data you extracted and loaded.

ELT is also very compute-intensive. But that compute-intensive activity occurs in a highly powerful and scalable environment –  the cloud, rather than in an on-premises server that perhaps needs to trade off between data transformation  other  transaction handling.

Cloud data warehouses like Snowflake, Redshift, Google BigQuery, and Azure Synapse are columnar databases, so index and record location operations are vastly quicker. And they’re also massively parallel databases, so the required transformations are carried out in parallel, not sequentially, with multiple nodes handling multiple transformations at the same time.

So you’re doing compute-intensive activity where it makes sense, in aquicker and more efficient way.

With ELT, you ‘Extract’ data from the source system and stream it to intermediate cloud storage, such as Amazon S3 or Google Cloud Storage before ‘Loading’ it into the target data platform (a cloud data warehouse like Snowflake, or Delta Lake on Databricks). Extracted data passes through but is never persisted to disk. You facilitate transformations by generating and executing appropriate SQL on the target MPP database.

ETL vs ELT: Pros & Cons

ETL

The ETL engine is a compute resource, and as such needs to be powerful enough to handle large amounts of data to be transformed. Often “powerful” also means expensive!

As you would probably expect there are some limitations with the traditional ETL workflow. Namely, the environments running ETL software are not built to scale in the same way that  cloud data warehouses do today. Therefore, when data volumes increase and workloads become more complex these traditional environments consume more IT resources, creating  bottlenecks in the data chain that can negatively impact your reporting and analytics. The worst outcome? Adaptability losses and missed opportunities caused by business decisions that are made with outdated, incomplete, or inaccurate data.

However, it is not all necessarily doom and gloom. If your data is predictable, coming from only a small number of sources and your transformations are minimal,  ETL and an on-premises infrastructure can still be a legitimate cost-effective strategy. However, for most modern companies, this is becoming the exception rather than the norm. ELT

ELT, as opposed to ETL, extracts data from source systems, loads it in its raw form into a target platform, and then allows you to transform it in-database. This makes all data accessible and easy to audit.

ELT leverages the power of the data warehousing platform itself to perform transformations, and get the data into an analytics-ready format. ELT pushes down all transformations to the data warehouse itself, and means that you only need one powerful piece of infrastructure. Further advancements on relational databases make transforming your data in-database easier, faster, and more cost efficient. This design results in savings on infrastructure, better performing workloads, and shorter development cycles. Your data is quickly migrated and immediately available for transformations and analysis based on current business questions and needs.

Also it means you don’t need to know how you are going to use that data from the start. You have the freedom to transform at a later stage once its use case becomes more clear. This ability is increasingly appealing given the changing nature of development with the rise of iterative Agile methodologies. Thus ELT may be more aligned with the current technical mindset.

So ELT is the Best Option?

The answer to this question is not a simple ‘Yes’ or ‘No’. ELT has some obstacles that you shouldn’t ignore.

Since your transformations are being done in-database, you will need available space and compute power to perform the desired transformations. Without an adequate platform, performance and queries will suffer. And getting this platform in a traditional architecture gets really expensive, really quickly. Cloud data platforms such as Snowflake, Amazon Redshift, Azure Synapse, Google BigQuery, and Delta Lake not only have adequate power and scalability, but they are also extremely cost efficient compared with high-performance on-premises architectures.

Another problem we often hear about is the timely and labor-intensive process of script writing. This is not just for loading data, but transforming it. It applies especially when dealing with unusual or unstructured data types, or where access is not simply file-based. More data, data sources, and needed outputs can make code-heavy transformations  complex and vulnerable to human error. It could take days or weeks to script, run, and rollback in instances where mistakes are made. Luckily, there are a number of tools on the market that make this process quicker and easier, such as Matillion.

ELT versus ETL: The Bottom Line

With ELT, you get the advantage of your powerful database twice over: during transformations, and during processing queries. Your choice to use ETL vs. ELT should be based on your target platform. ETL workflows tend to fire DDL (inserts/updates/deletes) for each processed row. This may work fine with traditional databases but not as well with modern MPP platforms which are read-optimized but usually a bit slow with writes. ELT is arguably a better choice when dealing with modern MPP platforms because of the performance gains it offers.

Additionally, ELT typically offers a lower cost of entry. Putting compute-intensive activities in Big Iron has always been seen as sensible—for those that could afford the price of a hefty mainframe or high-end server. But today, that Big Iron is accessible to any business with the price of an affordable, monthly subscription on the Cloud. Likewise, massively parallel columnar databases have been around for years. But those offerings have been on-premises, and expensive, with an entry point denominated in hundreds of thousands of dollars. Today, yet again, cloud data warehouses make MPPs accessible to any business with the price of a low cost monthly subscription.

While ELT is not ALWAYS the answer, you will probably get better value for money and performance if you do use ELT and the cloud.

Matillion ETL – or is it ELT?

At Matillion, we think ELT makes compelling sense. Matillion ETL is a data integration tool that we purpose built for Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse, and Delta Lake on Databricks. While we refer to the product as Matillion ETL since “ETL” is more commonly known, Matillion is actually an ELT product. Following an ELT approach Matillion loads source data directly into your database allowing you to transform and prepare data for analytics using the power of your cloud data architecture.

Datanami