December 19, 2023

How Airflow 2.8 Makes Building and Running Data Pipelines Easier


Apache Airflow is one of the world’s most popular open source tools for building and managing data pipelines, with around 16 million downloads per month. Those users gain several compelling new features for moving data quickly and accurately in version 2.8, which the Apache Software Foundation released Monday.

Apache Airflow was originally created by Airbnb in 2014 to be a workflow management platform for data engineering. Since becoming a top-level project at the Apache Software Foundation in 2019, it has emerged as a core part of a stack of open source data tools, along with projects like Apache Spark, Ray, dbt, and Apache Kafka.

The project’s strongest asset is its flexibility, as it allows Python developers to create data pipelines as directed acyclic graphs (DAGs) that accomplish a range of tasks across 1,500 data sources and sinks. However, all that flexibility in Airflow sometimes comes at the cost of increased complexity. Configuring new data pipelines previously required developers to have a level of familiarity with the product, and to know, for example, exactly which operators to use to accomplish a specific task.
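For readers new to the model, a DAG is ordinary Python code. Below is a minimal sketch using Airflow’s TaskFlow API; the pipeline and task names are purely illustrative, not taken from any real deployment.

    from datetime import datetime

    from airflow.decorators import dag, task

    @dag(schedule=None, start_date=datetime(2023, 12, 1), catchup=False)
    def example_pipeline():
        @task
        def extract():
            # Stand-in for pulling rows from a source system
            return [1, 2, 3]

        @task
        def load(rows):
            # Stand-in for writing rows to a sink
            print(f"loaded {len(rows)} rows")

        load(extract())  # Airflow infers the dependency between the two tasks

    example_pipeline()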

With version 2.8, data pipeline connections to object stores become much simpler to build thanks to the new Airflow ObjectStore, which implements a common abstraction layer over different storage backends. Julian LaNeve, CTO of Astronomer, the commercial entity behind the open source project, explains:

“Before 2.8, if you wanted to write a file to your S3 versus Azure BLOB storage versus on your local file disk, you were using different providers in Airflow, specific integrations, and that meant that the code looks different,” LaNeve says. “That wasn’t the right level of abstraction. This ObjectStore is starting to change that.

“Instead of writing custom code to go interact with AWS S3 or GCS or Microsoft Azure BLOB Storage, the code looks the same,” he continues. “You import this ObjectStorage module that’s given to you by Airflow, and you can treat it like a normal file. So you can copy it places, you can list files and directories, you can write to it, and you can read from it.”
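In practice, that reads like ordinary file handling. Here is a minimal sketch of the new API, assuming a hypothetical S3 connection named aws_default and a bucket called my-bucket; the same code shape works for GCS or Azure Blob Storage by swapping the URI.

    from airflow.io.path import ObjectStoragePath

    # The Airflow connection id rides along in the URI's userinfo part
    base = ObjectStoragePath("s3://aws_default@my-bucket/reports/")

    path = base / "daily.csv"
    with path.open("w") as f:   # write to the remote object like a local file
        f.write("date,total\n2023-12-18,42\n")

    print(path.read_text())     # read it back
    for obj in base.iterdir():  # list the "directory" contents
        print(obj)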

Airflow has never been super opinionated about how developers ought to build their data pipelines, which is a product of its historic flexibility, LaNeve says. With the ObjectStore in 2.8, the product is starting to offer an easier path to building data pipelines without the added complexity.

“It also fixes this paradigm in Airflow that we call transfer operators,” LaNeve says. “So there’s an operator, or prebuilt task, to take data from S3 to Snowflake. There’s a separate one to take data from S3 to Redshift. There’s a separate one to take data from GCS to Redshift. So you kind of have to understand where Airflow does and where Airflow does not support those things, and you end up with this many-to-many pattern, where the number of transfer operators, or prebuilt tasks in Airflow, becomes very large because there’s no abstraction to this.”

With the ObjectStore, you don’t have to know the exact operator to use or how to configure it. You just tell Airflow that you want to move data from point A to point B, and the product will figure out how to do it. “It just makes that process much easier,” LaNeve says. “Adding this abstraction we think will help quite a bit.”
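Under that model, a transfer collapses to a copy between two paths. A hedged sketch, with placeholder bucket names and connection ids, of what previously required a store-pair-specific transfer operator:

    from airflow.decorators import task
    from airflow.io.path import ObjectStoragePath

    @task
    def s3_to_gcs():
        # Placeholder buckets and connection ids; any supported store works
        src = ObjectStoragePath("s3://aws_default@my-bucket/daily.csv")
        dst = ObjectStoragePath("gs://google_cloud_default@my-other-bucket/daily.csv")
        src.copy(dst)  # one call instead of a dedicated S3-to-GCS operator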

Airflow 2.8 is also bringing new features that will heighten data awareness. Specifically, a new listener hook in Airflow allows users to get alerts or run custom code whenever a certain dataset is updated or changed.

“For example, if an administrator wants to be alerted or notified whenever your data sets are changing or the dependencies on them are changing, you can now set that up,” LaNeve tells Datanami. “You write one piece of custom code to send that alert to you, how you’d like it to, and Airflow can now run that code basically whenever those data sets change.”

The dependencies in data pipelines can get pretty complex, and administrators can easily get overwhelmed by trying to manually track them. With the automated alerts generated by the new listener hook in Airflow 2.8, admins can start to push back on the complexity by building data awareness into the product itself.

“One use case, for example, that we think will get a lot of use is: anytime a data set has changed, send me a Slack message. That way, you build up a feed of who’s modifying data sets and what those changes look like,” LaNeve says. “Some of our customers will run hundreds of deployments, tens of thousands of pipelines, so to understand all of those dependencies and make sure that you are aware of changes to those dependencies that you care about, it can be pretty complex. This makes it a lot easier to do.”
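A minimal sketch of such a listener, registered through an Airflow plugin, is shown below; the print call stands in for whatever Slack or other notification code you would supply.

    # plugins/dataset_alerts.py
    import sys

    from airflow.listeners import hookimpl
    from airflow.plugins_manager import AirflowPlugin

    @hookimpl
    def on_dataset_changed(dataset):
        # Called whenever a task updates this dataset; replace the print
        # with your own alerting code (e.g., a Slack webhook call).
        print(f"Dataset changed: {dataset.uri}")

    class DatasetAlertsPlugin(AirflowPlugin):
        name = "dataset_alerts"
        listeners = [sys.modules[__name__]]  # register this module's hooks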

The last of the big three new features in Airflow 2.8 is an enhancement to how the product generates and stores logs used for debugging problems in the data pipelines.

Airflow is itself a complicated bit of software that relies on a collection of six or seven underlying components, including a database, a scheduler, worker nodes, and more. That’s one of the reasons that uptake of Astronomer’s hosted SaaS version of Airflow, called Astro, has increased by 200% over the past year (although the company still sells enterprise software that customers can install and run on-prem).

“Previously, each of those six or seven components would write logs to different locations,” LaNeve explains. “That means that, if you’re running a task, you’ll see those task logs that are specific to the worker, but sometimes that task will fail for reasons outside of that worker. Maybe something happened in the scheduler or the database.

“And so we’ve added the ability to forward the log from those other components to your task,” he continues, “so that if your task fails, when you’re debugging it, instead of looking at six or seven different types of logs…you can now just go to one place and see everything that could be relevant.”
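The forwarding happens automatically. If our reading of the 2.8 release notes is right, it is governed by a single logging option, which the airflow.cfg snippet below assumes; check the configuration reference for your exact version before relying on it.

    [logging]
    # Assumed option name from the Airflow 2.8 release notes; forwards
    # messages from the scheduler and executor into the task's own log.
    enable_task_context_logger = True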

These three features, and more, are generally available now in Airflow version 2.8. They’re also available in Astro and the enterprise version of Airflow sold by Astronomer. For more information, check out this blog on Airflow 2.8 by Kenten Danas, Astronomer’s manager of developer relations.

Related Items:

Airflow Available as a New Managed Service Called Astro

Apache Airflow to Power Google’s New Workflow Service

8 New Big Data Projects To Watch
