May 1, 2018

Apache Airflow to Power Google’s New Workflow Service

Apache Airflow, the workflow management system developed by Airbnb, will power the new workflow service that Google rolled out today. Called Cloud Composer, the new Airflow-based service allows data analysts and application developers to create repeatable data workflows that automate and execute data tasks across heterogeneous systems.

Airbnb engineer Maxime “Max” Beauchemin created Airflow in 2014 to help the company stay on top of common computational tasks, such as extracting data from databases, transforming data for analysis, loading data into analytical environments, applying rules for email marketing campaigns, or kicking off A/B testing. The short-term vacation rental company was growing quickly, and like other companies innovating in the space, it relied heavily on emerging data technologies, like Apache Hadoop and related tech, to help it efficiently process and analyze data at scale.

The engineer found that existing workflow tools, like Apache Oozie and Azkaban, were insufficient for Airbnb’s needs. The company needed to schedule upwards of 6,000 Hadoop jobs per day, and it was becoming increasingly difficult to ensure that the jobs were completed on time and in the right order. What’s more, the use of Hadoop and related frameworks (like Pig, Hive, Spark, PySpark, etc.) was growing at Airbnb, and the company needed a better way to allow analysts and other users to create new data pipelines without taking up data engineers’ precious time.

Beauchemin created Airflow to fill the gap in data pipeline tools. The core Airflow data structure is a directed acyclic graph (DAG) that provides a rich workflow abstraction for ordering the execution of tasks. In addition to scheduling jobs across multiple platforms, the software also tracks jobs’ dependencies, automates deployment and execution, and monitors the dataflows. Instead of a GUI, Beauchemin gave Airflow a command line interface designed to boost the productivity of Python users, although there are graphical views of DAGs and other Airflow concepts for development and monitoring purposes.
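To illustrate how a DAG can order the execution of tasks, here is a minimal sketch in plain Python. It does not use Airflow's own API. It relies on the standard library's `graphlib` module (Python 3.9 or later), and the task names and the `execution_order` helper are hypothetical, chosen to echo the kinds of jobs described above.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
# Because the graph is acyclic, a valid run order always exists.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "email_campaign": {"load"},
    "ab_test": {"load"},
}


def execution_order(deps):
    """Return one valid order in which every task runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for position, task in enumerate(execution_order(dependencies), start=1):
        print(position, task)
```

In this sketch `email_campaign` and `ab_test` both depend only on `load`, so either may come first. A scheduler is free to run such independent tasks in parallel. If a dependency cycle were introduced, `TopologicalSorter` would raise `graphlib.CycleError`, which is why the graph must stay acyclic.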

Airflow’s source code has been available on GitHub since 2015, and it’s been incubating at the Apache Software Foundation since March 2016. Since then, the project has seen an explosion in interest as other users have found Airflow fills a need as an “air traffic control” solution for the modern big data infrastructure. Companies like Pandora, WePay, Quizlet, and Bloomberg have adopted Airflow or have started working with it.

Now Google is getting into the Airflow act with Google Cloud Platform, its hosted environment. The tech giant is presenting Cloud Composer as a managed Airflow service to simplify the creation and management of workflows in the Google Cloud environment, including built-in integration with BigQuery, Dataflow, Dataproc, Datastore, Cloud Storage, Pub/Sub, and Cloud ML Engine. Google says customers can orchestrate their entire GCP pipeline through Cloud Composer.

But Cloud Composer can also be used to manage hybrid data pipelines that touch services and data that live outside of the GCP environment, including other cloud providers. Avoidance of lock-in is a key feature, Google says. “Whether your workflows are on different clouds or on-premises, they can all be orchestrated right from Cloud Composer,” the company says in a video on the Cloud Composer webpage. The company adds that Cloud Composer “gives you the portability to take these critical pieces of infrastructure with you if you migrate your environment.”

DAGs are the key element providing flexibility in Cloud Composer, Google says. “DAGs allow you to see how you’re performing at a glance, dive more deeply into the workflow through a range of charts, or look directly at the code, all while Cloud Composer points out if and where errors are happening,” the company says in its video.

Google says it chose to base Cloud Composer on Airflow despite the large number of “awesome” open source workflow management tools, such as Oozie, Azkaban, and Luigi. Key reasons for selecting Airflow were its “active and diverse developer community,” its support for Python, its support for diverse platforms through Airflow operators, its support for multi-cloud setups, and its command line and Web interfaces.

Cloud Composer is now in beta, and it’s free to try, but customers who use it in production will be charged for it. Google says pricing for Cloud Composer is consumption based, which means users pay for the number of CPU hours they use, the number of gigabytes of data they store per month, and the number of gigabytes transferred per month. “We have multiple pricing units because Cloud Composer uses several GCP products as building blocks,” the company says. See the company’s Cloud Composer webpage for details.
