Airflow Available as a New Managed Service Called Astro
Companies can now get an Apache Airflow data orchestration environment up and running in less than an hour via Astro, a new managed service launched today by Astronomer, the commercial entity behind the popular open-source project for data pipelines.
Airflow has become one of the most popular open-source projects in the world, thanks to its ability to create and orchestrate large numbers of data pipelines in a flexible manner. These pipelines are expressed as directed acyclic graphs (DAGs), which are written in Python and visualized graphically in Airflow's UI, and can perform any number of tasks, including moving data on a schedule or in response to an event or action.
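The core idea behind a DAG is that each task declares what it depends on, and the dependency graph then determines a valid execution order. As a rough stdlib-only sketch of that idea (the task names here are hypothetical, and this is not Airflow's own API), Python's `graphlib` can compute such an ordering:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task is mapped to the set of
# tasks it depends on, forming a directed acyclic graph.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
}

# static_order() yields each task only after all of its
# dependencies, i.e. a valid execution order for the DAG.
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # ['extract', 'transform', 'validate', 'load']
```

Airflow's scheduler applies the same principle at scale, running each task only once everything upstream of it has succeeded.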
While Airflow is one of the more successful open-source projects, it’s not necessarily easy to set up a new environment. There are about 120 different configuration options that users must manually set when they stand up a new Airflow cluster, according to Ryan Fox, vice president of product for Astronomer.
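To give a flavor of the knobs involved: self-managed Airflow deployments are commonly configured through environment variables following Airflow's documented `AIRFLOW__{SECTION}__{KEY}` naming convention. The specific values below are illustrative assumptions, not recommendations:

```python
import os

# A handful of the many settings a self-managed Airflow
# deployment typically requires. Names follow Airflow's
# AIRFLOW__{SECTION}__{KEY} environment-variable convention;
# the values shown are placeholders, not recommendations.
os.environ["AIRFLOW__CORE__EXECUTOR"] = "CeleryExecutor"
os.environ["AIRFLOW__DATABASE__SQL_ALCHEMY_CONN"] = (
    "postgresql+psycopg2://user:password@db-host:5432/airflow"
)
os.environ["AIRFLOW__WEBSERVER__WEB_SERVER_PORT"] = "8080"

print(sorted(k for k in os.environ if k.startswith("AIRFLOW__")))
```

Multiply that by a hundred-plus options covering the scheduler, metadata database, logging, and workers, and the two-day setup estimate below becomes easy to believe.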
“To get a new Airflow environment up, even if I had already done it previously, just to get another copy of one up, it would take me two days,” Fox says. “And that’s for someone who had experience with it, already had the scripts built.”
With the new Astro managed service environment that Astronomer launched today on all major cloud platforms, that figure drops to about five minutes, Fox says. “We can have customers up and running in under an hour, and then from there new Airflow environments are up and running in minutes,” he tells Datanami.
Astro should also help simplify management of large Airflow deployments, which can grow complex with tens or even hundreds of thousands of individual DAGs spread across thousands of Airflow environments.
“We know that when those systems start to get distributed, that you can have increased data outages and lower data quality without explicit investment,” Fox says. “It really is a product that makes Airflow approachable to data engineers as well as analysts and scientists working with the tools that they know and love best.”
The path that Airflow took from being a promising open source project to an enterprise-grade data orchestration service is an interesting one. The software was originally created at Airbnb in 2014 to help orchestrate the plethora of data pipelines that the company’s data engineers, data scientists, and data analysts were creating to move data.
At the time, there were several data orchestration tools in the market, but no clear winner. Oozie was popular among companies that had adopted Hadoop, while Luigi was created at Spotify. Other Web giants, like LinkedIn, created their own internal products.
But by 2018, the market had coalesced around Airflow, which was developed in Python and allows developers to work in Python if they like. “Airflow was sort of everywhere,” says Joseph Otto, Astronomer’s CEO.
“We saw this repeating group of projects in most companies that were building modern data platforms. And that was really Spark, Kafka, and Airflow,” he says. “They were 95% of the companies we were running into.”
Astronomer was founded in 2018 to help nurture the open source project. At that time, Airflow had a lot of users, but the project didn’t have a commercial steward, Otto says. The Cincinnati, Ohio company has raised about $300 million and has about 300 employees, most of whom are developers or engineers working under Laurent Paris, the company’s senior vice president of R&D.
“We went out and found all the committers for the project, the release managers, the people that were driving the project, and offered them a chance to join a company and to make Airflow their full-time job,” Otto says.
Today, Airflow is downloaded about 9 million times per month, and the project has more contributors than Apache Spark. Some 250,000 companies use Airflow around the world, according to Otto, a huge built-in customer base for Astronomer, which currently has about 450 paying customers.
Over the past 18 to 20 months, Astronomer has rewritten about 90% of the internals of Airflow, according to Otto. It re-developed Airflow into a container-based application that runs in a Kubernetes environment, which was done explicitly with the goal of enabling it to run as a service in the cloud.
“We basically transformed the project,” Otto says. “So that two-year investment in the product yielded us a preferred and very privileged view of how people were using Airflow and it was just amazing how deeply embedded it was in large banks, in retail, and manufacturing – you name it – around the world.”