Dagster Emerges to Simplify Data App Development
If you’re frustrated with the fragmented way you build ETL processes or machine learning pipelines, then you might be interested in learning about Dagster, a new open source library developed to provide a common abstraction layer that helps data scientists, analysts, and engineers to create robust data applications in their tools of choice.
“Our data is totally broken.” That sums up the general attitude that many data scientists and data engineers have toward the state of data applications, according to Nick Schrock, who co-developed GraphQL while working as an engineer at Facebook, and who recently founded Elementl.
“My immediate reaction was confusion: How does one break data?” Schrock recounts in a Medium post last week in which he unveiled Dagster. “I quickly came to realize that it wasn’t a technical or engineering problem statement. Instead, it was an instinctive recognition that something is wrong at a systemic level.”
While data management problems aren’t sexy, they nevertheless pose a real hurdle to companies that are determined to fully utilize their data for big data analytics and machine learning use cases, according to Schrock. What’s more, the commonly cited statistic that data scientists spend 80% of their time on data cleaning is actually just a rough reflection of the overall level of despair shared by data engineers and analysts alike, he writes.
“It is difficult for leadership to get engineers to work on data management problems because they aren’t considered glamorous,” Schrock writes. “Further compounding the problem, engineers and non-engineers who do engage report that they feel as if they waste most of their time.”
Motivated to find a solution, Schrock determined that what’s really missing from the modern data puzzle is a common unifying layer that allows the various ETL/ELT scripts, machine learning pipelines, and other forms of data applications to talk to one another.
Currently, data applications are typically developed with a wide range of tools (Spark, SQL, Scala, Python, etc.) and when they need to communicate or inter-operate, “massive amounts of metadata and context is often lost as data is flowed from tool to tool, and there is no standard for interacting with the computations crafted within those tools,” Schrock writes.
The lack of this unifying layer causes all sorts of problems, including the repetitive janitorial-level data preparation tasks described above, but also other downstream problems, like an inability to reliably test data applications before putting them into production. If these ETL and data pipeline workflows had a more solid footing to be based upon, then many of the problems would be more easily addressed.
Schrock and company decided to address the data dilemma by creating Dagster.
Enter the Dagster
At a high level, Dagster is an integration layer designed to fill the gap left when data teams use a variety of tools and languages to create data applications, which in turn must adapt to a variety of unexpected inputs. The open source software exposes a Python API that represents a standard way to ensure these applications can interoperate.
Dagster, Schrock writes, is:
“A layer that models and describes the semantics and meaning of an application’s computations and produced data assets, rather than just the scheduling and execution of those computations.”
Dagster enables developers to define abstract graphs of functional computations, called “solids.” The actual business logic of a solid is defined and executed in any tool that’s structured to conform to the Dagster API. “Solids’ inputs and outputs are connected to each other by data dependencies, which form the graph,” Schrock writes.
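To make the idea of a graph built from data dependencies more concrete, here is a minimal sketch in plain Python. It does not use the Dagster API; the `Solid` class, the `run_graph` function, and the three example computations are hypothetical names invented for illustration only. Each solid declares which upstream solid feeds each of its inputs, and the runner works out the order from those declarations.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Solid:
    """Hypothetical stand-in for a solid: a named functional computation."""
    name: str
    fn: Callable[..., Any]
    # Maps each input parameter of fn to the name of the upstream solid
    # whose output feeds it. These data dependencies form the graph.
    inputs: Dict[str, str] = field(default_factory=dict)


def run_graph(solids: List[Solid]) -> Dict[str, Any]:
    """Run every solid once, computing upstream outputs before downstream inputs."""
    by_name = {s.name: s for s in solids}
    results: Dict[str, Any] = {}

    def resolve(name: str, stack: tuple = ()) -> Any:
        if name in results:
            return results[name]
        if name in stack:
            raise ValueError(f"cycle detected at solid {name!r}")
        solid = by_name[name]
        # Resolve each input by first computing the solid it depends on.
        kwargs = {
            param: resolve(upstream, stack + (name,))
            for param, upstream in solid.inputs.items()
        }
        results[name] = solid.fn(**kwargs)
        return results[name]

    for s in solids:
        resolve(s.name)
    return results


def load_numbers():
    return [1, 2, 3, 4]


def double(numbers):
    return [n * 2 for n in numbers]


def total(numbers):
    return sum(numbers)


# The list order is deliberately scrambled: only the declared data
# dependencies determine what runs first.
graph = [
    Solid("total", total, {"numbers": "double"}),
    Solid("double", double, {"numbers": "load_numbers"}),
    Solid("load_numbers", load_numbers),
]
results = run_graph(graph)
```

Because the ordering is derived from which outputs feed which inputs, nobody has to spell out that one step runs before another. That is the distinction between data dependencies and execution dependencies that Schrock draws.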
Once the underlying computations have been structured to work with Dagster, they are accessible via an API, he writes. This API can be used to access the functions of the underlying computation, and also provides an integration for DevOps tools.
Creating testable applications is one of the focuses of Dagster. “Data applications are notoriously difficult to test and are therefore typically un- or under-tested,” Schrock writes. “The reality of most existing code is that it is deeply coupled to its operating environment and difficult to test in isolation and reused in other contexts.”
The computationally intensive nature of data applications also throws developers another curveball. The combination of this nature plus the testing challenge described above creates “extraordinarily slow developer feedback loops” that are measured in hours.
“This not only slows down business logic development,” Schrock writes, “but also makes it expensive and risky to restructure and refactor code. These factors compound, resulting in software that is difficult to test; expensive and risky to change; and, as a result, often have low code quality.”
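The coupling problem Schrock describes can be illustrated with a short sketch, again in plain Python rather than the Dagster API. The function names and the sample records are hypothetical. The business logic is a pure function, and the operating environment (where rows come from and where they go) is passed in, so a test can substitute small in-memory stand-ins and get feedback in milliseconds rather than hours.

```python
from typing import Callable, Dict, Iterable, List


def clean_records(records: Iterable[Dict]) -> List[Dict]:
    """Pure business logic: drop rows without an id and normalize names."""
    return [
        {"id": r["id"], "name": r["name"].strip().lower()}
        for r in records
        if r.get("id") is not None
    ]


def run_cleaning_step(
    read_rows: Callable[[], Iterable[Dict]],
    write_rows: Callable[[List[Dict]], None],
) -> int:
    """Connect the logic to its environment through injected callables."""
    cleaned = clean_records(read_rows())
    write_rows(cleaned)
    return len(cleaned)


# In a test, in-memory fakes take the place of the production source and sink.
fake_input = [
    {"id": 1, "name": "  Ada "},
    {"id": None, "name": "ghost"},
    {"id": 2, "name": "GRACE"},
]
sink: List[Dict] = []
count = run_cleaning_step(lambda: fake_input, sink.extend)
```

In production, the same `run_cleaning_step` would presumably be handed callables that talk to a real warehouse or file system, while `clean_records` stays unchanged and separately testable.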
Developers can create data applications with Dagster using their choice of tool. Data applications that are created with Dagster are queryable and can be monitored through an API. Instead of being constructed with execution dependencies, Dagster apps have data dependencies, Schrock writes.
Schrock and company have been working on Dagster for the past year. Last week, Elementl released Dagit, which is a developer tool for local Dagster development. The development environment supports Spark, PySpark, Pandas, and Jupyter development tools; AWS and Google Cloud runtimes; Dask and Apache Airflow workflow tools; and ops tools like DataDog and PagerDuty, the company says.
The folks at Elementl don’t expect Dagster to solve all of the problems of the data community overnight. However, they do foresee that it could provide a common abstraction upon which the community can accelerate the creation of robust data applications in the future.
“We believe that adopting Dagster will immediately improve productivity, testability, reliability, and collaboration in data applications,” Schrock writes. “If broadly successful, it will lead to an entirely new open ecosystem of reusable data components and shared tooling.”