Kubeflow Emerges for ML Workflow Automation
Many data scientists today find it burdensome to manually execute all of the steps in a machine learning workflow. Moving and transforming data, training models, then promoting them into production – all of it requires the data scientist’s close attention. But now an open source project called Kubeflow promises to eliminate much of that busywork by automating machine learning workflows atop Kubernetes clusters.
Google initially created Kubeflow to manage its internal machine learning pipelines written in TensorFlow and executed atop Kubernetes, and released it as an open source project in late 2017. Since then, the Kubeflow community has integrated the software with a handful of additional machine learning and deep learning frameworks, including MXNet, PyTorch, Caffe2, and Nvidia TensorRT, as well as Jupyter notebooks and MPI, the parallel computing framework used in high performance computing (HPC) clusters.
Kubeflow’s design is based on the concept of a machine learning pipeline. An ML pipeline comprises all of the steps in a given data science workflow. It may start with obtaining data from a local or remote source, running some transformation upon the data, loading it into an ML model running on a laptop computer, and then initiating the training of that model on a larger cluster, using one or more data sets.
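The idea of a pipeline as an ordered chain of steps can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Kubeflow's API: the step functions (`load_data`, `transform`, `train`) and the `run_pipeline` helper are hypothetical names, the data is made up, and the "model" is simply a mean.

```python
from typing import Any, Callable, List

# A step takes the previous step's output and returns its own output.
Step = Callable[[Any], Any]


def load_data(_: Any) -> List[float]:
    # Hypothetical stand-in for obtaining data from a local or remote source.
    return [1.0, 2.0, 3.0, 4.0]


def transform(values: List[float]) -> List[float]:
    # Hypothetical stand-in for a transformation step: scale each value.
    return [v * 10 for v in values]


def train(values: List[float]) -> float:
    # Hypothetical stand-in for training: the "model" is just the mean.
    return sum(values) / len(values)


def run_pipeline(steps: List[Step], initial: Any = None) -> Any:
    # Run the steps in order, feeding each step's output into the next one.
    result = initial
    for step in steps:
        result = step(result)
    return result


model = run_pipeline([load_data, transform, train])
print(model)
```

Because the whole workflow is declared once as a list of steps, it can be rerun the same way each time instead of being repeated by hand. In a real deployment each step would run on the cluster rather than as a local function call.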
Many data scientists have tried to alleviate the administrative burden associated with their workflows by writing custom shell scripts or using tools like Chef, Puppet, or Ansible. However, these tools were designed for automating generic administrative tasks, not for streamlining data science and ML workflows.
Kubeflow was specifically designed to simplify the work involved in designing and deploying ML applications by bundling all the steps in an ML process into a self-contained pipeline that runs as a Docker container atop the Kubernetes orchestration layer. By leaning heavily on Kubernetes, Kubeflow can provide a higher abstraction level for ML pipelines, thereby freeing up data scientists for more value-added activities.
The Kubeflow project is still relatively young, but some folks in the big data community are impressed with what they’ve seen so far. Count Jim Scott, the director of enterprise architecture for MapR Technologies, as a Kubeflow convert.
Scott recounts conversations he’s had with data scientists. “[They say] ‘I have this problem,’” he says. “‘I have 20 steps. I’m doing them manually. This sucks. I can’t keep doing this.’”
Getting ML models into production has historically been one of the big sticking points of data science. A data scientist may have developed the world’s greatest linear regression, but unless it’s working on real live data, it will not have an impact. Conversely, putting an ML model into production before it’s been thoroughly tested is a great way to alienate customers – and potentially run afoul of regulators too.
With Kubeflow, data scientists essentially get a power tool that can automate a good chunk of that administrative work, ensuring that no steps in the process have been skipped, and that the process can be repeated later on if necessary.
“To go from the first 50 steps in a workflow to the next 50 steps in a workflow…with the orchestration being directly tied into the tooling for the workflow, I think it removes a really big hurdle,” Scott says. “When you’re going between different environments — and a lot of operational workflows encompass a variety of environments — the orchestration tier breaks down the walls of what those environments actually are.”
Much of Kubeflow’s power comes from the fact that it’s already integrated with Kubernetes. As long as data scientists are working with a tool or framework that’s been designed to work with Kubeflow, they can rely on Kubeflow to ensure that the work executes on Kubernetes as it was designed to, without the manual turning of knobs and dials that was previously required.
“The more languages and frameworks that are supported by Kubeflow, the easier it is for them to be able to create repeatable processes,” Scott says, “especially the ones that can be built and defined locally on their laptop or desktop, they can also be run in a production environment without having to rebuild the workflow because the environment is different.”
Comparisons with Airflow
“This year, if I was to put a label on the year, I would say this is the year of workflow tools,” Scott says. “Historically, they have sucked, or they’ve just taken a lot of extra effort. They’re difficult to use and manage. But when it’s bound into orchestration software, if you’re orchestrating your environment [with Kubernetes], you’re now picking up workflow for free.”
The closest competitor to Kubeflow might be Apache Airflow, the open source workflow management tool originally developed by Airbnb. Airflow has become a popular way to coordinate the execution of general IT tasks, including some tasks related to big data management, ML and data science. Airflow also integrates with Kubernetes, providing a potent one-two combination for reducing the technological burden of scripting and executing diverse jobs to run in complex environments.
But Kubeflow’s strict focus on ML pipelines gives it an edge over Airflow for data scientists, Scott says.
“I anticipate that Airflow will have [a] similar trajectory and growth as what Kubeflow will have, but with Kubeflow being more on the data scientist type of workflows and Airflow catching everything else,” he says.
Airflow and Kubeflow have one thing in common: both can improve the relationship between data scientists and the systems admins who control the infrastructure. Those two groups have not always seen eye to eye on things, but workflow tools could help alleviate their differences.
“Now we have a clean line of separation,” Scott says. “The ops guy doesn’t have to understand everything the data science guy is doing, and the data science guy doesn’t have to care about all the nitty gritty details that the system administrators would otherwise be doing. They can focus on their jobs…. It gives them a nice clean handoff so they’re not yelling at the other one for giving them poor documentation or not explaining it well enough.”
MapR does not yet support Kubeflow, but that will likely change in the near future. Google unveiled a commercial version of Kubeflow, called Kubeflow Pipelines, in November. Last month Intel released Nauta, which is essentially a commercial implementation of Kubeflow. Kubeflow is an open source project managed on GitHub. You can also find more information about Kubeflow at www.kubeflow.org.