How to Manage ML Workflows Like a Netflix Data Scientist
Data scientists want to do data science. It’s right there in the title, after all. But data scientists often are asked to do other things besides building machine learning models, such as creating data pipelines and provisioning compute resources for ML training. That’s not a good way to keep data scientists happy.
To keep Netflix’s 300 data scientists happy and productive, in 2017 the leader of the Netflix machine learning infrastructure team, Savin Goyal, led development of a new framework that abstracts away some of these less data science-y activities, allowing data scientists to focus more of their time on data science. The framework, called Metaflow, was released by Netflix as an open source project in 2019, and has been widely adopted since.
Goyal recently sat down with Datanami to talk about why he created Metaflow, what it does, and what customers can expect from an enterprise version of Metaflow at his startup, Outerbounds.
“If you want to do machine learning, there are a few raw ingredients that are needed,” Goyal says. “You need some place to store and manage your data, some way to access that data. You need some place to orchestrate compute for your machine learning models. Then there are all the concerns around MLops–versioning, experiment tracking, model deployment and whatnot. And then you need to figure out how is the data scientist actually authoring their work.
“So all of these raw building blocks in many ways exist, but then it’s still sort of left up to a data scientist to cross the barriers from one tool to the other,” he continues. “So that’s where Metaflow comes in.”
Metaflow helps by standardizing many of these processes and tasks, thereby allowing a data scientist to focus more on machine learning activities using Python or R, or any framework. In other words, it makes the data scientist “full stack,” Goyal says.
“It’s really important for a company like Netflix to provide a common platform for all its data scientists to be a lot more productive,” he says. “Because one of the big issues that bites most enterprises is that, if you have to deliver value with machine learning, not only do you need people who are great with data science, but you also need to enable folks to navigate internal system complexity.”
With Metaflow in place, Netflix’s data scientists (they called themselves machine learning engineers) don’t have to worry about how to connect to various internal data sources, or how to get access to large compute instances, Goyal says. Metaflow automates many of the aspects of running training pipelines or inference pipelines at scale on Netflix’s cloud platform, which is AWS.
In addition to automating access to data and compute, Metaflow also provides MLOps capabilities that help data scientists document their work through code snapshots and other capabilities. According to Goyal, the ability to reproduce results is one of the big benefits provided by the framework.
“One of the things that lacks in machine learning traditionally is problem reproducibility,” he says. “Let’s say you’re a data scientist and you’re training a model. Oftentimes, nobody else would be able to reproduce your results. Or if you’re running into an issue, nobody would be able to reproduce that issue to debug that. So we basically provide guarantees around reproducibility that then enables people to share results within teams, and that fosters collaboration.”
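The idea behind those reproducibility guarantees is that a run is only reproducible if every ingredient — the code, the parameters, and the data version — is captured together. A toy sketch of that principle in plain Python (this illustrates the concept only; it is not Metaflow's actual snapshotting mechanism):

```python
import hashlib
import json

def run_fingerprint(code: str, params: dict, data_version: str) -> str:
    """Hash the code, hyperparameters, and data version together.

    If any ingredient changes, the fingerprint changes, so two runs with
    the same fingerprint were produced from the same inputs and should be
    reproducible. (Toy illustration of code/artifact snapshotting, not
    Metaflow's real implementation.)
    """
    payload = json.dumps(
        {"code": code, "params": params, "data": data_version},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# Hypothetical run inputs, for illustration only.
code = "def train(lr): ..."
fp_a = run_fingerprint(code, {"lr": 0.01}, "table-v3")
fp_b = run_fingerprint(code, {"lr": 0.01}, "table-v3")
fp_c = run_fingerprint(code, {"lr": 0.10}, "table-v3")

assert fp_a == fp_b  # identical inputs -> same fingerprint
assert fp_a != fp_c  # changed hyperparameter -> new fingerprint
```

A teammate who holds the same fingerprint knows they are rerunning exactly what you ran, which is what makes sharing and debugging results across a team practical.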
Metaflow also lets users mix and match different cloud instance types within a given ML workflow, which helps them to reduce costs, Goyal says. For example, say a data scientist wants to train an ML model using a large data set housed in Snowflake. Since it’s a lot of data, it needs to go through a memory-intensive analysis process first, he says. Then the data scientist may want to train models on GPUs. Following the compute-intensive training process, deploying the model for inference requires fewer resources.
“You can basically carve out different sets of your workflow to run on different instance types and different resources,” Goyal says. “That further lowers your overall cost of training a machine learning model. You do not want to pay for a GPU instance while you’re only doing something that’s memory-intensive and doesn’t really engage the GPU.”
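The savings are simple arithmetic. Using made-up hourly prices (not real AWS quotes) for a workflow with a memory-heavy preprocessing step followed by GPU training:

```python
# Illustrative, made-up hourly prices -- not real AWS rates.
GPU_INSTANCE = 4.00  # $/hour for a GPU training instance
MEM_INSTANCE = 1.00  # $/hour for a memory-optimized instance

prep_hours, train_hours = 3.0, 2.0

# Everything on the GPU instance: you pay GPU rates even while the
# GPU sits idle during memory-intensive preprocessing.
all_gpu = (prep_hours + train_hours) * GPU_INSTANCE   # $20.00

# Split per step: preprocessing on the cheap memory box, training
# on the GPU box -- the carve-out Goyal describes.
split = prep_hours * MEM_INSTANCE + train_hours * GPU_INSTANCE  # $11.00

assert split < all_gpu
```

In this hypothetical, routing each step to the right instance type nearly halves the bill, and the gap grows with the share of the workflow that doesn't need a GPU.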
Metaflow allows data scientists to use their development tool or framework of choice. They can continue to run TensorFlow, PyTorch, scikit-learn, XGBoost, or any other ML framework they want. While there is a GUI for Metaflow, the primary method of interaction with the product is to include decorators in their Python or R code. At runtime, the decorators determine how the code will execute, Goyal says.
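The general mechanism can be sketched in plain Python: a decorator leaves the function body untouched but attaches metadata that a runtime reads when deciding how to execute the step. This is a simplified stand-in for how Metaflow-style decorators work, not Metaflow's actual API:

```python
import functools

def resources(gpu=0, memory=4096):
    """Toy stand-in for a Metaflow-style resource decorator.

    It doesn't change what the function computes; it attaches resource
    requirements as metadata that a scheduler could inspect at runtime
    to pick an instance type for this step.
    """
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            return fn(*args, **kwargs)
        inner.requirements = {"gpu": gpu, "memory": memory}
        return inner
    return wrap

@resources(gpu=1, memory=16384)
def train_model():
    # Hypothetical training step; the body stays ordinary Python.
    return "trained"

# The decorated code runs unchanged...
assert train_model() == "trained"
# ...but a runtime can read the metadata to schedule the step.
assert train_model.requirements == {"gpu": 1, "memory": 16384}
```

This separation — ordinary code annotated with declarative execution hints — is what lets data scientists keep writing plain Python while the infrastructure decisions happen elsewhere.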
“We are basically targeting people who know data science,” he says. “They don’t want to be taught data science. They’re not looking for a no-code, low-code solution. They’re looking for a solution that firmly puts them in control while abstracting away all infrastructure concerns. That’s where Metaflow comes in.”
Since Netflix initially released Metaflow to open source in 2019, it’s been widely adopted by hundreds of companies. According to the project’s website, it’s being used by Goldman Sachs, Autodesk, Amazon, S&P Global, Dyson, Intel, Zillow, Merck, Warner Media, and Draft Kings. CNN, another user, reports seeing an 8x performance boost in terms of the number of models they put into production over time, Goyal says.
The open source project on GitHub has 7,000 stars, putting it in the top one or two projects in this space, he says. The Slack channel is quite busy, with about 3,000 active members, he says. Initially released for AWS, Metaflow has since been adapted to work on Microsoft Azure and Google Cloud, as well as Kubernetes, Goyal says. It also has been used with hosted clouds from Oracle and Dell.
In 2021, Goyal co-founded Outerbounds with Ville Tuulos, a former Netflix colleague, and Oleg Avdeev, who hailed from MLOps vendor Tecton. Goyal and his team at the San Francisco-based company continue to be the primary developers of the open source Metaflow project. Four months ago, Outerbounds launched a hosted version of Metaflow that allows users to get up and running very quickly on AWS.
Because Outerbounds controls how the infrastructure is deployed, it’s able to offer guarantees around security, performance, and fault tolerance with its managed offering that it can’t with the open source version, Goyal says.
“In open source, we have to make sure that our offering works for every single user out there who wants to use us. In certain specific areas, that’s sort of easier said than done,” he says. “If you’re using our managed offering then we can afford to take certain very specific opinions” about the deployment.
Reducing cloud spending is a big focus for the Outerbounds offering, particularly with the scarcity and expense of GPUs these days. Eventually, the company plans to enable customers to tap into the power of GPUs residing on-prem, provided there is some sort of connection to the hyperscaler of choice.
“As you start scaling machine learning models, things start getting expensive quite quickly and there are a lot of mechanisms to lower that cost quite a bit,” Goyal says. “Do I know that my GPUs are not data starved, for example. How can we make sure that you’re able to move data with the highest throughput possible? That’s more than cloud providers would provide you.”
December 8, 2023