Hugging Face and Databricks Streamline Dataset Creation with Spark
Databricks and Hugging Face have unveiled a new integration that lets users create a Hugging Face dataset directly from an Apache Spark DataFrame.
Databricks wrote these changes and committed them to the Hugging Face repository. The new function, from_spark, lets users employ Spark to efficiently load and transform data for training or fine-tuning a large language model, the company says. Users can then map their Spark DataFrame into a Hugging Face dataset for integration into their training pipelines.
A blog post from the Databricks team explains that the company had been receiving requests from users for an easier way to load a Spark DataFrame into a Hugging Face dataset. Previously, users had to write their data out to Parquet files and then reload them through Hugging Face datasets, because Spark DataFrames were not among the many input formats the platform supported. The team says this prior loading process was tedious and cumbersome, consuming extra time, compute, and money.
Databricks claims the new method enabled by this collaboration cut processing time from 22 minutes down to 12, a reduction of roughly 45%, when tested on a 16 GB dataset.
“As we transition to this new AI paradigm, organizations will need to use their extremely valuable data to augment their AI models if they want to get the best performance within their specific domain,” the Databricks team writes. “This will almost certainly require work in the form of data transformations, and doing this efficiently over large datasets is something Spark was designed to do.”
Apache Spark is a popular data processing framework that leverages parallel computing to enable data processing tasks on very large datasets. Databricks was founded by the original creators of Spark. Its platform is built on top of Spark and adds additional features and optimizations to the core Spark framework.
Hugging Face is known for its open source approach to AI, particularly with natural language processing and Transformer models, and makes its tools and libraries accessible to everyone, from developers and researchers to amateurs and non-technical users.
Databricks says it sees this release as a new avenue to further contribute to the open source community and calls Hugging Face the “de facto repository” for open source models and datasets. The company anticipates this to be the first of many contributions while hinting at future plans to add streaming support through Spark to make dataset loading even faster.
“It’s been great to see Databricks release models and datasets to the community, and now we see them extending that work with direct open source commitment to Hugging Face,” said Hugging Face CEO Clem Delangue in an announcement. “Spark is one of the most efficient engines for working with data at scale, and it’s great to see that users can now benefit from that technology to more effectively fine-tune models from Hugging Face.”