October 24, 2023

Treeverse Releases lakeFS 1.0 to Improve Lifecycle Management of Data Lakes and Integration with Modern Data Stack

NEW YORK and SAN FRANCISCO, Oct. 24, 2023 — Treeverse, the creator of lakeFS, the open-source technology that brings data version control to data lakes, today announced the release of lakeFS 1.0, a scalable and highly reliable system with backward compatible interfaces within the modern data stack.

The explosion in the volume of generated data has left data practitioners with a striking challenge: they lack the tools and systems to scale their practices to the volumes of data they have. For example, today it is commonplace for machine learning (ML) researchers to extract subsets of data to their local environments by using the age-old copy/paste approach for model training. This leaves the data error-prone, difficult to reproduce, and hard to trace. On top of this, cross-team collaboration is nearly impossible, leaving teams chasing data versions.

Treeverse introduced lakeFS, its open-source software (OSS) for scalable data version control, in August 2020 to bring order to the chaos that comes with developing and maintaining data products based on petabytes of data. The OSS version of lakeFS has been adopted by organizations across all business verticals, including large enterprises such as Volvo, Lockheed Martin, Woven Planet by Toyota, EPCOR and Arm.

With the release of version 1.0, lakeFS is officially production grade in stability, security and performance, having been thoroughly vetted in production by thousands of companies in the three years since its initial release. Key capabilities include:

  • lakeFS is the only scalable, high-performance data version control option in the market suitable for enterprise-level data operations.
  • The open-source technology is highly compatible with the emerging standard for data lake architecture in all its components and effortlessly integrates with any existing data lake.
  • lakeFS enables data professionals to manage data as code and increases the efficiency of data engineers, ML/AI practitioners and analysts throughout the lifecycle of the data.
  • From raw data pre-processing, through deduplicated and parallel experimentation, to fully reproducible feature engineering and model training, lakeFS increases data quality and data delivery velocity while reducing storage costs.

“We built lakeFS for high-scale, high-performance data operations that run data pipelines for machine learning, AI and analytics, on unstructured or structured data,” said Dr. Einat Orr, Co-founder and CEO at Treeverse. “This is why we see many enterprises in our OSS community and cloud customer base. Our integrations with the modern data stack are a strategic move towards enterprises with large scales of data.”

“One of the best decisions I’ve made at Enigma was incorporating lakeFS as a foundational element of our data stack,” said Ryan Green, CTO at Enigma. “Data branching allows our model developers to run isolated, reproducible experiments on the complete ML pipeline with minimal friction. This has translated directly to higher development velocity, more iterations and happier customers.”

Integration partners include Databricks, Cloudera, AWS S3, MinIO and Azure Gen2. The lakeFS open-source technology is now available to use with a wide range of tools and frameworks, including Microsoft Azure, Databricks Unity Catalog and Apache Iceberg, as well as orchestration tools such as Apache Airflow, Prefect and Dagster. Forthcoming lakeFS releases are guaranteed to maintain backward compatibility for the life of the 1.x version, meaning existing code and functionality will continue to be fully operational even as new features are added.

“MinIO is the object storage of choice for machine learning workloads, due to its combination of performance and scalability,” said AB Periasamy, CEO at MinIO. “With lakeFS, MinIO can serve as a version controlled data repository for machine learning – from data preprocessing through research – all the way to machine learning production pipelines. The combination of lakeFS and MinIO creates a high quality process for delivering models to all data practitioner stakeholders. The team at lakeFS continues to impress us with the products they are building.”

To learn more about lakeFS 1.0 visit https://lakefs.io.

About Treeverse Inc.

Founded in 2020 by Oz Katz and Dr. Einat Orr, Treeverse is the company behind lakeFS, the open-source technology that brings scalable data version control to data lakes. Treeverse investors include Dell Technologies Capital, Norwest Venture Partners, and Zeev Ventures. To learn more, visit https://treeverse.io.


Source: Treeverse
