March 8, 2024

Apache Arrow Announces DataFusion Comet

Apache Arrow, a software development platform for building high-performance applications, has announced that it has accepted the donation of the Comet project.

Comet is an Apache Spark plugin that uses Apache Arrow DataFusion to improve query efficiency and reduce query runtime. It does this by optimizing query execution and leveraging hardware accelerators.

Because it lets multiple analytics engines share data and accelerates analytical workloads on big data systems, Apache Arrow has become increasingly popular with software developers, data engineers, and data analysts. With Apache Arrow, users of big data processing and analytics engines such as Spark, Drill, and Impala can access data without reformatting it. Comet is similar in aim to other efforts to accelerate Spark with native columnar engines, such as Databricks Photon Engine and the open-source projects Spark RAPIDS and Gluten.

Interestingly, Comet was originally implemented at Apple, and the engineers on that project are also contributors to Apache Arrow DataFusion. The Comet project is designed to replace Spark’s JVM-based SQL execution engine and to offer better performance for a variety of workloads.

The Comet donation will not cause any major disruption for users, as they can still work with the same Spark ecosystem, tools, and APIs. Queries still go through Spark’s SQL planner, task scheduler, and cluster manager. However, execution is delegated to Comet, which is designed to be more efficient than a JVM-based implementation. This means better performance with no change in Spark behavior from the end users’ point of view.
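To illustrate the point that adopting Comet is a configuration change and not an application change, here is a minimal Python sketch that assembles a `spark-submit` command with Comet enabled. The configuration keys (`spark.plugins`, `spark.comet.enabled`, `spark.comet.exec.enabled`) and the plugin class name are assumptions based on the Comet user guide and may differ between releases; the jar path, the application name, and the helper `spark_submit_command` are hypothetical. A real deployment may need additional settings.

```python
import shlex

# Assumed Comet settings. Key names and the plugin class follow the Comet
# user guide as understood here; check them against the release you use.
COMET_CONF = {
    "spark.plugins": "org.apache.spark.CometPlugin",
    "spark.comet.enabled": "true",
    "spark.comet.exec.enabled": "true",
}


def spark_submit_command(app, comet_jar, extra_conf=None):
    """Build a spark-submit argument list that enables Comet.

    Hypothetical helper: the application itself is passed through
    unchanged, only the submit-time configuration differs.
    """
    conf = dict(COMET_CONF)
    conf.update(extra_conf or {})
    args = ["spark-submit", "--jars", comet_jar]
    # Sort keys so the generated command is deterministic.
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # Placeholder paths; substitute the real jar and application.
    cmd = spark_submit_command("my_job.py", "/path/to/comet-spark.jar")
    print(shlex.join(cmd))
```

The Spark job in `my_job.py` would contain ordinary Spark SQL or DataFrame code; under these assumptions, nothing in it refers to Comet.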


Comet is designed to support the full set of Spark operators and built-in expressions. It also offers a native Parquet implementation for both the reader and the writer. Users can also use its UDF framework to migrate existing UDFs to native code.

Because different applications store data differently, developers often have to organize information in memory manually to speed up processing, which takes extra effort and time. Apache Arrow helps solve this problem by making data applications faster, so organizations can quickly extract more useful insights from their business data, and by enabling applications to exchange data with one another easily.

Wes McKinney, co-founder of Apache Arrow, was one of Datanami’s People to Watch 2018. In an interview with Datanami that year, McKinney shared that as big data systems continue to mature, he hoped to see “increased ecosystem-spanning collaborations on projects like Arrow to help with platform interoperability and architectural simplification. I believe that this defragmentation, so to speak, will make the whole ecosystem more productive and successful using open source big data technologies.”

The Comet donation gives Apache Arrow a chance to accelerate its development and grow its community. Given the current momentum toward accelerating Spark through native vectorized execution, Apache believes that open-sourcing Comet will benefit other Spark users.


 
