Apache Arrow Takes ‘Flight’ with Big Data Net
A data transport framework released by the Apache Arrow community aims to alleviate some of the pain associated with accessing large data sets over networks.
Apache Arrow Flight is described as a general-purpose, client-server framework intended to ease high-performance transport of big data over network interfaces. A recent release of Apache Arrow includes Flight implementations in C++ (with Python bindings) and Java.
“One of the biggest features that sets apart Flight from other data transport frameworks is parallel transfers, allowing data to be streamed to or from a cluster of servers simultaneously,” Apache Arrow evangelist Wes McKinney explained in a blog post unveiling Arrow Flight. “This enables developers to more easily create scalable data services that can serve a growing client base.”
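The parallel-transfer idea can be illustrated with a minimal sketch that uses only the Python standard library. It does not call Flight itself: the node names, the `ENDPOINTS` table and the `fetch_partition` function are hypothetical stand-ins for a cluster in which each server holds one slice of a data set and the client pulls all slices at once.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each node name maps to the slice of rows it holds.
ENDPOINTS = {
    "node-a": [1, 2, 3],
    "node-b": [4, 5, 6],
    "node-c": [7, 8, 9],
}


def fetch_partition(node):
    """Stand-in for one network stream; a real client would open a data stream to the node here."""
    return ENDPOINTS[node]


def fetch_all_parallel(nodes):
    """Request every partition concurrently and stitch the results together."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        # pool.map returns results in the same order as the input nodes.
        parts = list(pool.map(fetch_partition, nodes))
    return [row for part in parts for row in part]


rows = fetch_all_parallel(["node-a", "node-b", "node-c"])
print(rows)
```

Because each partition is fetched on its own worker, adding servers adds streams rather than lengthening a single one, which is the scaling property McKinney describes.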
Developers note that the performance of standard network protocols can vary significantly depending on use case. Flight is designed as a new protocol for data services that uses the Apache Arrow columnar format both as the data representation sent over the network and as the public API presented to developers. The approach seeks to reduce serialization penalties associated with data transport while increasing the overall efficiency of distributed data platforms, McKinney said.
Flight libraries allow developers to roll out networking services capable of sending or receiving data streams. Among the request types a Flight server can handle are listing the available data streams, returning the schema of a data stream and sending a requested data stream to a client.
Early benchmark testing of the C++ version of Flight delivered throughput of 2 to 3 gigabytes per second (GB/s), transferring about 12 gigabytes of data in roughly four seconds.
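The arithmetic behind the quoted transfer is worth spelling out, since bytes and bits are easily confused: 12 gigabytes in about four seconds is roughly 3 gigabytes per second, or 24 gigabits per second. The function names below are illustrative only.

```python
def throughput_gb_per_s(gigabytes, seconds):
    """Average transfer rate in gigabytes per second."""
    return gigabytes / seconds


def gigabytes_to_gigabits(gigabytes):
    """One byte is eight bits."""
    return gigabytes * 8


rate_gbytes = throughput_gb_per_s(12, 4)          # 12 GB moved in 4 s
rate_gbits = gigabytes_to_gigabits(rate_gbytes)   # same rate expressed in bits
print(f"{rate_gbytes} GB/s = {rate_gbits} Gb/s")
```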
Flight’s proponents note that many distributed database systems transport data sets multiple times in delivering them to clients. That approach “presents a scalability problem for getting access to very large data sets,” McKinney said. “We wanted Flight to enable systems to create horizontally scalable data services without having to deal with such bottlenecks.”
Flight libraries are deemed sufficiently mature for beta users, though developers expect some “minor” API or protocol changes as Flight is wrung out. Examples of a Flight client and server using the Python API are available in the Arrow project’s code repository.
Meanwhile, Arrow community member and data lake specialist Dremio has developed a connector based on Arrow Flight that delivered as much as a 50-fold performance increase over the Open Database Connectivity standard API. McKinney said a data source implementation aimed at Apache Spark users connects to Flight-based network endpoints.
Future development is also expected to focus on creating data services enabled by the data transport scheme. “Since Flight is a development framework, we expect that user-facing APIs will utilize a layer of API veneer that hides many general Flight details and details related to a particular application of Flight in a custom data service,” McKinney added.