These Cabbies Yield…for Big Data
Riding in a New York City cab can be a harrowing, life-or-death experience. But as the result of a recent study of taxi traffic patterns in the Big Apple, your wild urban adventure is helping to unearth clues about how best to crunch massive amounts of location data, and could pave the way for more elaborate ways that GPU computing can crunch big location data in the future.
University researchers from New York and Oklahoma recently dug into the location data collected from more than 170 million taxi rides taken in 2009. That year, the city collected location data from more than 13,000 GPS-equipped cabs, which made an average of 500,000 trips per day.
The researchers’ goals were twofold. First, they wanted to determine where, exactly, all those trips started and ended, for the purpose of assisting city planners (the results might surprise you). Secondly–and more importantly–they wanted to flesh out new massively data parallel techniques to speed up the complex spatial query tasks.
The amount of data here is so big that existing approaches to crunching location data, such as Geographical Information Systems (GIS), Spatial Databases (SDB), and Moving Object Databases (MOD), simply don’t work. “The huge data volumes have prevented these existing technologies, which are mostly designed for disk-resident systems based on serial algorithms and running on uniprocessors, from achieving performance close to real-time responses to support interactive inquiries,” write the three researchers: Jianting Zhang from City College of New York, Simin You of CUNY, and Le Gruenwald from the University of Oklahoma.
Trying to tackle this problem in a linear, head-on manner is a good way to induce blunt-force trauma of the big data variety. “Our previous experiments have shown that simply uploading the raw data of 170 million taxi trip records in a PostgreSQL database and creat[ing] a geometry column for the pickup locations would cost 100+ hours on a high-end workstation with 48 GB memory and reasonably up-to-date dual Intel Xeon quadcore,” the researchers wrote.
The Hadoop/MapReduce framework has become a popular place to crunch such data in a parallel way. But in this case, the researchers chose to turn this into a graphic processing equation, using GPUS for general purpose computing (or GPGPU).
Here’s how they approached it. First, they adopted land-use types (as determined by New York’s friendly planning commission) as a proxy for trip purposes at the pickup and drop-off locations. Then, they converted the land use types in NYC into three-quarters of a million tax lot polygons sporting 4.7 million vertices. Finally, they formulated a “large-scale nearest neighbor spatial query problem” based on point-to-polygon distance. Then they let the GPUs fly.
It worked. “Experiments have shown that our GPU implementations can complete such complex spatial queries in about 50-75 seconds using an inexpensive commodity GPU device,” the researchers say. “The performance is 10X-20X higher than the host machine with an Intel dual-core CPU when all the cores and hardware supported threads are fully used.”
Can the GPU approach scale? Up to a point, the researchers say. “While the shared-memory parallel hardware of GPU systems are generally considered lacking good scalability when compared with shared-nothing architectures, we argue that the large numbers of processing cores and the high memory bandwidth available on GPUs have made them competitive in solving big data problems up to a certain scale,” they write.
Going forward, the researchers will be looking to integrate the GPU approach into the Hadoop/MapReduce infrastructure to solve problems of a larger scale.
By the way, the researchers also figured out where all those cabs are headed in NYC. The top three pickup locations, as depicted according to land use types (LUTs), are Commercial & Office Buildings, Open Space & Outdoor Recreation, and Mixed Residential & Commercial Buildings, respectively. Those three LUTS are also the top drop-off points, in the same order.
The researchers’ paper, titled “High-Performance Spatial Query Processing on Big Taxi Trip Data using GPGPUs,” was recently published on the City University of New York’s website.