The Here and Now of Big Geospatial Data
No matter how sophisticated information technology gets – and who can deny that IT is evolving exceptionally fast these days – there’s nothing that can replicate the combination of two unique pieces of data: time and place. That’s why the place for geospatial data is here, and the time is now.
The advent of big data analytics has enabled companies to answer all sorts of questions that they couldn’t have before, like precisely who bought exactly what, and when. As data practitioners get deeper into the technology – and especially as they dive into the world of real-time analytics driven by smartphones and other devices connected to the Internet of Things (IoT) – they’re increasingly turning to geospatial data to optimize the delivery of products and services for people as they move about the real world.
In a recent Forrester Wave report on geospatial tools, Forrester analyst Rowan Curran writes that geospatial insights are not only involved in collecting data for sales, CRM, customer support, HR, and marketing initiatives, but they’re also used in the delivery of services.
“These allow companies to use spatial data to drive unprecedented levels of understanding and analysis of users’ habits and behavior,” Curran writes, “but they also provide the platforms to deliver the messaging, content, and other actions directly to users in the most appropriate context.”
The potential applications of geospatial data are vast. Consider these recent real-world examples:
- Logistics: The United States Postal Service is using big geospatial analytics to optimize mail route planning and reduce delivery times;
- Fraud detection: By tracking the location of attempted credit card transactions—and specifically the physical distance between them—banks have a new tool for detecting fraudulent activity in real time;
- Retail: Chains like Macy’s are turning to location-sensing technology to deliver a better in-store experience to customers and challenge ecommerce sites for business;
- Finance: Investors are turning to satellite or drone-based imagery as a source of data to inform decisions, such as assessing the valuation of commodities trades or predicting consumer demand;
- Shipping: Tracking the movement of about 21 million shipping containers atop 100,000 ships in the maritime fleet and using machine learning algorithms to optimize their flow can save millions of dollars;
- Advertising: American Express generated promotions to customers based on purchase history and location, thanks to a geo-tagging solution from Foursquare;
- Entertainment: Pokémon GO showed how the overlay of cyberspace upon the real world can deliver a compelling augmented reality (AR) experience;
- Journalism: Reporters and editors are turning to advanced geospatial tools like OpenStreetMap to help tell compelling stories.
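The fraud-detection idea above – flagging card transactions that are physically too far apart, too close together in time, to be legitimate – is simple enough to sketch. Below is a minimal, illustrative Python version; the function names and the 900 km/h airliner-speed threshold are our assumptions, not any bank’s actual rules:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def is_impossible_travel(txn_a, txn_b, max_speed_kmh=900.0):
    """Flag a pair of transactions whose implied travel speed exceeds max_speed_kmh.

    Each transaction is a (lat, lon, unix_seconds) tuple; 900 km/h is roughly
    airliner speed, so anything faster suggests the card is in two places at once.
    """
    dist_km = haversine_km(txn_a[0], txn_a[1], txn_b[0], txn_b[1])
    hours = abs(txn_b[2] - txn_a[2]) / 3600.0
    if hours == 0:
        return dist_km > 0  # same instant, different place
    return dist_km / hours > max_speed_kmh

# A card swiped in New York, then in Los Angeles ten minutes later:
ny = (40.7128, -74.0060, 1_700_000_000)
la = (34.0522, -118.2437, 1_700_000_600)
print(is_impossible_travel(ny, la))  # True
```

In practice a bank would run this kind of check inside a streaming pipeline against the cardholder’s previous transaction, but the geometry at the core is just this distance-over-time test.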
There’s no doubt that geospatial data brings a big potential upside for delivering much-needed context to decision making in all sorts of areas. No matter how much of our lives now exist in cyberspace, our terrestrial ties make it important to know where and when people and things exist in the real world.
Big Geospatial Challenges
However, geospatial data presents a unique set of challenges as well. Depending on how often a tracked device emits its location, the sheer volume and velocity of the data can be the first barrier to successfully leveraging it. Traditional relational databases from the likes of Oracle and IBM support geographic data types and queries, often through extensions to the core database.
But these scale-up databases are largely seen as insufficient for the scale of emerging big data use cases. Increasingly, scale-out databases are being used to track big, high-velocity data. It’s no wonder that NoSQL databases are being asked to serve and process geolocation data.
MongoDB, for example, supports storing geolocation data in a JSON document, and also supports some geo-specific query types. Redis, the super-fast key-value store, has also proven itself adept at storing and serving the two key pieces of data required in geospatial computing – the X and Y coordinates – in what Redis calls a Geo Set. Geospatial capabilities can also be found in document and wide-column NoSQL databases from Aerospike, DataStax, and Couchbase, in addition to graph stores from the likes of Neo Technology and MarkLogic.
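Under the hood, many of these systems reduce a two-dimensional coordinate to a single sortable integer by interleaving the bits of longitude and latitude – the geohash-style trick behind Redis’s Geo Set, which keeps a 52-bit encoded score in a sorted set. Here is a simplified sketch of that interleaving in Python; it is illustrative only and not Redis’s exact encoding:

```python
def interleaved_score(lon, lat, bits=26):
    """Encode (lon, lat) into one integer by repeatedly bisecting each
    coordinate's range and interleaving the resulting bits.

    With bits=26 per axis this yields a 52-bit value, similar in spirit
    to the score Redis stores in the sorted set backing a Geo Set.
    Nearby points tend to share high-order bits, which is what makes
    range scans over the scores useful for proximity queries.
    """
    lon_min, lon_max = -180.0, 180.0
    lat_min, lat_max = -90.0, 90.0
    score = 0
    for _ in range(bits):
        # One longitude bit: which half of the current range is lon in?
        mid = (lon_min + lon_max) / 2
        bit = 1 if lon >= mid else 0
        score = (score << 1) | bit
        if bit:
            lon_min = mid
        else:
            lon_max = mid
        # One latitude bit, same bisection on the latitude range.
        mid = (lat_min + lat_max) / 2
        bit = 1 if lat >= mid else 0
        score = (score << 1) | bit
        if bit:
            lat_min = mid
        else:
            lat_max = mid
    return score

print(hex(interleaved_score(-122.4194, 37.7749)))  # San Francisco, as one 52-bit key
```

Because the encoding is just a one-dimensional integer, an ordinary sorted index – a B-tree, a skip list, a sorted set – can answer “what’s near here?” questions without any special spatial machinery.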
By some estimates, up to 80% of all data being generated today has a geospatial component. (This is likely because video – the biggest data of them all – is automatically geotagged when it’s shot on a smartphone.) In a business setting, companies may turn to Hadoop to extract insights from geospatial data.
This is precisely what the US trucking company US Xpress is doing. According to a Deutsche Bank white paper, US Xpress is using Hadoop to process and analyze a range of data collected from trucks, including geospatial data, as well as data from tire pressure monitors and engine monitors. According to the bank, the trucking firm is saving millions of dollars per year.
Specialized Geospatial Databases
General-purpose data systems like Hadoop, NoSQL, and relational databases, however, are not well-suited for many geospatial use cases. Increasingly, the difficulty of storing geolocation data in such systems has given rise to a collection of specialized databases geared specifically toward geospatial workloads.
The giant in the geographic information system (GIS) space is California-based Esri, whose ArcGIS product underlies many geo-powered applications. In the open source arena, PostGIS, which overlays a geospatial component atop the Postgres relational database, has a large following. Setting standards in the space is the Open Geospatial Consortium (OGC), whose goal is to “empower technology developers to make complex spatial information and services accessible and useful with all kinds of applications.”
Databases from Space-Time Insight, CARTO, and SpatialDB are also helping to make processing geospatial data easier. J. Andrew Rogers, who helped build Google Earth, found that the PostGIS tool was insufficient for the work he was trying to do, so he developed his own sharded geospatial engine called SpaceCurve.
Still, other vendors are taking entirely new approaches to ingesting and processing geospatial data at scale. One of the up-and-coming firms to keep an eye on is Kinetica (formerly GIS Federal). The company’s GPU-powered database, called GPUdb, has been adopted by the USPS, which recently installed tracking devices on about 200,000 mail delivery vehicles as part of its geospatial program. The devices emit a ping every minute, which adds up to about 250 million location data points collected each day. To enable queries on all that data, USPS tapped GPUdb, which runs on a cluster of about 200 nodes equipped with x86 CPUs and GPUs.
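Those USPS numbers are easy to sanity-check with back-of-the-envelope arithmetic: 200,000 vehicles pinging once a minute around the clock would produce 288 million points a day, so the reported figure of about 250 million implies the fleet pings for most, though not all, of each day. (The round-the-clock assumption below is ours, for illustration.)

```python
# Back-of-the-envelope check of the USPS data volume cited above.
vehicles = 200_000          # tracked mail delivery vehicles
pings_per_minute = 1        # one location ping per vehicle per minute
minutes_per_day = 24 * 60   # assumes round-the-clock pinging (an upper bound)

points_per_day = vehicles * pings_per_minute * minutes_per_day
print(f"{points_per_day:,} points/day")  # 288,000,000 points/day
```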
Another firm enabling customers to visualize big geospatial data on the fly is MapD. The company, which was spun out of Todd Mostak’s graduate project at MIT, fuses a GPU-based database with a collection of visualization tools to enable users to work with huge geospatial data sets at interactive speed (look for an upcoming feature story from Datanami on MapD).
As the cyber and physical worlds become more intertwined, we’ll increasingly look to geospatial data to track the location of people and things as they move and to power a new class of location-based services. However, some aspects of geospatial data make it difficult to work with. Companies that can master big geospatial data and integrate it with user-facing apps will hold a competitive edge for the foreseeable future.