November 11, 2015

CMU’s Visualization Tool Targets ‘Dimensional’ Data

Among the challenges posed by exploding volumes of raw data is the emergence of what data scientists call “high-dimensional” data, that is, data with many parameters. Genomic and demographic data sets, for example, are becoming so complex that automated tools are running out of steam.

Now comes a web-based tool called Explorable Visual Analytics, or EVA, developed by Carnegie Mellon University researchers. EVA is based on a new computer architecture that combines speed with data compression to allow analysts to make sense of high-dimensional data.

Researchers at Carnegie Mellon’s Robotics Institute in Pittsburgh reported this week they have refined the EVA design by crunching a multi-dimensional database on the U.S. workforce drawn from the U.S. Census Bureau’s Longitudinal Employer-Household Dynamics program.

“We can explore massive data sets without downloading them all,” noted EVA co-developer Saman Amirpour Amraii, a senior system and software engineer at the Robotics Institute. Hence, the bulk of the data can reside in an external network while EVA downloads only that portion being analyzed.

In one scenario, the tool can grab map-based data, process it and display only the desired data in the highest resolution possible. The tool uses a data processing pipeline that includes pre-processing and caching of data on servers. The data is then compressed to avoid communications bottlenecks and cached on the client computer to improve responsiveness.

The process delivers only a small portion of the data to the client computer while providing users with the “illusion” they are working with a massive data set, the developers said.
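
The general pattern described here (pre-process on the server, compress for transfer, cache on the client, fetch only what a query touches) can be sketched in a few lines of Python. This is an illustration only, not EVA's actual code: the names `build_tiles` and `TileClient`, the fixed-size tiling scheme, the use of zlib and JSON, and the small least-recently-used cache are all assumptions made for the example.

```python
import json
import zlib
from collections import OrderedDict


def build_tiles(rows, tile_size):
    """Hypothetical server-side step: split rows into compressed tiles."""
    tiles = {}
    for start in range(0, len(rows), tile_size):
        chunk = rows[start:start + tile_size]
        # Compress each tile so less data crosses the network.
        tiles[start // tile_size] = zlib.compress(json.dumps(chunk).encode("utf-8"))
    return tiles


class TileClient:
    """Hypothetical client: fetches only the tiles a query needs and caches them."""

    def __init__(self, server_tiles, tile_size, cache_size=4):
        self.server_tiles = server_tiles  # stands in for a remote server
        self.tile_size = tile_size
        self.cache_size = cache_size
        self.cache = OrderedDict()        # tile index -> decompressed rows
        self.fetches = 0                  # counts simulated network requests

    def _get_tile(self, index):
        if index in self.cache:
            self.cache.move_to_end(index)  # mark as most recently used
            return self.cache[index]
        self.fetches += 1
        payload = self.server_tiles[index]
        rows = json.loads(zlib.decompress(payload).decode("utf-8"))
        self.cache[index] = rows
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least recently used tile
        return rows

    def query(self, start, stop):
        """Return rows[start:stop], touching only the overlapping tiles."""
        out = []
        if stop <= start:
            return out
        first = start // self.tile_size
        last = (stop - 1) // self.tile_size
        for index in range(first, last + 1):
            tile = self._get_tile(index)
            base = index * self.tile_size
            out.extend(tile[max(start - base, 0):stop - base])
        return out


rows = list(range(1000))                 # stand-in for a large data set
client = TileClient(build_tiles(rows, 100), tile_size=100)
window = client.query(250, 320)          # needs only tiles 2 and 3
repeat = client.query(260, 300)          # served entirely from the client cache
```

In this sketch the client handles 2 of the 10 tiles yet can answer any query inside that window, which is the "illusion" of working with the full data set on a small scale.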

EVA’s faster response times mean users can analyze different data parameters using the most appropriate visualizations.

The system was fine-tuned using the 100-gigabyte Census Bureau workforce database, which includes detailed demographic and employment parameters. The researchers used the database to analyze the racial demographics of neighborhoods in Philadelphia and determine whether those neighborhoods became more integrated or more segregated over the course of several years.

The researchers also used data from the World Resources Institute to analyze deforestation trends as well as data from the U.S. Department of Transportation.

While improving the ability to find patterns in high-dimensional data, the researchers said their EVA approach could eventually be used as a general-purpose numerical data visualization tool for business intelligence applications.

The primary goal of the EVA project was to maintain system performance and response times as data sets grow larger. “If it takes a half hour to get an answer to your query, you may forget why you asked in the first place,” noted Randy Sargent, a senior systems scientist at Carnegie Mellon.

EVA also allows users to share their conclusions as well as the analysis used to reach them. The tool preserves links to the underlying data so others looking at the same data can consider alternative analyses.

“It keeps the presenters honest,” said Sargent. “If they cherry pick the data, it will be obvious.”

Google Inc. (NASDAQ: GOOG) sponsored the visual analytics research.
