January 6, 2014

NCSA Project Aims to Create a DNS-Like Service for Data

Alex Woodie

Researchers at the University of Illinois and its National Center for Supercomputing Applications (NCSA) are working to create a “DNS server for data” that would automatically add structure to unstructured data as users come across it in the wild. The researchers recently received a $10.5-million National Science Foundation grant to develop the software as part of the “Brown Dog” project.

The lack of metadata or other data that adds structure to a digital file is the source of much consternation in our highly connected world. Without a way to identify the format of a given piece of unstructured or semi-structured data–such as video, audio, images, or even handwritten documents–the value of that information remains locked away, and becomes essentially useless. It’s a problem that affects all computer users, and impacts all forms of data, whether it was created one second ago or one century ago.

It’s a particularly vexing issue for those tasked with curating data for long-term storage, such as the US National Archives and Records Administration (NARA). Researchers at the NCSA’s Image and Spatial Data Analysis Division have worked with NARA to help it deal with the glut of random pieces of information it receives, according to Kenton McHenry, Ph.D., a research scientist and Adjunct Assistant Professor of Computer Science at the university.

“We’ve been working with unstructured, uncurated data, for some time now,” McHenry tells Datanami. “We’ve had a lot of work with NARA to deal with the kind of data that they get. They get data dumped on them–hard drives, disks, CDs, whatever–and they’re told to preserve it for long periods of time, for all the big government agencies.”

McHenry’s group works with other government institutions as well, including the NSF, and is tasked with helping them curate unstructured collections of files, including images, scanned documents, old maps, paintings, pictures, and even census figures. Such “long-tailed” data could potentially be of great use in the scientific and academic arenas. But without any built-in structure or metadata to inform the user as to its contents, the data is just one big mess.

McHenry and his colleagues at NCSA began the Brown Dog project in the hope of finding a way to add structure to this unstructured data. Brown Dog is a “super mutt” of various existing software tools that attack pieces of the problem of unstructured data, explains McHenry, the project leader. The idea is to create an easy-to-use and extensible framework that allows users to tap into those existing tools.

Brown Dog is to be composed of two main pieces: the Data Access Proxy (DAP) and the Data Tilling Service (DTS). The DAP will enable users to gain access to data that’s stored across a large number of different file formats, while the DTS will be used to automate the process of adding structure to the data contained in the files. The DTS will make use of two existing data detection and extraction tools: Versus, a content-based comparison framework, and Medici, an extraction service.

The Medici component will allow users to detect patterns in human-readable data, such as audio, video, and text. Depending on the classifiers that a user sets up, the software would automatically tag pieces of data that match a classifier. This would allow the software to identify pieces of pertinent information, whether it’s species of trees or faces or types of dogs, McHenry says. “So all of these classifiers would exist, and whenever a classifier fires off and says ‘Oh I detected something,’ it would associate itself with that image in the form of a tag that says ‘I detected X’ or ‘I detected Y,’” he says. “And so you can use that text to search through that kind of unstructured content.”
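The tagging idea McHenry describes can be illustrated with a minimal Python sketch. It is not Medici's actual API: the `tag_items` and `search` functions, the file names, and the string-matching "classifiers" are all hypothetical stand-ins, and real classifiers would inspect pixels or audio rather than toy text descriptions.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

# Hypothetical classifier type: takes an item's content and returns True
# when the classifier "fires" on it.
Classifier = Callable[[str], bool]


def tag_items(items: Dict[str, str],
              classifiers: Dict[str, Classifier]) -> Dict[str, Set[str]]:
    """Run every classifier over every item; record a tag for each one that fires."""
    tags: Dict[str, Set[str]] = defaultdict(set)
    for item_id, content in items.items():
        for label, fires in classifiers.items():
            if fires(content):
                tags[item_id].add(label)
    return dict(tags)


def search(tags: Dict[str, Set[str]], query: str) -> List[str]:
    """Return the ids of all items carrying the queried tag, in sorted order."""
    return sorted(item_id for item_id, labels in tags.items() if query in labels)


# Toy data (assumption): short descriptions stand in for real image content.
items = {
    "img1.jpg": "oak tree beside a river",
    "img2.jpg": "portrait of a face",
    "img3.jpg": "dog under an oak tree",
}
classifiers = {
    "tree": lambda content: "tree" in content,
    "face": lambda content: "face" in content,
    "dog": lambda content: "dog" in content,
}

tags = tag_items(items, classifiers)
print(search(tags, "tree"))  # ['img1.jpg', 'img3.jpg']
```

Once the tags exist, a plain text query is enough to find otherwise opaque files, which is the effect the quote describes.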

The Versus component, meanwhile, would work against information that’s not human-readable. “You extract some sort of numerical representation that somehow, semantically is associated with the contents of the data in meaningful way,” McHenry says. “Basically you’d get a database of content–for example, images. Then you have a query image that you want to search the database with. It’s not text–it’s an example of that content. And then it returns to you the top 10 or top 100 most similar-looking things as compared to that, via that signature, which is used to do the comparisons.”
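A rough Python sketch of this query-by-example pattern follows. It does not reproduce Versus itself: the intensity-histogram signature, the Euclidean distance, and the `signature`, `distance`, and `top_k` names are assumptions chosen only to show how a numerical signature can rank a database against a query item.

```python
import math
from typing import Dict, List, Sequence, Tuple


def signature(pixels: Sequence[int], bins: int = 4, max_value: int = 256) -> List[float]:
    """Toy signature (assumption): a normalized histogram of pixel intensities."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // max_value, bins - 1)] += 1
    total = len(pixels) or 1
    return [c / total for c in counts]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two signatures; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: Sequence[int],
          database: Dict[str, Sequence[int]],
          k: int = 10) -> List[Tuple[str, float]]:
    """Rank every database item by signature distance to the query; keep the k closest."""
    q = signature(query)
    scored = [(name, distance(q, signature(px))) for name, px in database.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "images" (assumption): short lists of grayscale values.
database = {
    "dark.png": [10, 20, 30, 40],
    "bright.png": [200, 220, 240, 250],
    "mixed.png": [10, 100, 150, 250],
}
query = [5, 15, 25, 35]  # a dark example image

print([name for name, _ in top_k(query, database, k=2)])  # ['dark.png', 'mixed.png']
```

The query here is itself a piece of content rather than text, and the ranking depends only on the signatures, so a richer descriptor could be swapped in without changing the search code.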

With the Versus and Medici engines powering the DTS, users would have a quick way to add some semblance of context to all types of content, including audio, video, image, and text files, McHenry says. DTS will effectively enable users to build an indexing system based on a sample of a given piece of unstructured information.

McHenry likens this approach to building a DNS (Domain Name System) server for data. “The idea is you would just call the service once in a while on your data, and you would access it through your browser or applications,” he says. “And basically it would deal with these aspects of formats and unstructured data for you. Your application doesn’t have to worry about it itself. It just calls these services to get out keywords, to get out formats, and so forth.”

Users would no longer need to hunt for the right program to open up an ancient image based on an archaic file format. “In your browser, you don’t have to worry about file formats anymore. You don’t have to worry about key words associated with images or unstructured data,” McHenry says. “It just kind of works, so file formats get automatically translated to what’s available on your machine. You could search through unstructured data like images just by typing in some text, or by giving it example images.”
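One way to picture automatic format translation is as a search over a graph of available converters. The article does not say how Brown Dog does this internally, so the Python sketch below is purely illustrative: the `conversion_path` function, the converter table, and the format names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def conversion_path(converters: Dict[str, Set[str]],
                    source: str,
                    supported: Set[str]) -> Optional[List[str]]:
    """Breadth-first search for the shortest chain of conversions from
    `source` to any format the local machine supports."""
    if source in supported:
        return [source]
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        # Sort neighbours so the result is deterministic.
        for nxt in sorted(converters.get(path[-1], ())):
            if nxt in seen:
                continue
            if nxt in supported:
                return path + [nxt]
            seen.add(nxt)
            queue.append(path + [nxt])
    return None  # no chain of known converters reaches a supported format


# Hypothetical converter table: each key can be converted to the listed formats.
converters = {
    "pcx": {"bmp"},
    "bmp": {"png", "tiff"},
    "tiff": {"pdf"},
}

print(conversion_path(converters, "pcx", {"png", "jpg"}))  # ['pcx', 'bmp', 'png']
```

Under this model, adding support for a new archaic format only requires registering one converter that reaches any format already in the table.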

This approach would bear structured fruit not just for the academic and scientific communities, but for any group of users that deals with large amounts of unstructured data: in other words, just about anybody who has struggled to work with big sets of it.

The capability to automatically give some semblance of structure to large amounts of unstructured and semi-structured data would have any number of uses in the big data analytics world. Much of the content generated on social media websites, after all, is unstructured. Many technologies have been developed to help users get a handle on these huge social data sets, including NoSQL databases and graphing engines. Brown Dog could potentially provide one more way to tackle this problem.

“For us, big data isn’t just the fact that it’s a bunch of bytes. It’s also the fact that it’s uncurated data–data with no metadata, no useful file names, no useful directory structure, just a mess of files,” McHenry says. “It’s not big in terms of petabytes, but it’s big in the sense that a human can’t do what they need to do to curate it, to preserve that data.”

In addition to McHenry’s team at the NCSA, researchers at the University of Illinois at Urbana-Champaign, Boston University, and the University of North Carolina at Chapel Hill are working on Brown Dog. The NSF grant is good for five years, after which McHenry hopes Brown Dog will continue as an open source project.
