NCSA Project Aims to Create a DNS-Like Service for Data
Researchers at the University of Illinois and its National Center for Supercomputing Applications (NCSA) are working to create a “DNS server for data” that would automatically add structure to unstructured data as users come across it in the wild. The researchers recently received a $10.5 million National Science Foundation grant to develop the software as part of the “Brown Dog” project.
The lack of metadata or other information that adds structure to a digital file is the source of much consternation in our highly connected world. Without a way to identify the format of a given piece of unstructured or semi-structured data–such as video, audio, images, or even handwritten documents–the value of that information remains locked away and essentially useless. It’s a problem that affects all computer users and all forms of data, whether it was created one second ago or one century ago.
It’s a particularly vexing issue for those tasked with curating data for long-term storage, such as the US National Archives and Records Administration (NARA). Researchers at the NCSA’s Image and Spatial Data Analysis Division have worked with NARA to help the agency deal with the glut of random pieces of information it receives, according to Kenton McHenry, Ph.D., a research scientist and Adjunct Assistant Professor of Computer Science at the university.
“We’ve been working with unstructured, uncurated data for some time now,” McHenry tells Datanami. “We’ve had a lot of work with NARA to deal with the kind of data that they get. They get data dumped on them–hard drives, disks, CDs, whatever–and they’re told to preserve it for long periods of time, for all the big government agencies.”
McHenry’s group works with other government institutions as well, including the NSF and others, and is tasked with helping them curate unstructured collections of files, including images, scanned documents, old maps, paintings, pictures, and even census figures. Such “long-tail” data could potentially be of great use in the scientific and academic arenas. But without any built-in structure or metadata to inform the user as to its contents, the data is just one big mess.
McHenry and his colleagues at NCSA began the Brown Dog project with the hope of finding a way to add structure to this unstructured data. Brown Dog is a “super mutt” of various existing software tools that attack pieces of the problem of unstructured data, explains McHenry, the project leader. The idea is to create an easy-to-use, extensible framework that allows users to tap into those existing tools.
Brown Dog is to be composed of two main pieces: the Data Access Proxy (DAP) and the Data Tilling Service (DTS). The DAP will enable users to gain access to data that’s stored across a large number of different file formats, while the DTS will automate the process of adding structure to the data contained in those files. The DTS will make use of two existing data detection and extraction tools: Versus, a content-based comparison framework, and Medici, an extraction service.
The Medici component will allow users to detect patterns in human readable data, such as audio, video, and text. Depending on the classifiers that a user sets up, the software would automatically tag pieces of data that match the classifier. This would allow the software to identify pieces of pertinent information, whether it’s species of trees or faces or types of dogs, McHenry says. “So all of these classifiers would exist, and whenever a classifier fires off and says ‘Oh I detected something,’ it would associate itself with that image in the form of a tag that says ‘I detected X’ or ‘I detected Y,’” McHenry says. “And so you can use that text to search through that kind of unstructured content.”
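To make that tagging idea concrete, here is a minimal sketch in Python of how such a classifier-driven workflow might look. The classifier functions, names, and threshold below are illustrative stand-ins for this article, not Brown Dog’s actual components or API:

```python
# Hypothetical sketch of the auto-tagging idea described above.
# The classifiers here are toy stand-ins for real image/audio/text models.

from typing import Callable, Dict, List

# Each "classifier" inspects raw bytes and returns a confidence score in [0, 1].
Classifier = Callable[[bytes], float]

def looks_like_face(data: bytes) -> float:
    return 0.9 if b"face" in data else 0.0      # stand-in for a real face detector

def looks_like_tree(data: bytes) -> float:
    return 0.8 if b"tree" in data else 0.0      # stand-in for a real species classifier

CLASSIFIERS: Dict[str, Classifier] = {"face": looks_like_face, "tree": looks_like_tree}

def tag_file(data: bytes, threshold: float = 0.5) -> List[str]:
    """Return a text tag for every classifier that 'fires' on this file."""
    return [f"detected:{name}" for name, clf in CLASSIFIERS.items() if clf(data) >= threshold]

# The resulting tags can then be indexed, so unstructured content becomes text-searchable.
print(tag_file(b"photo with a face and a tree"))   # ['detected:face', 'detected:tree']
```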
The Versus component, meanwhile, would work on information that isn’t human-readable. “You extract some sort of numerical representation that somehow, semantically, is associated with the contents of the data in a meaningful way,” McHenry says. “Basically you’d get a database of content–for example, images. Then you have a query image that you want to search the database with. It’s not text–it’s an example of that content. And then it returns to you the top 10 or top 100 most similar-looking things as compared to that, via that signature, which is used to do the comparisons.”
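The following toy sketch illustrates that query-by-example idea. The byte-histogram “signature” below is a deliberately crude stand-in for whatever feature extractor a real deployment would use, and none of it reflects Versus’s actual interfaces:

```python
# Illustrative sketch of content-based, query-by-example search:
# extract a numerical signature, then rank stored items by similarity to a query.

import math
from typing import Dict, List, Tuple

def signature(data: bytes) -> List[float]:
    """Toy numerical representation: a normalized 256-bin byte histogram."""
    hist = [0.0] * 256
    for b in data:
        hist[b] += 1.0
    total = sum(hist) or 1.0
    return [v / total for v in hist]

def similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two signatures."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: bytes, database: Dict[str, bytes], k: int = 10) -> List[Tuple[str, float]]:
    """Return the k items whose signatures are most similar to the query's."""
    q = signature(query)
    scored = [(name, similarity(q, signature(content))) for name, content in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

db = {"sunset.jpg": b"\xff\xd8 sky sun", "forest.jpg": b"\xff\xd8 green trees", "memo.txt": b"meeting notes"}
print(top_k(b"\xff\xd8 orange sky", db, k=2))   # most similar items first
```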
With the Versus and Medici engines powering the DTS, users would have a quick way to add some semblance of context to all types of content, including audio, video, image, and text files, McHenry says. DTS will effectively enable users to build an indexing system based on a sample of a given piece of unstructured information.
McHenry likens this approach to building a DNS (Domain Name System) server for data. “The idea is you would just call the service once in a while on your data, and you would access it through your browser or applications,” he says. “And basically it would deal with these aspects of formats and unstructured data for you. Your application doesn’t have to worry about it itself. It just calls these services to get out keywords, to get out formats, and so forth.”
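As a rough illustration of that “just call the service” model, here is a minimal, hypothetical sketch of how an application might hand those chores off to DAP- and DTS-style web services. The URLs, endpoints, and JSON fields are assumptions made for illustration, not Brown Dog’s published API:

```python
# Minimal sketch of an application delegating format conversion and metadata
# extraction to remote services. Endpoints and response fields are hypothetical.

import requests

DAP_URL = "https://example.org/dap/convert"    # hypothetical format-conversion endpoint
DTS_URL = "https://example.org/dts/extract"    # hypothetical metadata-extraction endpoint

def convert(path: str, target_format: str) -> bytes:
    """Ask the proxy to translate a file into a format the local machine can open."""
    with open(path, "rb") as f:
        resp = requests.post(DAP_URL, params={"to": target_format}, files={"file": f})
    resp.raise_for_status()
    return resp.content

def extract_keywords(path: str) -> list:
    """Ask the tilling service for tags/keywords describing an unstructured file."""
    with open(path, "rb") as f:
        resp = requests.post(DTS_URL, files={"file": f})
    resp.raise_for_status()
    return resp.json().get("tags", [])

# The calling application never handles format details itself; it just asks the services.
```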
Users would no longer need to hunt for the right program to open an ancient image stored in an archaic file format. “In your browser, you don’t have to worry about file formats anymore. You don’t have to worry about keywords associated with images or unstructured data,” McHenry says. “It just kind of works, so file formats get automatically translated to what’s available on your machine. You could search through unstructured data like images just by typing in some text, or by giving it example images.”
This approach would bear structured fruit not just for the academic and scientific communities, but for any group that deals with large amounts of unstructured data–in other words, just about anybody who has struggled to work with it.
The capability to automatically give some semblance of structure to large amounts of unstructured and semi-structured data would have any number of uses in the big data analytics world. Much of the content generated on social media websites, after all, is unstructured. Many technologies have been developed to help users get a handle on these huge social data sets, including NoSQL databases and graph engines. Brown Dog could potentially provide one more way to tackle the problem.
“For us, big data isn’t just the fact that it’s a bunch of bytes. It’s also the fact that it’s uncurated data–data with no metadata, no useful file names, no useful directory structure, just a mess of files,” McHenry says. “It’s not big in terms of petabytes, but it’s big in the sense that a human can’t do what they need to do to curate it, to preserve that data.”
In addition to McHenry’s team at the NCSA, researchers at the University of Illinois at Urbana-Champaign, Boston University, and the University of North Carolina at Chapel Hill are working on Brown Dog. The NSF grant is good for five years, after which McHenry hopes Brown Dog will continue as an open source project.
Related Items:
NCSA Celebrates Big Computing and Big Data with Petascale Day
Trifacta Gets $12M to Refine Raw Data