Driving MapReduce into the Semantic Web
At the turn of the century, a revolution took place on the infant World Wide Web that transformed static pages into interactive, collaborative portals, spawning the social media era that permeates daily life today. That transition was called Web 2.0, and there's an open discussion about whether semantic data will drive the next one. A group at the University of Freiburg is working on ways for MapReduce to help pave the way.
At the 12th International Semantic Web Conference (ISWC) in Sydney, Australia later this month, researcher Alexander Schätzle says his group will be presenting PigSPARQL, a SPARQL query processing system built on top of MapReduce. This new system, they hope, will open the semantic web to people who are not MapReduce programmers.
For those who aren't familiar with the concept of the semantic web, it's essentially an effort to make the World Wide Web machine-readable. Using metadata and algorithms that help computers understand what content people are searching for, the machines can then MapReduce their way to delivering it.
It does this using a family of W3C specifications known as the Resource Description Framework (RDF), essentially a metadata framework that can be used to identify and link data to other data, giving the machine intelligence a handle. While RDF and its underlying data model, the RDF triple, are outside the scope of this article, the important thing to understand is that RDF data is growing rapidly, and the potential for the semantic web is burgeoning before us (despite a few flies in the ointment, which we'll discuss shortly).
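To make the triple idea concrete, here is a minimal sketch of the RDF data model using plain Python tuples. The URIs and facts below are invented for illustration; real RDF uses full IRIs and serializations like Turtle, but the subject-predicate-object shape is the same.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# "ex:" stands in for a namespace prefix; these triples are made up.
triples = [
    ("ex:Freiburg",    "ex:locatedIn",     "ex:Germany"),
    ("ex:Freiburg",    "ex:hasUniversity", "ex:UniFreiburg"),
    ("ex:UniFreiburg", "ex:researches",    "ex:PigSPARQL"),
]

def match(triples, s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Everything we know about Freiburg:
print(match(triples, s="ex:Freiburg"))
```

The "links" between datasets mentioned below are themselves just triples whose object lives in another dataset, which is what lets machines follow data across the web.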
To demonstrate the move in this direction, consider that the data in the Linking Open Data community project has grown from 12 datasets in 2007 to about 300 datasets today. Contained in these open datasets are over 31 billion RDF triples, connected by around 504 million RDF links. This is truly big data on a grand scale, and as it happens, MapReduce is both the enabler of and an inhibitor to progress in this arena.
While it takes MapReduce to manage this enormous amount of data, it's no big surprise to anyone that MapReduce is hard to use. "MapReduce means writing a lot of code – especially Java code – and you have to reinvent the wheel because common operations like joins do not exist out of the box," Schätzle says. This problem is as old as MapReduce itself, which is why Yahoo! developed Pig back in 2006 to wick away the complexity from the MapReduce soup, making it easier to use and opening it up to a wider user base.
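Schätzle's point about joins is worth unpacking. In raw MapReduce, a join between two datasets is typically hand-rolled as a "reduce-side join": the mapper tags every record with its source, and the reducer pairs up records sharing a key. The sketch below simulates those two phases in plain Python with made-up data; a real Hadoop job would need this same logic spread across mapper and reducer classes, plus job configuration.

```python
from collections import defaultdict

# Two illustrative datasets to join on the city name.
people = [("alice", "Freiburg"), ("bob", "Sydney")]          # (name, city)
cities = [("Freiburg", "Germany"), ("Sydney", "Australia")]  # (city, country)

def map_phase():
    # Emit (join_key, (source_tag, payload)) from both inputs,
    # tagging each record so the reducer knows where it came from.
    for name, city in people:
        yield city, ("people", name)
    for city, country in cities:
        yield city, ("cities", country)

def reduce_phase(pairs):
    # Group by key (the shuffle), then pair records across sources.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    for city, values in grouped.items():
        names = [v for tag, v in values if tag == "people"]
        countries = [v for tag, v in values if tag == "cities"]
        for name in names:
            for country in countries:
                yield name, city, country

print(sorted(reduce_phase(map_phase())))
```

In Pig Latin the whole thing collapses to a single JOIN statement, which is exactly the complexity Pig was built to wick away.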
The movement toward the semantic web now finds itself at a similar junction. While SPARQL is the query language that best lends itself to RDF data, it does not translate easily into Pig Latin, and thus into an easy MapReduce programming experience. Pig Latin is an imperative language, while SPARQL is declarative, similar to SQL.
However, Schätzle says his research group has developed an answer for this, which they're calling PigSPARQL. The project, he says, lets developers express every SPARQL operator through an equivalent sequence of Pig Latin expressions, thus making the SPARQL query language executable (through Pig) on Hadoop out of the box.
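The core idea behind any such translation can be sketched briefly: each triple pattern in a SPARQL query becomes a selection over the triple table, and patterns that share a variable become a join, which Pig then compiles to MapReduce jobs. The toy evaluator below illustrates that general technique in Python over an in-memory triple list; it is an illustration of the principle, not PigSPARQL's actual translation rules, and the data is invented.

```python
# Toy triple store; "ex:" is an illustrative namespace prefix.
triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob",   "ex:knows", "ex:carol"),
    ("ex:alice", "ex:age",   "30"),
]

def eval_pattern(pattern):
    """Match one triple pattern against the store.
    Terms starting with '?' are variables; returns binding dicts."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                binding[term] = value
            elif term != value:
                break
        else:
            results.append(binding)
    return results

def join(left, right):
    """Join two binding sets on their shared variables
    (what Pig would express as a JOIN between two relations)."""
    out = []
    for l in left:
        for r in right:
            if all(l[v] == r[v] for v in l.keys() & r.keys()):
                out.append({**l, **r})
    return out

# Roughly: SELECT * WHERE { ?x ex:knows ?y . ?y ex:knows ?z }
result = join(eval_pattern(("?x", "ex:knows", "?y")),
              eval_pattern(("?y", "ex:knows", "?z")))
print(result)
```

The two-pattern query finds friend-of-a-friend chains; the shared variable ?y is the join key, just as it would be in the Pig Latin plan a translator emits.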
If it works, it has the potential to open up a whole new world of web development, taking MapReduce from a metaphorical dirt road to a four-lane highway by giving developers in the semantic web space broader access to its functionality.
The group will be presenting its project findings later this month at the ISWC show in Sydney. In the meantime, here is a video describing the project in greater detail: