Last week, we highlighted the Library of Congress’s effort to archive the entire Twitter database. The project, while daunting, has the potential to be hugely useful to researchers in the humanities as they explore the intricacies of human interaction in the social media age, such as the evolution of journalism and social uprisings.
The report released by the LOC mentioned social media data mining company Gnip, which is responsible for handling and delivering the Twitter data. Chris Moody, President and COO of Gnip, spoke to Datanami about the challenges and future prospects of the Twitter-LOC project.
“The Library of Congress initiative aligns very much with what we’re trying to do,” Moody said. “Gnip was formed on the idea that we believe social data has unlimited value and near-limitless applications.”
While the effort to archive Twitter’s databases in the Library of Congress draws attention because of the name recognition involved, Moody noted that the technological challenges in setting up the LOC’s archive are not much different from those Gnip handles for its private financiers.
The biggest difference, Moody said, arises in the type of research that is being done. When Moody mentions social data’s “near-limitless applications,” he is in part referring to potential future studies on disasters and government uprisings, the likes of which have already been proposed. “The Library project is particularly fun and interesting for us,” Moody said, “because it’s often the research community that’s digging into some of the more interesting, fascinating things that could potentially be done with this data, whether it’s studying disease outbreaks or disaster projects or political issues.”
One of the biggest challenges, Moody noted, was dealing with the evolving nature of Twitter. While the idea of posting sub-140-character tidbits has remained mostly the same over the last seven years, the types of metadata available and the intricacies of storing them have changed.
“In the earliest days, pulling together the archives in some kind of form was difficult for Twitter and the library,” said Moody. “Twitter as a service has evolved over time, what a tweet looked like in 2006 was very different from what a tweet looked like in 2008 which was different from what a tweet looked like in 2010. It’s a very dynamic data set and data model.”
The development of hashtags and multiple forms of retweeting contributes to those dynamics: popular hashtags indicate events followed by many users, while retweets record how those users interact with one another.
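To make the shifting data model concrete, here is a minimal Python sketch of how an archive might fold tweets from different eras into one record shape. It is an illustration only, not Gnip’s or the Library’s actual pipeline: the field names (`id`, `text`, `hashtags`, `retweeted_id`) and the `NormalizedTweet` type are hypothetical, and the sketch simply assumes that older records carry only text while newer ones carry explicit metadata.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class NormalizedTweet:
    """One record shape for tweets from any era (hypothetical schema)."""
    tweet_id: str
    text: str
    hashtags: List[str] = field(default_factory=list)
    retweet_of: Optional[str] = None


def normalize(raw: Dict[str, Any]) -> NormalizedTweet:
    """Map a raw record onto NormalizedTweet, whatever fields it carries."""
    text = raw.get("text", "")
    if "hashtags" in raw:
        # Assumed newer-style record: hashtags arrive as explicit metadata.
        hashtags = [tag.lower() for tag in raw["hashtags"]]
    else:
        # Assumed older-style record: recover hashtags from the text itself.
        hashtags = []
        for word in text.split():
            if word.startswith("#"):
                tag = word[1:].rstrip(".,!?:;").lower()
                if tag:
                    hashtags.append(tag)
    # Assumed: only newer records link a retweet to its original tweet.
    retweet_of = raw.get("retweeted_id")
    return NormalizedTweet(
        tweet_id=str(raw["id"]),
        text=text,
        hashtags=hashtags,
        retweet_of=None if retweet_of is None else str(retweet_of),
    )


early = normalize({"id": 1, "text": "Watching the game #Playoffs!"})
later = normalize({
    "id": 2,
    "text": "RT: big news",
    "hashtags": ["News"],
    "retweeted_id": 1,
})
```

Filling absent fields with defaults instead of rejecting older records is one possible design choice; it keeps every era queryable through the same structure at the cost of some fields being empty for early tweets.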
The tweet database from 2006 to 2010 comprised 20 terabytes (when uncompressed from Gnip’s delivery) containing the text and metadata behind 21 billion tweets. Gnip continues to send the LOC Twitter’s archive on a six-month delay, with over 170 billion tweets and 133 terabytes having come in over the last couple of years.
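The figures quoted above allow a rough back-of-envelope estimate of how much storage each tweet occupies once metadata is included. The short Python sketch below does that arithmetic. It assumes decimal terabytes (10^12 bytes), and it treats the two pairs of figures as directly comparable, which may not hold: the 20-terabyte figure is stated as uncompressed, while the article does not say how the 133 terabytes were measured.

```python
TB = 10**12  # assumption: decimal terabytes, not binary tebibytes


def avg_bytes_per_tweet(terabytes: float, tweets: float) -> float:
    """Average storage per tweet, text and metadata combined."""
    return terabytes * TB / tweets


# 2006 to 2010 archive: 20 TB uncompressed, 21 billion tweets
early = avg_bytes_per_tweet(20, 21_000_000_000)

# Later figures: 133 TB, 170 billion tweets (treated here as a lower bound)
later = avg_bytes_per_tweet(133, 170_000_000_000)

print(f"2006-2010 archive: about {early:.0f} bytes per tweet")
print(f"Later figures:     about {later:.0f} bytes per tweet")
```

Either way, the estimate lands in the high hundreds of bytes per tweet, several times the 140 characters of text itself, which suggests that metadata makes up most of the archive’s volume.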
Currently, the LOC is having trouble making its Twitter archive usable, with research queries taking 24 hours to run. We covered the topic at length in last week’s feature and, despite the looming growth of Twitter, the LOC seems confident that a solution will be found once archiving and processing power catch up with storage capacity.
Moody echoed that message, noting that Gnip’s efforts on that front were well spelled out by the Library.