January 22, 2013

Gnip's Take on the Library of Twitter


Last week, we highlighted the Library of Congress’s effort to archive the entire Twitter database. The project, while daunting, has the potential to be hugely useful to researchers in the humanities as they explore the intricacies of human interaction in the social media age, such as the evolution of journalism and social uprisings.

The report released by the LOC mentioned social media data mining company Gnip, which is responsible for handling and delivering the Twitter data. Chris Moody, President and COO of Gnip, spoke to Datanami about the challenges and prospects of the Twitter-LOC project.

“The Library of Congress initiative aligns very much with what we’re trying to do,” Moody said. “Gnip was formed on the idea that we believe social data has unlimited value and near-limitless applications.”

While the effort to archive Twitter’s databases in the Library of Congress has drawn attention because of the name recognition, Moody noted that the technological challenges in setting up the LOC’s archive are not much different from those Gnip handles for its private financiers.

The biggest difference, Moody said, arises in the type of research that is being done. When Moody mentions social data’s “near-limitless applications,” he is in part referring to potential future studies of disasters and government uprisings, the likes of which have already been proposed. “The Library project is particularly fun and interesting for us,” Moody said, “because it’s often the research community that’s digging into some of the more interesting, fascinating things that could potentially be done with this data, whether it’s studying disease outbreaks or disaster projects or political issues.”

One of the biggest challenges, Moody noted, was dealing with the evolving nature of Twitter. While the idea of posting sub-140-character tidbits has remained mostly the same over the last seven years, the types of metadata available and the intricacies of storing them have changed.

“In the earliest days, pulling together the archives in some kind of form was difficult for Twitter and the library,” said Moody. “Twitter as a service has evolved over time, what a tweet looked like in 2006 was very different from what a tweet looked like in 2008 which was different from what a tweet looked like in 2010. It’s a very dynamic data set and data model.”

The development of hashtags and multiple forms of retweeting contributes to those dynamics: popular hashtags indicate an event followed by many users, while retweets record how those users interact.

The tweet database from 2006 to 2010 comprised 20 terabytes (when uncompressed from Gnip’s delivery) containing the text and metadata behind 21 billion tweets. Gnip continues to send the LOC Twitter’s archive on a six-month delay, with over 170 billion tweets and 133 terabytes having come in over the last couple of years.

Currently, the LOC is having trouble archiving its Twitter database, with research queries taking 24 hours to run. We covered the topic at length in last week’s feature and, despite the looming growth of Twitter, the LOC seems confident that a solution will be found once archiving and processing power catch up with storage power.

Moody echoed that message, noting that Gnip’s efforts on that front were well spelled out by the Library.
