January 22, 2013

Gnip’s Take on the Library of Twitter

Ian Armas Foster

Last week, we highlighted the Library of Congress’s effort to archive the entire Twitter database. The project, while daunting, has the potential to be hugely useful to researchers in the humanities as they explore the intricacies of human interaction in the social media age, such as the evolution of journalism and social uprisings.

The report released by the LOC mentioned social media data mining company Gnip, which is responsible for handling and delivering the Twitter data. Chris Moody, President and COO of Gnip, spoke to Datanami about the challenges and prospects of the Twitter-LOC project.

“The Library of Congress initiative aligns very much with what we’re trying to do,” Moody said. “Gnip was formed on the idea that we believe social data has unlimited value and near-limitless applications.”

While the effort to archive Twitter’s databases in the Library of Congress draws attention because of the names involved, Moody noted that the technological challenges in setting up the LOC’s archive are not much different from those Gnip handles for its private financiers.

The biggest difference, Moody said, lies in the type of research being done. When Moody mentions social data’s “near-limitless applications,” he is in part referring to potential future studies of disasters and government uprisings, the likes of which have already been proposed. “The Library project is particularly fun and interesting for us,” Moody said, “because it’s often the research community that’s digging into some of the more interesting, fascinating things that could potentially be done with this data, whether it’s studying disease outbreaks or disaster projects or political issues.”

One of the biggest challenges, Moody noted, was dealing with the evolving nature of Twitter. While the idea of posting sub-140-character tidbits has remained mostly the same over the last seven years, the types of metadata available and the intricacies of storing them have changed.

“In the earliest days, pulling together the archives in some kind of form was difficult for Twitter and the library,” said Moody. “Twitter as a service has evolved over time, what a tweet looked like in 2006 was very different from what a tweet looked like in 2008 which was different from what a tweet looked like in 2010. It’s a very dynamic data set and data model.”

The development of hashtags and of multiple forms of retweeting contributes to those dynamics: popular hashtags point to events that many users are following, while retweets record how those users interact.

The tweet database from 2006-2010 comprised 20 terabytes (when uncompressed from Gnip’s delivery) containing the text and metadata behind 21 billion tweets. Gnip continues to send the LOC Twitter’s archive on a six-month delay, and over 170 billion tweets and 133 terabytes have come in over the last couple of years.
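For a rough sense of scale, the figures above can be turned into an average storage footprint per tweet. The short Python sketch below is only a back-of-the-envelope illustration: it assumes decimal terabytes (10^12 bytes) and assumes each quoted volume covers exactly the quoted tweet count, neither of which the LOC report or Gnip confirms here. The function name `bytes_per_tweet` is hypothetical.

```python
# Back-of-the-envelope estimate of average bytes stored per tweet.
# Assumption: 1 terabyte = 10**12 bytes (decimal), and each volume figure
# corresponds exactly to the tweet count quoted alongside it.

BYTES_PER_TERABYTE = 10**12


def bytes_per_tweet(terabytes: float, tweet_count: int) -> float:
    """Return the average number of bytes per tweet (text plus metadata)."""
    return terabytes * BYTES_PER_TERABYTE / tweet_count


# Figures quoted in the article.
early_archive = bytes_per_tweet(20, 21 * 10**9)    # 2006-2010 archive
full_archive = bytes_per_tweet(133, 170 * 10**9)   # everything received so far

print(f"2006-2010 archive: about {early_archive:.0f} bytes per tweet")
print(f"Full archive: about {full_archive:.0f} bytes per tweet")
```

Either way, the estimate lands well above the 140 characters of text in a tweet, which fits Moody's point that the metadata is a large part of what is being archived.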

Currently, the LOC is having trouble archiving its Twitter database, with research queries taking 24 hours to run. We covered the topic at length in last week’s feature and, despite the looming growth of Twitter, the LOC seems confident that a solution will be found once archiving and processing power catch up with storage power.

Moody echoed that message, noting that Gnip’s efforts on that front were well spelled out by the Library.

Related Articles

Building the Library of Twitter

The Algorithmic Magic of Trendspotting

Twitter Flies by Hadoop on Search Quest

Applications: Data Mining, Research Analytics

Technologies: Storage

Sectors: Academia, Government

Tags: Gnip, Library of Congress, twitter
