Building the Library of Twitter
On an average day, people around the globe contribute 500 million messages to Twitter. Collecting and storing every single tweet and its resulting metadata from even a single day would be a daunting task in and of itself.
The Library of Congress is trying something slightly more ambitious than that: storing and indexing every tweet ever posted.
With the help of social media facilitator Gnip, the Library of Congress aims to create an archive where researchers can access any tweet recorded since Twitter’s inception in 2006.
According to this update on the progress of the seemingly herculean project, the LOC has already archived 170 billion tweets and their respective metadata. That total includes the posts from 2006-2010, which Gnip compressed and sent to the LOC in three files of 2.3 terabytes each. When the LOC uncompressed the files, they filled 20 terabytes of server space, representing 21 billion tweets and their 50 supplementary metadata fields.
It is often said that 90% of the world’s data has accrued over the last two years. That is remarkably close to the truth for Twitter: an additional 150 billion tweets (88% of the total) poured into the LOC archive in 2011 and 2012. Further, Gnip delivers hourly updates to the tune of half a billion tweets a day, which means 42 days’ worth of 2012-2013 tweets equals the entire total from 2006-2010. In all, the LOC is dealing with 133.2 terabytes of information.
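The figures above can be sanity-checked with some back-of-envelope arithmetic. This sketch simply recomputes the 42-day equivalence and the 2011-2012 share from the numbers the report cites; it is an illustration, not anything stated in the report itself:

```python
# Sanity check of the archive figures cited above.

TWEETS_PER_DAY = 500_000_000        # roughly half a billion tweets a day
TWEETS_2006_2010 = 21_000_000_000   # the initial 2006-2010 batch
TWEETS_2011_2012 = 150_000_000_000  # added to the archive in 2011-2012
TOTAL_TWEETS = 170_000_000_000      # total archived as of the report

# How many days of current traffic equal the entire 2006-2010 archive?
days = TWEETS_2006_2010 / TWEETS_PER_DAY
print(f"{days:.0f} days of traffic == the 2006-2010 archive")  # 42 days

# What share of the total archive arrived in 2011-2012?
share = TWEETS_2011_2012 / TOTAL_TWEETS
print(f"{share:.0%} of the archive is from 2011-2012")  # 88%
```

The two results match the 42-day and 88% figures quoted in the article.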
While receiving and storing these tweets is not necessarily a problem, archiving and indexing them so that they are useful to researchers can be. The LOC stores all of its digital data on tape, which is good for durable long-term storage but not ideal for file recall at such a large scale. The report notes that a single search of the Twitter archive currently takes 24 hours, an unacceptable figure for a researcher hoping to publish a study in any reasonable amount of time.
One of the primary issues lies in determining how to organize the information. The LOC system for organizing texts is relatively efficient and intuitive. Many have encountered and navigated the system, an alphanumeric ordering based on field of study, subfield of study, and year published, through various university libraries. It is a system designed so that humans can navigate large towers of texts with relative ease.
It would take too large a team and too much money for a public institution to classify every tweet in existence under the LOC ordering system. Whatever system they come up with must therefore be navigable by a computer program and accessible to researchers who are not necessarily well-versed in computer science.
By all accounts, the LOC plans on taking its time in coming up with the right system. “It is not uncommon,” noted the report, “for the Library to spend months or in some cases years sorting a large acquisition to inventory, organize and catalogue the information and materials so they are accessible by researchers.”
In this case, those years may be a necessity. Plenty of well-backed financial institutions and larger organizations have the resources to search the entirety of Twitter for trend spotting and research. Indeed, some have turned to Gnip to deliver and provide a framework for searching the 170 billion tweets in existence.
As a public entity, the LOC does not have such resources. “To achieve a significant reduction of search time, however, would require an extensive infrastructure of hundreds if not thousands of servers. This is cost prohibitive and impractical for a public institution.”
This resource gap makes sense from a certain angle. There is significant financial incentive for corporations to examine public sentiment on Twitter. Competing large-scale online retailers look for any edge that pushes an individual customer toward a product they might buy. That edge, measured over several million visitors per year, could be worth millions of dollars.
On the other hand, the LOC sees the Twitter archive as simply another description of the world as it happens. The report argues that modern journalism is being supplemented, and in some instances replaced, by social media reports. Those recording history are increasingly doing so in real time over Twitter.
Historical studies based on those archives promise to be intriguing; the LOC has already received a research proposal to study the role of citizens on social media in events such as terrorist attacks and state revolutions. The use of Facebook and Twitter to disseminate information is thought to have contributed significantly to the Arab Spring uprisings in Egypt and Libya. It is also believed that Syrian dictator Bashar al-Assad managed to shut down all IP addresses in Syria to prevent both communication among rebels inside the country and international journalism from Damascus.
Social media-based studies, particularly ones that would be based on a full tweet archive, could tap into how Twitter has influenced political dynamics and how it could continue to do so.
As it stands now, however, the LOC does not have the resources to permit efficient access to the Twitter archive. Through their partnership with Gnip, they will reportedly continue to explore private partnerships to foster “research and scholarship focused interfaces.”
The Library of Congress expects to make a limited version available for research access early this year. In the long term, they are confident that the ability to process, archive, and index the data will eventually match the ability to collect, send, and store it. However, with 20 terabytes coming in every six weeks, a figure sure to rise steadily over the ensuing months and years, the Twitter archive may hit the petabyte scale by the time they are able to manage the terabyte range.
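That petabyte claim can be roughed out as well. The sketch below assumes a flat intake of 20 terabytes every six weeks, which understates the growth the article anticipates, so the result is best read as an upper bound on the time remaining:

```python
# Rough projection: how long until the archive reaches a petabyte,
# assuming a constant intake of 20 TB every six weeks? (The article
# expects the rate to rise, so the real figure would be shorter.)

CURRENT_TB = 133.2       # archive size cited in the report
TB_PER_CYCLE = 20.0      # uncompressed intake per six-week delivery cycle
PETABYTE_TB = 1000.0     # 1 PB = 1000 TB (decimal convention)

weeks = (PETABYTE_TB - CURRENT_TB) / TB_PER_CYCLE * 6
print(f"about {weeks / 52:.1f} years at a flat rate")  # about 5.0 years
```

Even under this conservative assumption, the archive crosses the petabyte mark within a few years, which is why matching indexing capacity to intake capacity is the real race.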
If, however, they eventually manage to index their Twitter archive intuitively, the value to the humanities research community would be immense.