Language Flags

Translation Disclaimer

HPCwire HPC in the Cloud Digital Manufacturing Report Green Computing Report


September 20, 2012

The Algorithmic Magic of Trendspotting


Imagine if every company—from retailers to banks to healthcare outlets—was able to conjure, in real-time, what mattered most to users or consumers.

While large-scale in-house predictive analytics can deliver this in a focused way for some organizations, it’s not available to everyone at a price point that is easily swallowed. Additionally, the algorithmic complexity and scale can be a barrier, unless of course, you’re Twitter and your very purpose is to provide detailed trends for one of the world’s largest communities of users.

The trending topics feature is a valuable Twitter staple, delivering users a near-perfect list of ten topics based on interests and locations—not to mention an instant view of news that is breaking big. Princeton researcher and slinger of data mining algorithms and Twitter, Kostas Tsioutsiouliklis explained how Twitter comes up with those trending topics in a lecture at UC-Berkeley, focusing on the algorithms and sort methods that come up with the ten most relevant topics out of millions.

Before going into what specific algorithms and formulas Twitter uses, it is important to establish parameters. Whatever Twitter does needs to happen quickly. These trending topics change every couple of minutes as new events happen and new memes are created. They also need to account for location and personality.

Canadian pop sensation Justin Bieber is a popular Twitter figure. So too is former mercurial football player Chad Ochocinco/Johnson. At any given time, those two could count themselves among the top trending Twitter topics if Bieber is giving a convert in New York while Ochocinco/Johnson is getting married to a reality television star he met a few months before.

While there will be some crossover, it is unlikely that the majority of Bieber followers also follow Chad and vice versa. Thus, Twitter had to come up with something

According to Tsioutsiouliklis, the method that determines which users should see which trending topics correlates with the method that clumps topics together at an event. This method is called ‘online clustering’ and is accomplished through two different algorithms: TFIDF and MinHash.

TFIDF stands for Term Frequency/Inverse Document Frequency and is actually a relatively simple mathematical function. The function boils down to taking the logarithm of term frequency with respect to terms and documents, adding one to it, and multiplying it by the inverse document frequency. Put simply, the key terms of a group of documents can be found by simply identifying their frequency versus the document count.

As Tsioutsiouliklis explains in the video below, this becomes problematic when the document base is expanding with as great a rate as Twitter’s. The TFIDF algorithm, while simple to understand, is slow and difficult to execute when so many documents are coming in at a time.

This is where an algorithm called MinHash comes in. MinHash is a similarity detection algorithm that assigns hexadecimal storable signatures to each term and determines if those signatures happen with frequency. These signatures determine which cluster a given document should belong to. If certain signatures are consistently being grouped together, like President Obama and Hurricane Isaac or Brad Pitt and Angelina Jolie, those things can be grouped together and co-trend as an event.

NEXT – Untangling the Twitter Code >>

 

Similarly, those signatures can be placed in larger categories, such as sports, North Carolina, or pop music. Thus, an American soccer fan should not get too excited when eight of the ten nationwide trending topics deal with the European Championships in June. It is more an indication that that person cares about soccer (and Twitter gathers this from simply looking at the content of the individual’s tweets and the people that they follow) rather than the entire country.

How does Twitter know which terms are popular in the first place? Simple counting will not do, as common words like “the” and “a” would come up too often. Even if a dictionary were compiled of those simple terms, eliminating them from contention, others are missed, such as contractions and partial phrases.

Something like a ratio that compares the average use of a word versus the current usage is more accurate, but also subject to folly. For example, if a topic is barely mentioned, say five times a day, but is mentioned 25 times on a particular day, that topic would be more “trending” than one whose mentions went from 5,000 to 7,000.

The solution, according to Tsioutsiouliklis, was to create a probability distribution model and chi-square test it. This means that, based on the past month’s worth of tweets, Twitter creates an evolving probability distribution of what percent of tweets they expect to see on a certain subject. They then compare that to the observed result, square the difference (eliminating signs), and are provided with a chi-square score for the topic.

For example, Twitter might expect 30% of a given day’s tweets to be about President Obama. If Obama says/does something inspiring or ridiculous and he ends up occupying 50% of the day’s tweets, his (Expected-Observed)2 would be 400. On the other hand, if the New York Football Giants win a game, pushing their observed from 3% to 5%, its score would be significantly lower, even though it went up proportionally the same.

This method also takes care of random Twitter memes that prop up throughout the day, at least from a national or worldwide perspective. In Greensboro, North Carolina, however, the meme #10PeopleImGladIMet is one of the top trending topics. As Tsioutsiouliklis explains, this has a lot to do with a meme’s not having existed before in any form before, skewing its ratio. While the square-chi formula eliminates that nationally, they can still thrive in places where the overall number of tweets is relatively small. However, according to Tsioutsiouliklis, this is not always a bad thing.

One problem that persists that is difficult to head off are periodic trends. For example, #FF (Follow Friday) defeats its daily average quite handily on Fridays. Further, many Tweeters like to say “good morning” to the Twitterverse when they wake up, propping up “Good morning” stats every day. A solution to this would be to, for each individual topic, create a model of when those periods happen and rate the mentions against the periodic peaks instead of the overall average.

However, one then invariably ends up doing this for nearly every topic. Tsioutsiouliklis noted that they tried to do this and succeeded at Yahoo, but it was very expensive and involved a Hadoop cluster.

Related Stories

What Sociologists Say About Big Data

The NSA, Big Data, and "Total Information Awareness"

Are Predictive Analytics Overrated?

Share Options


Subscribe

» Subscribe to our weekly e-newsletter


Discussion

There are 0 discussion items posted.

 
Xyratex

Sponsored Links

Sponsored Whitepapers

Parallel Performance of the IMSL C Numerical Library with OpenMP

05/21/2013 | Rogue Wave Software

Download whitepaper containing benchmark results depicting the speedup achieved as a result of incorporating OpenMP directives in the IMSL C Numerical Library, for portable, cross platform analytics.

Download this Whitepaper...

Best Practices in Big Data Storage - Sponsored by Cleversafe, Cray, DDN, NetApp, & Panasas

05/10/2013 | Cleversafe, Cray, DDN, NetApp, & Panasas

From Wall Street to Hollywood, drug discovery to homeland security, companies and organizations of all sizes and stripes are coming face to face with the challenges – and opportunities – afforded by Big Data. Before anyone can utilize these extraordinary data repositories, however, they must first harness and manage their data stores, and do so utilizing technologies that underscore affordability, security, and scalability.

Download this Whitepaper...

View the White Paper Library

Sponsored Multimedia

SGI President and CEO, Jorge Titinger, on Big Data

SGI President and CEO, Jorge Titinger, talks about SGI's history and leadership in HPC and how that has converged into Big Data Solutions.

View Multimedia

Cray CS300-AC Cluster Supercomputer Air Cooling Technology Video

The Cray CS300-AC cluster supercomputer offers energy efficient, air-cooled design based on modular, industry-standard platforms featuring the latest processor and network technologies and a wide range of datacenter cooling requirements.

View Multimedia

More Multimedia



Job Bank

Datanami Conferences Ad

Featured Events

May 22-23, 2013
Business Intelligence Innovation Summit
Chicago, IL
United States

June 4-4, 2013
The Economist's Information Forum
San Francisco, CA
United States

June 10-13, 2013
Cloud & Big Data Expo
New York City, NY
United States

June 19-20, 2013
GigaOM Structure
San Francisco, CA
United States

June 26-27, 2013
2013 Hadoop Summit
San Jose, CA
United States

June 26-27, 2013
Big Data World Congress
London
United Kingdom

» View/Search Events

» Post an Event