Imagine if every company—from retailers to banks to healthcare outlets—was able to conjure, in real-time, what mattered most to users or consumers.
While large-scale in-house predictive analytics can deliver this in a focused way for some organizations, it’s not available to everyone at a price point that is easily swallowed. Additionally, the algorithmic complexity and scale can be a barrier, unless of course, you’re Twitter and your very purpose is to provide detailed trends for one of the world’s largest communities of users.
The trending topics feature is a valuable Twitter staple, delivering users a near-perfect list of ten topics based on interests and locations—not to mention an instant view of news that is breaking big. Princeton researcher and slinger of data mining algorithms and Twitter, Kostas Tsioutsiouliklis explained how Twitter comes up with those trending topics in a lecture at UC-Berkeley, focusing on the algorithms and sort methods that come up with the ten most relevant topics out of millions.
Before going into what specific algorithms and formulas Twitter uses, it is important to establish parameters. Whatever Twitter does needs to happen quickly. These trending topics change every couple of minutes as new events happen and new memes are created. They also need to account for location and personality.
Canadian pop sensation Justin Bieber is a popular Twitter figure. So too is former mercurial football player Chad Ochocinco/Johnson. At any given time, those two could count themselves among the top trending Twitter topics if Bieber is giving a convert in New York while Ochocinco/Johnson is getting married to a reality television star he met a few months before.
While there will be some crossover, it is unlikely that the majority of Bieber followers also follow Chad and vice versa. Thus, Twitter had to come up with something
According to Tsioutsiouliklis, the method that determines which users should see which trending topics correlates with the method that clumps topics together at an event. This method is called ‘online clustering’ and is accomplished through two different algorithms: TFIDF and MinHash.
TFIDF stands for Term Frequency/Inverse Document Frequency and is actually a relatively simple mathematical function. The function boils down to taking the logarithm of term frequency with respect to terms and documents, adding one to it, and multiplying it by the inverse document frequency. Put simply, the key terms of a group of documents can be found by simply identifying their frequency versus the document count.
As Tsioutsiouliklis explains in the video below, this becomes problematic when the document base is expanding with as great a rate as Twitter’s. The TFIDF algorithm, while simple to understand, is slow and difficult to execute when so many documents are coming in at a time.
This is where an algorithm called MinHash comes in. MinHash is a similarity detection algorithm that assigns hexadecimal storable signatures to each term and determines if those signatures happen with frequency. These signatures determine which cluster a given document should belong to. If certain signatures are consistently being grouped together, like President Obama and Hurricane Isaac or Brad Pitt and Angelina Jolie, those things can be grouped together and co-trend as an event.
Similarly, those signatures can be placed in larger categories, such as sports, North Carolina, or pop music. Thus, an American soccer fan should not get too excited when eight of the ten nationwide trending topics deal with the European Championships in June. It is more an indication that that person cares about soccer (and Twitter gathers this from simply looking at the content of the individual’s tweets and the people that they follow) rather than the entire country.
How does Twitter know which terms are popular in the first place? Simple counting will not do, as common words like “the” and “a” would come up too often. Even if a dictionary were compiled of those simple terms, eliminating them from contention, others are missed, such as contractions and partial phrases.
Something like a ratio that compares the average use of a word versus the current usage is more accurate, but also subject to folly. For example, if a topic is barely mentioned, say five times a day, but is mentioned 25 times on a particular day, that topic would be more “trending” than one whose mentions went from 5,000 to 7,000.
The solution, according to Tsioutsiouliklis, was to create a probability distribution model and chi-square test it. This means that, based on the past month’s worth of tweets, Twitter creates an evolving probability distribution of what percent of tweets they expect to see on a certain subject. They then compare that to the observed result, square the difference (eliminating signs), and are provided with a chi-square score for the topic.
For example, Twitter might expect 30% of a given day’s tweets to be about President Obama. If Obama says/does something inspiring or ridiculous and he ends up occupying 50% of the day’s tweets, his (Expected-Observed)2 would be 400. On the other hand, if the New York Football Giants win a game, pushing their observed from 3% to 5%, its score would be significantly lower, even though it went up proportionally the same.
This method also takes care of random Twitter memes that prop up throughout the day, at least from a national or worldwide perspective. In Greensboro, North Carolina, however, the meme #10PeopleImGladIMet is one of the top trending topics. As Tsioutsiouliklis explains, this has a lot to do with a meme’s not having existed before in any form before, skewing its ratio. While the square-chi formula eliminates that nationally, they can still thrive in places where the overall number of tweets is relatively small. However, according to Tsioutsiouliklis, this is not always a bad thing.
One problem that persists that is difficult to head off are periodic trends. For example, #FF (Follow Friday) defeats its daily average quite handily on Fridays. Further, many Tweeters like to say “good morning” to the Twitterverse when they wake up, propping up “Good morning” stats every day. A solution to this would be to, for each individual topic, create a model of when those periods happen and rate the mentions against the periodic peaks instead of the overall average.
However, one then invariably ends up doing this for nearly every topic. Tsioutsiouliklis noted that they tried to do this and succeeded at Yahoo, but it was very expensive and involved a Hadoop cluster.