Customizing the Internet, One User at a Time
You’ve no doubt heard the statistic that a full 90 percent of all the data in the world has been generated in the last two years. In 2012, humans created 2.5 quintillion bytes of data every day. Every minute of every hour, Google’s servers process more than 2 million search queries, Facebook scans 3.5 terabytes of data, and some 277,000 tweets are sent. The global Internet population now stands at 2.1 billion people, all contributing to the ever-widening digital footprint.
With great swarms of data coming and going in all directions, how do people go about finding the content that they want to see? How do sites like Yahoo and Facebook know what articles and ads will get people’s attention?
To help address questions like these, Lehigh University researchers developed a technique that studies the behavior of social media users. Based on a small sample of online activity, the researchers were able to predict the types of content users would like to see.
“We process terabytes of data every hour,” says Liangjie Hong, who earned his PhD in computer science at Lehigh University and is now a research scientist at Yahoo Labs. “You cannot consume it all.”
Brian Davison, associate professor of computer science and engineering and head of Lehigh’s Web Understanding, Modeling and Evaluation (WUME) laboratory, concurs. For the engaged social media user, it’s nearly impossible to keep up with all the feeds and messages coming in, he says. Davison knows about social media overload firsthand. Currently on sabbatical from Lehigh, he is working in the data science group at Facebook, where being a social networking power user comes with the territory.
The project has been evolving since 2010. First, Davison and Hong developed an algorithm to predict how often the recipients of tweets would pass along (aka “retweet”) messages to their own followers. For this effort, the researchers were awarded best poster paper at the 2011 World Wide Web Conference.
They then changed their focus to analyze how a user responded to incoming information. “If we could record a user’s activities for 24 hours,” says Hong, “we would know exactly what they are looking for.”
They developed co-factorization machines that use a mathematical analysis method to examine how social media users interact with tweets. For example, do they reply? Do they retweet? Which tweets do they mark as favorites? The technique would also expose user interests based on, for example, the frequency with which certain terms appeared in their feeds.
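For readers curious about the math, the sketch below scores a single tweet with a plain second-order factorization machine, the general model family that co-factorization machines build on. It is an illustration only, not the researchers' implementation: the function name `fm_score`, the three example features, and every weight are invented here. The idea it shows is that each feature (a user, an author, a term) gets a small vector, and the score rises when the vectors of features that appear together point in similar directions.

```python
from typing import List


def fm_score(x: List[float], w0: float, w: List[float], V: List[List[float]]) -> float:
    """Score one feature vector with a second-order factorization machine.

    score = w0 + sum_i w[i]*x[i] + sum_{i<j} dot(V[i], V[j]) * x[i] * x[j]
    The pairwise term is computed with the usual linear-time identity.
    """
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    k = len(V[0]) if V else 0  # number of latent dimensions per feature
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Hypothetical features: [reader is Alice, author is Bob, tweet mentions "football"]
x = [1.0, 1.0, 1.0]
w0 = 0.1                     # made-up global bias
w = [0.0, 0.2, -0.1]         # made-up per-feature weights
V = [[1.0, 0.0],             # made-up latent vectors, one row per feature
     [0.5, 0.5],
     [1.0, 1.0]]

# A higher score would suggest a higher chance that Alice engages with the tweet.
print(round(fm_score(x, w0, w, V), 3))
```

In a real system the weights and vectors would be learned from logged interactions rather than written by hand, and a separate score could be produced for each kind of response, such as a reply, a retweet or a favorite.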
“If we can better understand what you are interested in,” says Davison, “we can decide what to filter, rank higher, or flag for your attention.”
Davison and Hong developed the algorithms using a machine learning approach. Instead of explicitly programming rules to assess how users respond to tweets, the algorithms are trained with data sets. In this case, the algorithms used past interactions to build and refine rules for individual users. A published description of their work was a finalist for best paper award at ACM’s Sixth International Conference on Web Search and Data Mining (WSDM) in Rome in 2013.
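To make the idea of learning from past interactions concrete, here is a minimal sketch that trains a tiny per-user model with logistic regression and stochastic gradient descent. This is a stand-in technique chosen for brevity, not the method described in the researchers' paper, and the names `train_user_model` and `predict`, the sample history, and the learning settings are all assumptions made for the example.

```python
import math
from typing import Dict, List, Set, Tuple

History = List[Tuple[Set[str], int]]  # (terms in a tweet, 1 if the user engaged else 0)


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_user_model(history: History, epochs: int = 200, lr: float = 0.1) -> Tuple[Dict[str, float], float]:
    """Learn one weight per term from a user's past reactions."""
    weights: Dict[str, float] = {}
    bias = 0.0
    for _ in range(epochs):
        for terms, engaged in history:
            p = _sigmoid(bias + sum(weights.get(t, 0.0) for t in terms))
            err = engaged - p  # positive if the model under-predicted engagement
            bias += lr * err
            for t in terms:
                weights[t] = weights.get(t, 0.0) + lr * err
    return weights, bias


def predict(terms: Set[str], weights: Dict[str, float], bias: float) -> float:
    """Estimated probability that the user engages with a tweet containing these terms."""
    return _sigmoid(bias + sum(weights.get(t, 0.0) for t in terms))


# Invented history for one hypothetical user: retweets football, ignores celebrity news.
history: History = [
    ({"football", "score"}, 1),
    ({"football", "transfer"}, 1),
    ({"celebrity", "gossip"}, 0),
    ({"celebrity", "award"}, 0),
]
weights, bias = train_user_model(history)
print(predict({"football"}, weights, bias) > predict({"celebrity"}, weights, bias))
```

No rule about football or celebrities is written into the program. The preference emerges from the logged examples, which is the distinction the article draws between trained algorithms and explicitly programmed rules.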
The project has many implications, from refining social media feeds so you don’t miss the posts and messages you really want to see, to helping news outlets provide personalized content.
Hong explains that while finding patterns in terabytes of data is a tremendous challenge, mining smaller data flows for individual patterns can be done more easily. A specific user is only interested in a tiny fraction of the Webosphere, so why not navigate the problem one narrow slice at a time? That’s what their approach does.
Recognizing the potential for customization to intensify the “filter bubble” effect, in which personalized news feeds create an echo chamber that omits important information and differing viewpoints, Davison is also undertaking another study to examine the way that people perceive bias in online news.
For now, Web companies will continue to deliver content based on a mixture of personalization and popularity, and it will be up to users to diversify their news feeds.