Highlighting Business Signals on the Noisy Web
Twenty years ago, when the Web was still young, extracting useful information from it was a hit-and-miss affair. Today, the Web is multiple orders of magnitude bigger, and yet most of its data is useless for informing business decisions. To gain a competitive edge despite the clutter, organizations are adopting a new breed of information service that leverages powerful data analytics to extract useful bits from the Web’s wild firehose of data.
Penny Herscher is the CEO of one such information service provider, called FirstRain. The company gives its clients in sales and marketing organizations useful news about the world’s businesses, which it gleans by crawling the Internet for new pieces of information. Sources include the entire Twitter stream, social media sites, company websites, blogs, industry journals, news sites, real estate listings, and regulatory filings.
Weeding out the informational chaff from these sources is a FirstRain specialty. “The key to our technology is the signal to noise ratio, and getting rid of the noise,” Herscher tells Datanami in a phone interview. “Our solution takes the Web and social media, and analyzes it and breaks it down using NLP [natural language processing] algorithms to find out what’s changing about a business.”
FirstRain is not the only company analyzing the Internet using advanced text analytics algorithms and technologies. But it may be one of the more successful ones, considering that Dun & Bradstreet today announced that it is now offering FirstRain’s business analytic news stream as part of its Hoover’s, D&B 360, D&B Direct, and FirstResearch solutions. FirstRain subscriptions normally go for more than $80,000 per year.
FirstRain maintains an extensive IT operation dedicated to finding better and more efficient ways to extract information about millions of businesses and present it to clients, who are typically professionals in Fortune 2000 companies. Despite computers’ incredible data processing capabilities, they still cannot “think” like a human, so it takes some creative programming to segment and categorize Internet content–or any kind of content, for that matter–in a way that is useful to people.
Twitter, for example, is a potentially valuable source of actionable business information. For that reason, FirstRain subscribes to the entire Twitter stream, via Gnip. However, very few of the millions of tweets made every day are relevant to FirstRain subscribers, Herscher says.
“The only way to look at Twitter is it’s just noisy. It has all the crap in there,” she says. “We actually remove 99.8 percent of Tweets because they’re just junk. It’s people saying ‘I lost my luggage’ or ‘My Starbucks is too hot.’ What we do is say, ‘Here are the three or four tweets that you really need to see to understand how your prospect is being talked about on Twitter.'”
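The kind of filtering Herscher describes can be sketched as a simple relevance gate: drop tweets matching known junk patterns, keep only those mentioning business signals. The keyword lists, junk patterns, and sample tweets below are illustrative assumptions, not FirstRain’s actual rules, which involve far more sophisticated NLP.

```python
# Hedged sketch of heuristic tweet filtering; terms and patterns are
# invented for illustration, not drawn from FirstRain's system.

BUSINESS_TERMS = {"acquisition", "earnings", "merger", "guidance", "layoffs"}
JUNK_PATTERNS = ("i lost my luggage", "my starbucks")

def is_relevant(tweet: str) -> bool:
    """Keep a tweet only if it avoids junk patterns and mentions a business signal."""
    text = tweet.lower()
    if any(pattern in text for pattern in JUNK_PATTERNS):
        return False
    return any(term in text.split() for term in BUSINESS_TERMS)

tweets = [
    "I lost my luggage again, thanks a lot",
    "Acme Corp beats earnings estimates, raises guidance",
    "My Starbucks is too hot",
]
relevant = [t for t in tweets if is_relevant(t)]  # only the Acme Corp tweet survives
```

A production system would replace the keyword match with trained classifiers, but the shape is the same: the vast majority of the stream is discarded before any deeper analysis runs.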
Behind the fancy GUI presentations are several levels of technology doing the dirty work for FirstRain. This includes intelligent crawlers to discover updates to web pages; machine reading algorithms to identify keywords; relational and NoSQL databases to store the documents; heuristic processes to apply rules; artificial neural networks; and machine learning algorithms to refine the models.
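A stack like that is typically organized as a staged pipeline, with each document flowing through the layers in turn. The sketch below is a hypothetical illustration of that structure; the stage names, rules, and tracked companies are invented, not FirstRain’s implementation.

```python
# Illustrative staged pipeline: each stage is a plain function applied in order.
# All stage logic here is a hypothetical stand-in for the real system.

def extract_keywords(doc):
    # crude keyword stage: keep longer words as candidate terms
    doc["keywords"] = {w for w in doc["text"].lower().split() if len(w) > 4}
    return doc

def apply_heuristics(doc):
    # toy heuristic rule: flag documents mentioning a tracked company
    doc["flagged"] = bool(doc["keywords"] & {"chevron", "boeing"})
    return doc

PIPELINE = [extract_keywords, apply_heuristics]

def process(doc):
    for stage in PIPELINE:
        doc = stage(doc)
    return doc

result = process({"text": "Chevron announces quarterly dividend increase"})
```

Keeping stages as independent functions is what lets later machine learning passes refine or replace any one layer without disturbing the rest.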
Inside the FirstRain Service
Much of FirstRain’s data analytics challenge surrounds determining the “aboutness” of a given piece of new information, says FirstRain COO YY Lee.
“This whole idea of taking a fingerprint of aboutness and understanding aboutness–there’s tons of software that’s involved,” Lee says. “You need really fast categorizers. You need a combination of sequential and parallel processing. You need heuristics as well as black-box data science and semantic algorithms that work on it. But then on the analytics side, you need equal firepower on the algorithms that can weigh aboutness.”
While one algorithm may work well for describing the retail industry, for example, the same algorithm may do very poorly in the media market. “There’s an entire set of data stores that are simply about storing the dozens of techniques that we use to understand aboutness, and to weight what things are actually about,” Lee says.
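One way to picture why a scoring technique works in one industry and fails in another is to weight terms per industry. The weight tables below are invented for illustration; Lee describes dozens of stored techniques, not a single lookup table like this.

```python
# Hedged sketch of industry-specific "aboutness" weighting.
# Weight values and terms are assumptions made up for this example.

WEIGHTS = {
    "retail": {"store": 2.0, "inventory": 1.5, "merchandising": 2.5},
    "media":  {"streaming": 2.0, "subscribers": 2.5, "ratings": 1.5},
}

def aboutness(text: str, industry: str) -> float:
    """Sum the industry-specific weights of terms appearing in the text."""
    table = WEIGHTS[industry]
    return sum(table.get(word, 0.0) for word in text.lower().split())

doc = "streaming subscribers grew while ratings dipped"
# The same document scores 6.0 under the media model and 0.0 under retail.
```

The point of the example is the asymmetry: identical input, very different scores, which is why a system like this must store and select among many weighting techniques rather than rely on one.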
FirstRain creates a “structural mental model” of companies and their relationships, which helps the firm determine the relevancy of new pieces of data and how they describe what’s happening inside one company or between two or more companies. The software can also update itself based on the information collected, such as a company in the retail industry suddenly sharing information about something outside of that industry.
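A “structural mental model” of companies and relationships can be sketched as a small typed graph that grows as new signals arrive. The class, company names, and relation types below are hypothetical illustrations, not FirstRain’s data model.

```python
# Minimal sketch of a company-relationship model: nodes are companies,
# edges carry a relation type, and new information adds edges over time.
# All names and relation labels here are invented examples.

from collections import defaultdict

class CompanyGraph:
    def __init__(self):
        # company -> set of (relation, other_company) pairs
        self.edges = defaultdict(set)

    def relate(self, a, relation, b):
        """Record a typed, symmetric relationship between two companies."""
        self.edges[a].add((relation, b))
        self.edges[b].add((relation, a))

    def related(self, company):
        """Return every company linked to the given one, regardless of type."""
        return {other for _, other in self.edges[company]}

g = CompanyGraph()
g.relate("Acme Retail", "supplier", "Widget Co")
# a new signal arrives: the retailer moves outside its usual industry
g.relate("Acme Retail", "partner", "StreamMedia")
```

An update like the second `relate` call mirrors the article’s example of a retail company suddenly sharing information outside its industry: the model absorbs the new relationship rather than discarding it as noise.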
[A screenshot from Hoover’s featuring FirstRain info on Chevron]
“The software challenge was to create something that was flexible but usable, and extensible enough so it could actually reflect all the different types of structures and relationships,” Lee says. “The good news is lots of people are doing this kind of thing, so there’s a lot of great open source academic work to pull on.”
The company doesn’t permanently store any of this information. Instead, it keeps it just long enough to gauge its relevancy, then gives its clients pointers back to the source. And the information is only about businesses; FirstRain leaves the job of archiving and modeling the contents of the entire Internet to the Googles and Facebooks of the world, Lee says.
Not every relevant piece of information about a given company is going to be presented in FirstRain. There are lots of things that the company makes no effort to understand, such as ticket prices, lost luggage complaints of United Airlines customers, or why Miley Cyrus is popular. “But what we do deeply understand is what’s going on in companies, what’s implied as going on in and between companies, and therefore what’s happening between industries,” Lee says. “We know what’s trending. We know what’s different from the past. We know what’s clustering and what’s emerging.”