The Graph That Knows the World
Somewhere in a data center in Fremont, California, exists a large computer cluster that’s hoovering up every piece of data it can find from the Web and using machine learning algorithms to find connections among them. It’s arguably the largest known graph database in existence, encompassing 10 billion entities and 10 trillion edges.
No, it’s not some secret government project to catalog the world’s information. The graph was created and is run by a private company called Diffbot, and you can get access to it for as little as $300 per month.
You can’t accuse Mike Tung, the founder and CEO of Diffbot, of thinking small, or beating around the bush for that matter. During an interview last week, he got right to the point. “The purpose of our company,” he tells Datanami, “is to build the first comprehensive map of all human knowledge.”
That might sound like a crazy thing to do in 2018, a quarter century after the Web went mainstream, after the first dot-com crash, the rise of Web 2.0, the emergence of e-commerce 3.0, and the forthcoming Industry 4.0 wave that’s projected to shake it all loose again. Haven’t we done this already? And isn’t that what Google and Wikipedia are for?
Not according to Tung, who started work on the Diffbot graph while at Stanford University in 2008 and then started the Diffbot company in 2011. While it’s true that Google and Wikipedia are creating large knowledge graphs, they’re not as useful as one might think, Tung says.
“Our knowledge base is not only larger, deeper and more accurate [than Google’s and Wikipedia’s] but it’s accessible and more useful,” Tung says. “We hope that this is the first step in creating a future where…you have almost infinite access to knowledge.”
Tung says that what makes Diffbot unique, apart from its size and public nature, is how it’s assembled. While Google and Wikipedia rely largely on human labor to curate the information that goes into their graphs, and Facebook relies on its 2 billion users to create its knowledge graph, the Diffbot graph is created automatically (autonomously, really) through a variety of machine learning techniques, including computer vision, natural language processing (NLP), and others.
The Diffbot knowledge base currently has 10 billion vertices, which correspond to entities, including people, places and things. Connecting those 10 billion entities are 10 trillion edges, which are facts that can be searched through an API or DQL, the SQL-like Diffbot Query Language.
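As a rough illustration of the vertex-and-edge model described above, here is a minimal Python sketch. The `ToyKnowledgeGraph` class, its `add_fact` and `lookup` helpers, and the predicate names are all hypothetical; they say nothing about how Diffbot actually stores its graph or how DQL queries are written. The only figures taken from the article are the entity and edge counts used in the final line.

```python
from collections import defaultdict


class ToyKnowledgeGraph:
    """Hypothetical toy graph: vertices are entities, edges are facts."""

    def __init__(self):
        self.entities = set()
        # Maps a subject entity to a list of (predicate, object) pairs.
        self.facts = defaultdict(list)

    def add_fact(self, subject, predicate, obj):
        # Each fact is an edge connecting two entities.
        self.entities.update([subject, obj])
        self.facts[subject].append((predicate, obj))

    def lookup(self, subject, predicate):
        # Return every object linked to `subject` by `predicate`.
        return [o for p, o in self.facts.get(subject, []) if p == predicate]


graph = ToyKnowledgeGraph()
graph.add_fact("Mike Tung", "founder_of", "Diffbot")
graph.add_fact("Diffbot", "founded_in", "2011")
graph.add_fact("Diffbot", "data_center_location", "Fremont, California")

founder_result = graph.lookup("Mike Tung", "founder_of")

# Figures quoted in the article: 10 billion entities, 10 trillion edges.
avg_edges_per_entity = (10 * 10**12) // (10 * 10**9)
```

With the article's numbers, the graph averages about 1,000 facts per entity, which gives a sense of how densely each person, place or thing is described.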
Every month, the Diffbot crawlers and AI bots fan out across the World Wide Web’s 70 million Web pages and identify 100 million new entities, which are added to the graph. They also crawl the Deep Web and the Dark Web; the Deep Web results are added to the WWW bucket, while the Dark Web results are kept separate, Tung says.
“We do a full crawl of the Web. We’re one of the few US entities that does full Web crawling, the others being Google and Bing,” Tung says. “The output of the knowledge graph is a few petabytes, but the input data that it reads to build the knowledge graph is orders of magnitude larger.”
More than 450 companies pay Diffbot for access to its knowledge graph. That includes companies with big Web presences, like Bing, eBay, Amazon, Pinterest, Snapchat, DuckDuckGo, Yandex, and Wal-Mart.
Diffbot has been programmed with high-level categories, such as people (and things about people), products, images, and articles. Beyond that, Diffbot has not been pre-programmed to differentiate anything. Instead, the software itself works out what’s similar and what’s true among the different objects that it encounters.
During a demo, Tung showed how Diffbot could be used to browse entities, such as mattresses or stories written by a certain tech reporter.
“Only the upper ontology is specified by the engineers at Diffbot,” Tung says. “The lower levels, such as king or queen-sized bed — that was never pre-programmed into it. It crawled all the products on the Web and it learned that hey, a lot of stores categorize their product this way.”
There are several ways to use Diffbot. For example, a salesperson may use it to drum up prospects in certain industries and certain geographies. If the head of HR for a midsize construction firm in Washington State has a public persona (and really, who doesn’t these days?), then Diffbot can find them, categorize them, and surface their pertinent details for a nominal charge.
If you wanted to see how your favorite Silicon Valley startup is doing in the diversity department, you could instruct Diffbot to deliver the ratio of men to women at a certain company, Tung says.
“If you wanted to compile this by just doing Web research, it would take many, many man months because you have to first get a list of all employees,” he says. “Whereas here, you can synthesize all this information across the Web. It’s run computer vision. It knows from the face whether it’s female or male. It’s combining multiple signals, from the image, the text, and the layout, to understand the facts and properties of entities and then you can do aggregate analysis of it, within milliseconds.”
Web as Unstructured Database
Research shows that knowledge workers spend upwards of 30% of their time just looking for information. Because Diffbot brings a structure to the unstructured data sitting on the Web, it holds the potential to automate this data foraging in a way that Google was never designed to do.
“Google is merely a card catalog to the Web,” Tung says. “You type in a query and it says there are 10 million results and then it sorts results by relevance. But the results in Google are really just a pointer or a link to a page that you have to go read to get the information for yourself. Google doesn’t actually help you do that.”
Entity graph databases are not new, and some large companies have built their own knowledge bases to categorize internal information and streamline access to useful data. In fact, some Diffbot customers have combined internal data stored in PDFs, Word documents, email servers, and enterprise applications like CRM and ERP systems to elevate their information accessibility to another level.
As Tung was looking for a way to scale Diffbot, he tried out various off-the-shelf databases. “All the commercial graph databases we tried pretty much crashed when we tried to load data,” he says. “Most of the off-the-shelf graph databases we tested locked up and froze between 10 million to 100 million entities… So we ended up building something proprietary.”
Diffbot builds its own servers and houses them in a pair of data centers in Fremont. A typical machine is based on a Supermicro or Dell chassis, and includes a 48-core processor, 1TB of RAM, and 32 4TB SSDs. The cluster has high I/O requirements, so the servers are connected through 10GbE switches.
The main graph database spans “a couple tens-of-thousand cores,” according to Tung, which would put the cluster size somewhere around 400 nodes. The database doesn’t run on the cloud because the public cloud providers don’t have machine sizes big enough to handle the workload.
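The node estimate above is simple division, and a quick Python sketch makes the assumptions explicit. Reading "a couple tens-of-thousand cores" as roughly 20,000 is an assumption, as is one 48-core processor per machine; the per-node SSD figures come from the server description in the article.

```python
# Back-of-the-envelope check of the cluster size.
# Assumption: "a couple tens-of-thousand cores" means roughly 20,000.
total_cores = 20_000
cores_per_node = 48      # one 48-core processor per machine (assumed)
ssds_per_node = 32       # from the server description
ssd_capacity_tb = 4      # from the server description

estimated_nodes = round(total_cores / cores_per_node)
raw_ssd_tb_per_node = ssds_per_node * ssd_capacity_tb

print(f"Estimated nodes: {estimated_nodes}")
print(f"Raw SSD capacity per node: {raw_ssd_tb_per_node} TB")
```

Under those assumptions the estimate lands a little above 400 nodes, in line with the figure cited above, with each machine carrying 128TB of raw SSD capacity.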
The amount of data the world generates each day is increasing at an almost exponential rate, but our ability to turn it into useful information has not kept up. Techniques such as Diffbot’s hold the potential to streamline that process.
“We really want to democratize access to information,” Tung says. “People think only Google and Facebook have this level of information, and they don’t make their databases public for a variety of reasons… They’re working on behalf of advertisers.”
Companies spend a large amount of their time keeping databases up to date and doing data entry. “We think human beings just should not do this kind of work in the future,” Tung says. “It can all be done much better on AI systems, so we can spend time actually leveraging data and analyzing it.”