There are competing views of how we should tackle an abundance of data, which I’ve referred to as big data’s “odd couple”.
One camp, made up of semantic idealists who fetishize taxonomies, wants to tag and organize it all. Once we’ve marked everything and how it relates to everything else, they hope, the world will be reasonable and understandable.
The poster child for the Semantic Idealists is Wolfram Alpha, a “reasoning engine” that understands, for example, a question like “how many blue whales does the earth weigh?”—even if that question has never been asked before. But it’s completely useless until someone’s told it the weight of a whale, or the earth, or, for that matter, what weight is.
In Lewis Carroll’s Sylvie and Bruno Concluded (1893), a traveller from another planet, known only as Mein Herr, learns about Earth’s maps.
“That’s another thing we’ve learned from your Nation,” said Mein Herr, “map-making. But we’ve carried it much further than you. What do you consider the largest map that would be really useful?”
“About six inches to the mile.”
“Only six inches!” exclaimed Mein Herr. “We very soon got to six yards to the mile. Then we tried a hundred yards to the mile. And then came the grandest idea of all! We actually made a map of the country, on the scale of a mile to the mile!”
“Have you used it much?” I enquired.
“It has never been spread out, yet,” said Mein Herr: “the farmers objected: they said it would cover the whole country, and shut out the sunlight! So we now use the country itself, as its own map, and I assure you it does nearly as well.”
The example underscores one of the frustrations of this semantic idealism—that to perfectly tag the world around us requires an effort that approaches the world itself.
The other partner in big data’s odd couple is the chaotic nihilist. She’s abandoned any hope of properly tagging the world, and relies on machines to find the most relevant or appropriate information. Her kind are the machine-learning data scientists who are convinced that given enough data and the right algorithm, the best results will bubble to the top.
Wolfram Alpha’s counterpart for the Algorithmic Nihilists is IBM’s Watson, a search engine that guesses at answers based on probabilities (and famously won on Jeopardy). Watson was never guaranteed to be right, but it was really, really likely to have a good answer. It also wasn’t easily controlled: when it crawled the Urban Dictionary website, it started swearing in its responses, and IBM’s programmers had to excise some of its more colorful vocabulary by hand.
The chaotic nihilist is wrong too.
The future of data is a blend of both semantics and algorithms. That’s one reason Google recently introduced a second search engine, called the Knowledge Graph, that understands queries. Knowledge Graph was based on technology from Metaweb, a company Google acquired in 2010, and it augments “probabilistic” algorithmic search with a structured, tagged set of relationships.
We can learn a lot about this blend by considering how accountants look at a cup of coffee. How would you file such a thing? Would you file it under coffee, or cup, or the drinker’s name (say, Alistair)? The answer, in a physical filing system, is that it depends on how you plan to use it. If you wanted to charge people for their coffee, you’d file it by name. If you wanted to compare what kinds of hot drinks people consumed, you’d file it under coffee. And if you wanted to do an inventory of dishware, you’d file it under cups.
Such things have rules. Accountants spend years learning the Generally Accepted Accounting Principles (GAAP) that govern how and where to file things. In their world, if you wanted to do the three kinds of analysis, you’d need three physical copies of the cup, to stuff into three filing cabinets. And then if you changed one cup—say, giving it a price—the other two copies would be out of date.
In a digital age, this example is nonsense. We have what Heidegger would call the fundamental “thingness” of the item being filed, that “around which the properties have assembled.” And then we have those properties: Alistair; Coffee; Cup. We have tags, and we can extend them.
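The contrast with the three filing cabinets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Item class, the find_by_tag helper, the extra "ceramic" tag, and the price are all hypothetical, invented here to show one record carrying several tags rather than any real accounting system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Item:
    """One record for the thing itself, with its properties attached as tags."""
    description: str
    tags: Set[str] = field(default_factory=set)
    price: Optional[float] = None  # hypothetical property, unset at first


def find_by_tag(items: List[Item], tag: str) -> List[Item]:
    """Return every item that carries the given tag."""
    return [item for item in items if tag in item.tags]


# A single record for the cup of coffee, tagged three ways.
cup = Item("cup of coffee", tags={"Alistair", "coffee", "cup"})
records = [cup]

# Each kind of analysis is a different lookup over the same record.
billing = find_by_tag(records, "Alistair")   # who to charge
drinks = find_by_tag(records, "coffee")      # what people drink
dishware = find_by_tag(records, "cup")       # inventory of dishes

# Pricing the cup once updates every view, because there is only one record.
cup.price = 2.50  # made-up price, for illustration only

# Tags can be extended later without refiling anything.
cup.tags.add("ceramic")
```

All three lookups return the same object, so nothing goes out of date when the cup gains a price or a new tag.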
Accounting, and with it much of how business operates, is still mired in the bog of atoms rather than soaring with the flexibility of bits. Many of the tools we rely on today fail to embrace the power of tagging and semantics out of sheer inertia. Given modern technology (relational databases, tagging, and so on), nobody starting from scratch would design the General Ledger or the strictures of GAAP. Yet they persist, and they slow the progress of the semantic idealists, and of data-driven business in general.
About the Author
Alistair Croll is an entrepreneur and technology analyst. He’s worked on web performance, big data, cloud computing, and startups. In 2001, he co-founded web performance startup Coradiant, and since that time has also launched Rednod, CloudOps, Bitcurrent, Year One Labs, the Bitnorth conference, and several other early-stage companies.
Alistair is the author of three books on web performance, analytics, and IT operations. He’s also the author of the forthcoming Lean Analytics (www.leananalyticsbook.com), a book on using data to build a better business faster, due out in March from O’Reilly Media. Alistair is the chair of O’Reilly’s Strata conference (www.strataconf.com), Cloud Connect, and the International Startup Festival. He lives in Montreal, Canada, and tries to mitigate chronic ADD by writing about far too many things at Solve For Interesting (www.solveforinteresting.com).
Note: the Heidegger quotation is from The Origin of the Work of Art.