One Approach to Boosting Data Accuracy at Scale
Big data platform vendors make the case that “data overload,” driven by the proliferation of mobile devices along with the emergence of Internet of Things (IoT) sensor networks, is making it harder than ever to extract accurate information from this data waterfall. Even harder is connecting the dots to provide analyses that can be used to make reasoned business decisions.
Hence, analytics vendors like Microsoft partner Webtrends are stressing, in recent product releases, the need for “achieving data accuracy at scale.” The company touts its new big data platform called “Infinity” as ready to tackle the IoT data overload via an “object-centric big data store.” That approach, Webtrends data scientist Ethan Dereszynski argues in a blog post, extends “beyond web analytics and visitor behaviors to collecting and understanding interactions from devices, sensors and any other networked object.”
Added Dereszynski: “With these high volumes of very granular data, one of the challenges… is how to interrogate such large datasets and produce results that are trustworthy, at scale, with high performance and flexibility.”
Leveraging Apache Spark and other open source tools, Webtrends said it has been ingesting client data into its big data platform since 2014. The company claims it can make data available to clients for analysis as soon as it is collected. “The next step here is to make this data available in real-time for analysis and reporting,” noted Peter Crossley, Webtrends’ director of product architecture and technology, in a separate blog post.
“This is not limited to just certain data or key metrics, but all collected data available for exploration in real-time, which is not that far away,” Crossley predicted.
Along with Apache Spark, the company also is leveraging Hadoop, HDFS and Apache Samza, the distributed stream-processing framework.
Crossley said Webtrends provides customers with data extracts, “but the latency between data collection and data availability is now too long.” The company’s goal has been to help customers “shorten the window between the time a visitor buys a product on the website to the time when that visitor level record is available for consuming into a client’s customer intelligence system.”
It is also adding new applications and encrypted data connectors designed to deliver real-time insights to clients about customer purchases.
Meanwhile, Dereszynski stressed that the Infinity platform’s query engine uses an algorithm variant to boost the accuracy of data at scale. Specifically, it addresses a limitation of distributed processing: each processor can determine only which values are “distinct” within its own partition of data. Pushing all data onto a single machine is a non-starter since it would quickly overwhelm memory.
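The partition problem can be illustrated with a few lines of Python. This is only a toy sketch: the visitor IDs and the function names `naive_distinct` and `exact_distinct` are made up for illustration and do not come from Webtrends. It shows that adding up each partition's local distinct count overstates the total whenever the same value lands in more than one partition, while the exact answer requires gathering every value in one place.

```python
# Hypothetical visitor IDs split across three partitions (illustrative data only).
partitions = [
    ["u1", "u2", "u3"],
    ["u2", "u3", "u4"],
    ["u1", "u4", "u5"],
]


def naive_distinct(parts):
    """Sum each partition's local distinct count; shared values are counted twice."""
    return sum(len(set(p)) for p in parts)


def exact_distinct(parts):
    """Collect every value in one set; exact, but memory grows with the data."""
    seen = set()
    for p in parts:
        seen.update(p)
    return len(seen)


print(naive_distinct(partitions))  # 9
print(exact_distinct(partitions))  # 5
```

With only five real visitors, the per-partition sum reports nine. The exact version is correct but holds every distinct value in memory at once, which is the single-machine bottleneck described above.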
The algorithm variant in Webtrends’ query engine yields “an approximation to the true count of distinct values derived by inspecting all of the data in the desired time range,” Dereszynski explained (emphasis in original).
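The article does not name the algorithm Webtrends uses, so the following is not its implementation. As a stand-in, here is a minimal sketch of HyperLogLog, a well-known technique for approximate distinct counting in which each partition builds a small fixed-size summary and the summaries are merged. The class name, the choice of SHA-1 as the hash, the precision `p=12`, and the synthetic visitor IDs are all assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative stand-in, not Webtrends' code)."""

    def __init__(self, p=12):
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16 for this sketch")
        self.p = p
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Take 64 bits of a SHA-1 digest as the hash of the value.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Merging is a register-wise maximum, so partitions combine losslessly.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)            # bias constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)          # small-range correction
        return raw


# Three overlapping partitions of synthetic visitor IDs (assumed data).
partitions = [range(0, 40_000), range(20_000, 60_000), range(50_000, 100_000)]
true_count = 100_000

sketches = []
for part in partitions:
    sketch = HyperLogLog()
    for v in part:
        sketch.add(f"visitor-{v}")
    sketches.append(sketch)

merged = HyperLogLog()
for sketch in sketches:
    merged.merge(sketch)

naive_sum = sum(len(part) for part in partitions)   # counts overlaps twice
print("sum of per-partition counts:", naive_sum)
print("merged estimate:", round(merged.estimate()), "true:", true_count)
```

Each sketch here holds 4,096 small registers regardless of how many values it has seen, so only the summaries need to travel between machines. The merged result is an estimate rather than an exact count; at this precision the typical relative error is on the order of 1 to 2 percent.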
Webtrends, Portland, Ore., announced earlier this month that London Gatwick Airport would use its analytics platform based on Microsoft’s SharePoint collaboration and document management platform to measure how airport employees use the airport’s intranet.
Along with Microsoft (NASDAQ: MSFT) and airlines, the company also counts Toyota (NYSE: TM) and Merck (NYSE: MRK) among its customers.