ML Powers Discovery In GE’s 500 PB Lake
Like most Fortune 50 firms, General Electric relies on an abundance of computer systems to power its enterprise. And like most firms that size, it finds syncing up and aligning the data emitted by those different systems to be a major challenge. But thanks to an innovative data discovery tool powered by machine learning, GE has found a solution.
GE’s Hadoop-based data lake contains 500 PB of data that originated from about 120 different systems, according to Diwakar Goel, the VP and Chief Data Officer of GE Digital and Finance. Data is sourced from a variety of ERP packages, accounting systems, and other applications, such as Ariba, Concur, and Salesforce.com. Even LinkedIn and Twitter data makes it into the lake for downstream sentiment analysis.
As Goel explained to Datanami during a briefing at the recent Strata Data Conference, getting data into Hadoop is one thing. But getting actual value out of the data (that is, building successful data science and data analytics products) is a much tougher challenge.
“Ingesting the data is the easiest part,” he said. “I have the data from all of the sources. But now I have 130,000 entities in my data. Now, how do you identify all the relationships in all those entities?”
There are things that customers know about their data, and there are things they don’t. “If you ingest so much data, you are ingesting to find insights in the data that you don’t know of,” Goel continued. “That’s where you cannot rely on manual techniques to identify it and give you those insights. That’s why you use machine learning.”
GE found a potential solution to the problem in Io-Tahoe, a New York City-based data management startup that emerged from Centrica, a £28-billion company that owns British Gas and other subsidiaries. Io-Tahoe has developed a data discovery tool that uses patent-pending machine learning technology to determine the relationships between disparate pieces of data.
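The article does not describe how Io-Tahoe's patent-pending technology works, so the following is only a rough illustration of what "determining relationships between disparate pieces of data" can mean in practice. It uses a simple value-overlap heuristic, not machine learning, to flag columns in different tables that may refer to the same entity. The table names, column names, values, and the 0.9 threshold are all hypothetical.

```python
from itertools import combinations


def containment(a, b):
    """Fraction of the distinct values in `a` that also appear in `b`."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


def candidate_links(tables, threshold=0.9):
    """Propose cross-table column relationships based on value overlap.

    `tables` maps a table name to a dict of {column name: list of values}.
    Returns (source, target, score) tuples where nearly all of the source
    column's distinct values are found in the target column.
    """
    columns = [
        (table, column, values)
        for table, cols in tables.items()
        for column, values in cols.items()
    ]
    links = []
    for (t1, c1, v1), (t2, c2, v2) in combinations(columns, 2):
        if t1 == t2:
            continue  # only relationships across different tables
        # Containment is asymmetric, so check both directions.
        for src, dst, score in (
            ((t1, c1), (t2, c2), containment(v1, v2)),
            ((t2, c2), (t1, c1), containment(v2, v1)),
        ):
            if score >= threshold:
                links.append((src, dst, score))
    return links


# Hypothetical sample data standing in for two source systems.
tables = {
    "erp_invoices": {
        "invoice_id": ["INV-1", "INV-2", "INV-3"],
        "po_number": ["PO-10", "PO-11", "PO-10"],
    },
    "procurement_orders": {
        "po_number": ["PO-10", "PO-11", "PO-12"],
        "supplier": ["Acme", "Globex", "Acme"],
    },
}

links = candidate_links(tables)
for src, dst, score in links:
    print(f"{src[0]}.{src[1]} -> {dst[0]}.{dst[1]} (overlap {score:.2f})")
```

Comparing every pair of columns this way grows quadratically with the number of columns, which hints at why exhaustive approaches become impractical at the scale Goel describes.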
According to Rohit Mahajan, who co-founded Io-Tahoe with Oksana Sokolovsky and is its chief technology/product officer, this class of software needs machine learning technology because it’s the only way to solve the data discovery challenge at big data scale.
“We are curing the root cause, not the symptom,” he said during the Strata meeting. “Brute forcing the data is big problem.”
“Brute forcing” the problem is a reference to traditional approaches to master data management (MDM), whereby humans help classify the data and keep it straight in a master index. That just doesn’t work anymore for a company like GE, which generates 5 million transactions per minute.
“What’s the other option?” Mahajan said. “I call it death by Excel. That’s what organizations are doing today.”
The smart data discovery approach lets GE harmonize its operational data, even if the data originated in different types of databases on different sides of the world. For example, during the order-to-cash process, a client might generate a slew of documents (purchase orders, invoices, and receipts) across a variety of steps. Those steps may execute in a handful of different accounting systems, which complicates matters for executives looking to impact the bottom line.
“You need to tie your relationships from the 10th element to the first element,” Goel said. “That’s where the fundamental root is. That’s what Io-Tahoe is trying to do with smart discovery. Once you solve that, you have 20 different applications.”
GE has a team of more than 30 data scientists who are tasked with finding insights in data and exploiting them in business processes. As the CDO, Goel is using Io-Tahoe to build data models that target specific domains, such as the general ledger.
“You are basically building a data model that makes it agnostic which ERP the data came from,” he said. “Because most people today are not struggling to ingest the data. Most people are struggling to make sense out of the data once they ingest it.”
The company also uses Tamr for targeted data preparation work, but it relies on Io-Tahoe to do the heavy lifting when it comes to determining complex and sub-sectional relationships among disparate data sets.
“Having data profiled in a meaningful way really saves that time and it accelerates your algorithm building timeframe,” Goel said. “They really make us far more efficient on how we consume the data.”
Data discovery is often linked with data cataloging. After all, you can’t build a catalog if you don’t know where the data is. Io-Tahoe launched its own data catalog earlier this year, but GE uses a separate offering from Alation.
“People want to build products [using] data science algorithms,” Goel said. “They have ingestion of data products. They have nothing in between. Unless you do this in between, there’s no way you can build something and scale it out.”
Editor’s note: This article has been corrected. Io-Tahoe is headquartered in the US, not the UK. Datanami regrets the error.