September 25, 2018

ML Powers Discovery In GE’s 500 PB Lake

Alex Woodie


Like most Fortune 50 firms, General Electric relies on an abundance of computer systems to power its enterprise. And as at most firms that size, syncing up and aligning the data emitted by different systems is a major challenge. But GE found a solution in an innovative data discovery tool powered by machine learning.

GE’s Hadoop-based data lake contains 500 PB of data that originated from about 120 different systems, according to Diwakar Goel, the VP and Chief Data Officer of GE Digital and Finance. Data is sourced from a variety of ERP packages, accounting systems, and other applications, such as Ariba, Concur, and Salesforce.com. Even LinkedIn and Twitter data makes it into the lake for downstream sentiment analysis.

As Goel explained to Datanami during a briefing at the recent Strata Data Conference, getting data into Hadoop is one thing. But getting actual value out of the data (that is, building successful data science and data analytics products) is a much tougher challenge.

“Ingesting the data is the easiest part,” he said. “I have the data from all of the sources. But now I have 130,000 entities in my data. Now, how do you identify all the relationships in all those entities?”

There are things that customers know about their data, and there are things they don’t. “If you ingest so much data, you are ingesting to find insights in the data that you don’t know of,” Goel continued. “That’s where you cannot rely on manual techniques to identify it and give you those insights. That’s why you use machine learning.”

GE found a potential solution to the problem in Io-Tahoe, a New York City-based data management startup that emerged from Centrica, a £28-billion company that owns British Gas and other subsidiaries. Io-Tahoe has developed a data discovery tool that uses patent-pending machine learning technology to determine the relationships between disparate pieces of data.

According to Rohit Mahajan, who co-founded Io-Tahoe with Oksana Sokolovsky and is its chief technology/product officer, this class of software needs machine learning technology because it’s the only way to solve the data discovery challenge at big data scale.

“We are curing the root cause, not the symptom,” he said during the Strata meeting. “Brute forcing the data is [a] big problem.”

“Brute forcing” the problem is a reference to traditional approaches to master data management (MDM), whereby humans help classify the data and keep it straight in a master index. That just doesn’t work anymore for a company like GE, which generates 5 million transactions per minute.

“What’s the other option?” Mahajan said. “I call it death by Excel. That’s what organizations are doing today.”

The smart data discovery approach lets GE harmonize its operational data, even if the data originated in different types of databases on different sides of the world. For example, during the order-to-cash process, a client might generate a slew of documents (purchase orders, invoices, and receipts) across a variety of steps. Those steps may execute in a handful of different accounting systems, which complicates matters for executives looking to improve the bottom line.

“You need to tie your relationships from the 10th element to the first element,” Goel said. “That’s where the fundamental root is. That’s what Io-Tahoe is trying to do with smart discovery. Once you solve that, you have 20 different applications.”
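To make the idea of tying elements together more concrete, here is a minimal sketch of one simple heuristic for surfacing candidate relationships between columns in different systems: checking how many of one column's values also appear in another. This is not Io-Tahoe's patent-pending technique, which the article does not describe. The table names, column names, values, and the 0.9 threshold are all hypothetical.

```python
from itertools import permutations


def containment(child, parent):
    """Fraction of distinct non-null values in `child` that also appear in `parent`."""
    child_vals = {v for v in child if v is not None}
    parent_vals = {v for v in parent if v is not None}
    if not child_vals:
        return 0.0
    return len(child_vals & parent_vals) / len(child_vals)


def candidate_links(tables, threshold=0.9):
    """List cross-table column pairs whose containment score meets the threshold.

    `tables` maps a table name to a dict of column name -> list of values.
    Returns (child_column, parent_column, score) tuples, best score first.
    """
    columns = [
        (table, col, values)
        for table, cols in tables.items()
        for col, values in cols.items()
    ]
    links = []
    for (c_table, c_col, c_vals), (p_table, p_col, p_vals) in permutations(columns, 2):
        if c_table == p_table:
            continue  # only look for relationships across tables
        score = containment(c_vals, p_vals)
        if score >= threshold:
            links.append((f"{c_table}.{c_col}", f"{p_table}.{p_col}", score))
    return sorted(links, key=lambda link: -link[2])


# Hypothetical extracts from two systems involved in order-to-cash.
tables = {
    "erp_orders": {"po_number": ["PO-1", "PO-2", "PO-3"], "amount": [100, 250, 75]},
    "billing_invoices": {"order_ref": ["PO-1", "PO-3"], "invoice_total": [100, 75]},
}

for child, parent, score in candidate_links(tables):
    print(f"{child} -> {parent} (containment {score:.2f})")
```

On this toy data the sketch links `billing_invoices.order_ref` to `erp_orders.po_number`, but it also pairs the two amount columns because their values happen to coincide. Overlap alone produces false positives like that, and comparing every column against every other grows quadratically, which hints at why this kind of brute forcing struggles at 130,000 entities.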

GE has a team of more than 30 data scientists who are tasked with finding insights in data and exploiting them with business processes. As the CDO, Goel is using Io-Tahoe to build data models that target specific domains, such as the general ledger.

“You are basically building a data model that makes it agnostic which ERP the data came from,” he said. “Because most people today are not struggling to ingest the data. Most people are struggling to make sense out of the data once they ingest it.”
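As a rough illustration of what a source-agnostic data model can look like, the sketch below normalizes general ledger records from two systems into one shared shape. It is an assumption-laden example, not GE's or Io-Tahoe's implementation: the system names (`erp_a`, `erp_b`), the field names, and the `LedgerEntry` structure are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class LedgerEntry:
    """Canonical general ledger record, independent of the source system."""
    account: str
    amount: Decimal
    currency: str
    posted_on: date


# Hypothetical per-system mappings from canonical field to source field name.
FIELD_MAPS = {
    "erp_a": {"account": "GL_ACCT", "amount": "AMT",
              "currency": "CURR", "posted_on": "POST_DT"},
    "erp_b": {"account": "ledger_account", "amount": "value",
              "currency": "ccy", "posted_on": "posting_date"},
}


def to_ledger_entry(source, record):
    """Convert a raw record from `source` into the canonical LedgerEntry."""
    mapping = FIELD_MAPS[source]
    return LedgerEntry(
        account=str(record[mapping["account"]]),
        # Go through str() so floats do not carry binary rounding into Decimal.
        amount=Decimal(str(record[mapping["amount"]])),
        currency=str(record[mapping["currency"]]).upper(),
        posted_on=date.fromisoformat(record[mapping["posted_on"]]),
    )


entry_a = to_ledger_entry(
    "erp_a",
    {"GL_ACCT": "4000", "AMT": "125.50", "CURR": "usd", "POST_DT": "2018-09-25"},
)
entry_b = to_ledger_entry(
    "erp_b",
    {"ledger_account": 4000, "value": 125.5, "ccy": "USD",
     "posting_date": "2018-09-25"},
)

# The same posting from either system ends up as an equal canonical record.
print(entry_a == entry_b)
```

Once records share one shape, downstream analysis no longer needs to know which ERP a row came from. The hard part in practice is producing the mappings, which is the discovery work the article describes.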

The company also uses Tamr for targeted data preparation work, but it relies on Io-Tahoe for the heavy lifting of determining complex and sub-sectional relationships among disparate data sets.

“Having data profiled in a meaningful way really saves that time and it accelerates your algorithm building timeframe,” Goel said. “They really make us far more efficient on how we consume the data.”

Data discovery is often linked with data cataloging. After all, you can’t build a catalog if you don’t know where the data is. Io-Tahoe launched its own data catalog earlier this year, but GE uses a separate offering from Alation.

“People want to build products [using] data science algorithms,” Goel said. “They have ingestion of data products. They have nothing in between. Unless you do this in between, there’s no way you can build something and scale it out.”


Editor’s note: This article has been corrected. Io-Tahoe is headquartered in the US, not the UK. Datanami regrets the error.

Applications: Data Mining

Technologies: Frameworks

Sectors: Financial Services

Vendors: Io-Tahoe

Tags: algorithms, big data, data discovery, data lake, machine learning, smart discovery
