March 16, 2013

How Ford is Putting Hadoop Pedal to the Metal


As the buzz around big data pushes company leaders to grill IT and R&D departments on what the value might be, more technology teams are being steamrolled into new territory.

In some cases, especially for smaller IT shops, if the opportunity presents itself to sandbox some new cluster or software toys, grumbling will be minimal. But when it's global infrastructure, and it's already stretched thin, new solutions have to be spun into production-ready resources. This requires careful looking before leaping.

For some, especially at the top of the Fortune 500 heap, it can be an uphill climb to implement these new technologies, even when the ROI is clear. Existing infrastructure, applications, silos and other barriers can stand in the way of turning new tech into reality. According to Michael Cavaretta, data science lead at Ford Motor, when his research team is able to prove real value for new architectures or approaches, Ford will fire on all cylinders to act. The problem is not just in that big-picture integration; it's about the overall value prop.

In a recent conversation, Cavaretta told us about the experimental "12-or-so" node Hadoop cluster his research and engineering team has been testing. The cluster was stood up around two years ago, and since then it has produced real value for a varied set of applications, even though many are still in the proof-of-concept stage.

The experiment has been working out so well that Ford is looking to rev up its Hadoop ride and drive it into a larger-scale environment.

The data science lead described in detail the considerations foremost in their minds as they evaluate the best way to proceed. He said that while they know the value of using Hadoop as part of their overall data environment, a few things remain unclear when it comes to justifying the cost and value.

Like many other companies looking to Hadoop, the big question at the outset is whether it's more valuable to build or buy. More specifically, whether to go with a distro or appliance "end-to-end" solution, or take the trickier option of turning their engineering research folks into MapReduce and Hadoop experts and rolling their own.

Cavaretta says that so far, they've been able to manage the Apache Hadoop experiment without hiring a bevy of experts, but again, with great growth comes great complexity--and sometimes skilled pros are needed.
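
To make concrete what "rolling their own" would involve, below is a minimal sketch of a hand-written Hadoop MapReduce job in Java, the kind of code an in-house team writes and maintains instead of leaning on a distro's packaged tooling. It counts occurrences of each signal name across vehicle log files; the CSV record layout (vehicleId,timestamp,signalName,value) is a hypothetical stand-in for illustration, not Ford's actual format.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SignalCount {

    // Mapper: emits (signalName, 1) for each log line, assuming a
    // hypothetical CSV layout of vehicleId,timestamp,signalName,value.
    public static class SignalMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text signal = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 3) {
                signal.set(fields[2]);
                context.write(signal, ONE);
            }
        }
    }

    // Reducer (also usable as a combiner): sums the counts per signal name.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "signal count");
        job.setJarByClass(SignalCount.class);
        job.setMapperClass(SignalMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```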

Hadoop engineers don't come cheap, but then again, neither do the fully packaged Hadoop cabinets or large-scale support for distros. "It's all about what presents us with real value," said Cavaretta. But with so many unknowns, how can a company as large as Ford navigate ROI for experimental tech?

“We’re still in the evaluation phase,” Cavaretta explained, saying that appliances and packaged, supported Hadoop distros aren’t being ruled out. What’s missing from the pitches on the appliance and distro side (or at least what might not offer a clear-cut value prop over rolling their own Hadoop cluster) is the flexibility to snap these ready-made solutions into Ford’s heavily open source and algorithmically tailored infrastructure.

“For us, it all boils down to functionality; big data is a problem that we need to get over to get to the analytics, but we need flexibility and ease of use so we can just get right to extracting the data and applying our analytics on the backend.”

At the core of this statement is what will really differentiate vendor offerings for Ford’s Hadoop engine. Cavaretta says their main questions for vendors, whether on the hardware, appliance or distro side, are “what kinds of algorithms are built in, how extensive are those, how much flexibility do we have and what’s the value proposition.”

He told us that “vendors keep saying that they’ll give us the end-to-end, but that becomes a real cost question. We’re still looking at the best ways to derive value.”

The arm of Ford’s research and engineering division that focuses on big data is an open source shop, and one that builds its own purpose-built algorithms. What worries Cavaretta is that they would spend a great deal of money on built-in Hadoop from an appliance or supported distro when they could just rely on their own researchers, who could make custom tweaks according to application needs. The question for vendors then is, how open is open?

Outside of this flexibility angle on the vendor decision side, another functionality problem in turning their big data experiments into realities lies in integration, says Cavaretta. With their disparate, massive datasets, the challenge is not just getting data where it needs to go, but doing the mashups between internal and external data. Taking that data and pulling it into the overarching picture is the goal, but this is far easier said than done. There is a lot of work on the data cleansing, transformation and processing sides that needs to be done before Hadoop can be worked into the overall infrastructure.
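
To illustrate the kind of plumbing that mashup work involves, here is a sketch of the standard reduce-side join pattern in Hadoop MapReduce: tag each record with its source, normalize the shared key, and combine matches in the reducer. The file-naming convention (internal files prefixed "internal") and the CSV layout with the join key in the first field are assumptions for illustration, not a description of Ford's actual pipeline.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Reduce-side join: tag each record with its source, key it by a shared
// field, and combine matching records from both sides in the reducer.
public class MashupJoin {

    public static class TagMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        private final Text joinKey = new Text();
        private final Text tagged = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Tag by input file name so the reducer knows each record's source.
            String file = ((FileSplit) context.getInputSplit()).getPath().getName();
            String tag = file.startsWith("internal") ? "I" : "E";
            String[] fields = value.toString().split(",");
            if (fields.length < 2) {
                return; // cleansing step: drop malformed rows
            }
            joinKey.set(fields[0].trim().toLowerCase()); // normalize the join field
            tagged.set(tag + "|" + value);
            context.write(joinKey, tagged);
        }
    }

    public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            List<String> internal = new ArrayList<String>();
            List<String> external = new ArrayList<String>();
            for (Text v : values) {
                String s = v.toString();
                (s.startsWith("I|") ? internal : external).add(s.substring(2));
            }
            // Emit every pairing of matching internal and external records.
            for (String in : internal) {
                for (String ex : external) {
                    context.write(key, new Text(in + "\t" + ex));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mashup join");
        job.setJarByClass(MashupJoin.class);
        job.setMapperClass(TagMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```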

As one might imagine, Ford has a massive, distributed and varied data environment to begin with—and is creating new data wells with a slew of new connected machines, including the end product, vehicles. For example, Ford’s modern hybrid Fusion model generates up to 25 gigabytes per hour—all data that is a potential goldmine for Ford, as long as it can find the right pickaxes for the job.
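
A quick back-of-envelope calculation shows how fast that per-vehicle figure adds up. The 25 GB/hour comes from the article; the daily driving time and fleet size below are illustrative assumptions, not Ford numbers.

```java
public class FusionDataEstimate {
    public static void main(String[] args) {
        double gbPerHour = 25.0;      // figure cited above for a Fusion hybrid
        double hoursPerDay = 2.0;     // assumed average daily driving time
        int vehicles = 100_000;       // assumed connected-fleet size
        double gbPerVehiclePerYear = gbPerHour * hoursPerDay * 365;        // ~18,250 GB
        double fleetPbPerYear = gbPerVehiclePerYear * vehicles / 1_000_000.0;
        System.out.printf("Per vehicle: ~%.1f TB/year%n", gbPerVehiclePerYear / 1000);
        System.out.printf("Fleet of %,d: ~%,.0f PB/year%n", vehicles, fleetPbPerYear);
    }
}
```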

He noted that the real push to adopt so-called big data technologies has been a driving force at Ford for about a year, and in that time his team has been able to build a high-level view of all the data sources and of how the complex analytics puzzle snaps together across the whole organization. This, as well as the need to process data in new ways, was the impetus for looking to Hadoop, an effort that originally started within Ford’s research-driven arm of IT. Cavaretta says the team started to see that some specific problems fit well into the Hadoop and big data paradigm, and that story is still unfolding.

When it comes to big data infrastructure overall for the entirety of Ford’s data science and analytics arm, Cavaretta says they’re looking at custom hardware from SGI, EMC Greenplum and more traditional analytics providers like SAS. He also noted that cloud-based approaches are appealing to Ford in new ways and offer some advantages for a number of the analytics operations across the company.

In addition to their high performance computing cluster of several thousand nodes, which Cavaretta says is handily located right next door in case his team’s data science projects get out of hand, they are looking at a number of other purpose-driven hardware and software vendor options. The massive HPC cluster generally crunches on a lot of the CAD, CAE and overall manufacturing simulations and apps, but Cavaretta sounded inspired by the hardware end of the big data equation. “My team is working on the bleeding edge stuff,” he said, noting that they are always experimenting to see if they can prove value.

“One of the positives of working in a group like ours is that we cover the entire company with analytics. That view across the entire enterprise gives us a different perspective on where the data opportunities are and then marrying that with all the open data aspects from the web. We just see so many opportunities, especially with the mashups of the different data types.”

It's quite an opportunity to get a first-hand account of how Fortune 100 companies are making investment decisions around Hadoop: what their concerns are, how they're differentiating among vendors, and how they're looking to their own internal people to circumvent both vendors and distros. We'll try to check in again with Michael this time next year to see how much that 12-node cluster baby has grown.

