How Ford is Putting Hadoop Pedal to the Metal
As the buzz around big data pushes company leaders to grill IT and R&D departments on what the value might be, more technology teams are being steamrolled into new territory.
In some cases, especially for smaller IT shops, if the opportunity arises to sandbox some new cluster or software toys, grumbling will be minimal. But when it's global infrastructure, and it's already stretched thin, new solutions have to be spun into production-ready resources. That requires careful looking before leaping.
For some, especially at the top of the Fortune 500 heap, it can be an uphill climb to implement these new technologies, even when the ROI is clear. Existing infrastructure, applications, silos and other barriers can keep new tech from becoming reality. According to Michael Cavaretta, data science lead at Ford Motor, when his research team is able to prove real value for new architectures or approaches, Ford will fire on all cylinders to act. The problem is not just big-picture integration; it's the overall value prop.
In a recent conversation, Cavaretta told us about Ford’s experimental “12-or-so” node Hadoop cluster his research and engineering team has been testing. The cluster was strung up around two years ago and since then it’s produced some real value for a varied set of applications, even though many are still in proof of concept stages.
The experiment has been working out so well that Ford is looking to rev up its Hadoop ride and drive it into a larger-scale environment.
The data science lead described in detail what considerations are foremost in their minds as they evaluate the best way to proceed. He said that while they know the value of using it as part of their overall data environment, there are still a few things that aren’t clear as they relate to justifying the cost and value.
Like many other companies looking to Hadoop, the big question at the outset is whether it's more valuable to build or buy. More specifically, whether to go with a distro or appliance "end-to-end" solution versus the trickier option of turning their engineering research folks into MapReduce and Hadoop experts and rolling their own.
Cavaretta says that so far, they've been able to manage the Apache Hadoop experiment without hiring a bevy of experts, but again, with great growth comes great complexity, and sometimes skilled pros are needed.
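To give a sense of what "rolling their own" entails, the core of a hand-built MapReduce job is just a mapper that emits key/value pairs and a reducer that aggregates them after Hadoop's sort phase. The sketch below is a generic, hypothetical illustration (the record format and vehicle-event scenario are invented for this example, not from Ford), runnable locally without a cluster:

```python
from itertools import groupby

def mapper(lines):
    # Emit (key, 1) pairs -- here, counting events per vehicle ID
    # from hypothetical "vehicle_id,event" records.
    for line in lines:
        vehicle_id, _, _event = line.strip().partition(",")
        yield vehicle_id, 1

def reducer(pairs):
    # Hadoop sorts mapper output by key before the reduce phase;
    # sorted() + groupby simulates that, then the counts are summed.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["F1,ignition", "F2,brake", "F1,brake"]
    print(dict(reducer(mapper(sample))))  # {'F1': 2, 'F2': 1}
```

In a real Hadoop Streaming deployment, the mapper and reducer would read stdin and write stdout as separate scripts; the point is that the programming model itself is simple, while the operational expertise (tuning, scheduling, failure handling) is where the hired experts come in.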
Hadoop engineers don’t come cheap, but then again, neither do the fully packaged Hadoop cabinets or large-scale support for distros. “It’s all about what presents us with real value,” said Cavaretta. But with so many unknowns, how can a company as large as Ford navigate ROI for experimental tech?
“We’re still in the evaluation phase,” Cavaretta explained, saying that appliance and packaged, supported Hadoop distros aren’t being ruled out. What’s missing from the appliance and distro pitches (or at least what keeps them from offering a clear-cut value prop over rolling their own Hadoop cluster) is the flexibility to snap these ready-made solutions into Ford’s heavily open source, algorithmically tailored infrastructure.
“For us, it all boils down to functionality; big data is a problem that we need to get over to get to the analytics, but we need flexibility and ease of use so we can just get right to extracting the data and applying our analytics on the backend.”
At the core of this statement is what will really differentiate vendor offerings for Ford’s Hadoop engine. Cavaretta says their main question for vendors, whether on the hardware, appliance or distro side, is “what kinds of algorithms are built in, how extensive are those, how much flexibility do we have and what’s the value proposition.”
He told us that “vendors keep saying that they’ll give us the end-to-end, but that becomes a real cost question. We’re still looking at the best ways to derive value.”
Ford’s research and engineering division that focuses on the big data front is an open source shop as well as a purpose-built algorithms one. What worries Cavaretta is that they would spend a great deal of money to buy built-in Hadoop with an appliance or supported distro when they could just rely on their own researchers who could make custom tweaks according to application needs. The question for vendors then is, how open is open?
Outside of this flexibility angle on the vendor side, another functionality hurdle in turning their big data experiments into realities is integration, says Cavaretta. With their disparate, massive datasets, the challenge is not just getting data where it needs to go, but doing the mashups between internal and external data. Pulling that data into the overarching picture is the goal, but this is far easier said than done. There is a lot of work on the data cleansing, transformation and processing sides that needs to be done before Hadoop can fit into the overall infrastructure.
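The cleanse-then-join pattern Cavaretta describes can be sketched in a few lines. Everything here is hypothetical (the warranty-claims and weather records, field names, and region keys are invented for illustration, not Ford data); the point is that internal records usually need normalization before external data can be mashed up against them:

```python
# Hypothetical internal records keyed by region code, with messy keys,
# joined against an external (public) dataset after cleansing.
internal = [
    {"region": " MI ", "claims": 42},
    {"region": "oh", "claims": 17},
]
external = {"MI": {"avg_temp_f": 45}, "OH": {"avg_temp_f": 50}}

def clean(record):
    # Cleansing/transformation step: trim whitespace, normalize case
    # so internal keys line up with the external dataset's keys.
    record["region"] = record["region"].strip().upper()
    return record

def mashup(internal_rows, external_lookup):
    # Enrich each cleaned internal row with matching external fields.
    for row in map(clean, internal_rows):
        yield dict(row, **external_lookup.get(row["region"], {}))

print(list(mashup(internal, external)))
```

At Ford's scale this step would run as distributed jobs rather than in-memory dicts, but the shape of the work — normalize keys, then join internal against external — is the same.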
As one might imagine, Ford has a massive, distributed and varied data environment to begin with—and is creating new data wells with a slew of new connected machines, including the end product, vehicles. For example, Ford’s modern hybrid Fusion model generates up to 25 gigabytes per hour—all data that is a potential goldmine for Ford, as long as it can find the right pickaxes for the job.
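A back-of-envelope calculation shows why that per-vehicle figure matters at fleet scale. The driving hours and fleet size below are assumptions made up for illustration, not Ford figures; only the 25 GB/hour rate comes from the article:

```python
GB_PER_HOUR = 25          # rate cited for the hybrid Fusion
HOURS_DRIVEN_PER_DAY = 1  # assumed average; not a Ford figure
FLEET_SIZE = 100_000      # assumed instrumented fleet; not a Ford figure

per_vehicle_daily_gb = GB_PER_HOUR * HOURS_DRIVEN_PER_DAY
fleet_daily_tb = per_vehicle_daily_gb * FLEET_SIZE / 1024
print(f"{fleet_daily_tb:.0f} TB/day")  # ~2441 TB/day
```

Even under these modest assumptions, a connected fleet generates petabytes within days, which is exactly the scale Hadoop-style infrastructure is built for.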
He noted that the real push to adopt so-called big data technologies has been a driving force at Ford for about a year, and in that time his team has been able to build a high-level view of all the data sources and how the complex analytics puzzle snaps together across the whole organization. This, along with the need to process data in new ways, was the impetus for looking to Hadoop, which originally started within Ford’s research-driven arm of IT. Cavaretta says the team started to see specific problems that fit well into the Hadoop and big data paradigm, and that story is still unfolding.
When it comes to big data infrastructure overall for the entirety of Ford’s data science and analytics arm, Cavaretta says they’re looking at custom hardware from SGI, EMC Greenplum and more traditional analytics providers like SAS. He also noted that cloud-based approaches are appealing to Ford in new ways and offer some advantages for a number of the analytics operations across the company.
In addition to their high performance computing cluster of several thousand nodes, which Cavaretta says is handily located right next door in case his team’s data science projects get out of hand, they are looking at a number of other purpose-driven hardware and software vendor options. The massive HPC cluster generally crunches on a lot of the CAD, CAE and overall manufacturing simulations and apps, but Cavaretta sounded inspired by the hardware end of the big data equation. “My team is working on the bleeding edge stuff,” he said, noting that they are always experimenting to see if they can prove value.
“One of the positives of working in a group like ours is that we cover the entire company with analytics. That view across the entire enterprise gives us a different perspective on where the data opportunities are and then marrying that with all the open data aspects from the web. We just see so many opportunities, especially with the mashups of the different data types.”
It’s quite an opportunity to get a first-hand account of how Fortune 100s are making investment decisions in Hadoop: what their concerns are, how they’re differentiating among vendors, and how they are looking to their own internal people to sidestep both vendors and distros. We’ll try to check in again with Michael this time next year to see how much that 12-node cluster baby has grown.