Lessons In Machine Learning From GE Capital
The financial services industry is always on the cutting edge, and so it is with machine learning at GE Capital, the lending and leasing arm of the industrial giant.
Mingzhu Lu, senior data scientist at GE Capital, has worked at various units of the company for the past eight years, applying machine learning to healthcare and other parts of the business, but now focuses mostly on risk analysis and related applications at GE Capital. At the Hadoop Innovation Summit in San Diego recently, Lu shared the lessons that GE Capital has learned as it has adopted open source machine learning tools.
The first lesson that Lu shared is one that all IT departments should heed, regardless of the type of application they are developing and the systems that they plan to use: establish the benchmark metrics for the application ahead of time, so that you know which targets you have to hit.
“This seems to be very obvious,” explained Lu, “but whenever we have a lot of machine learning tools and we need to choose or build our own based on the business needs, we design performance metrics based on the running time, accuracy, and costs.”
The benchmarking phase takes a lot of time because there are lots of datasets to explore and many different machine learning algorithms, Lu cautions. Depending on the application, one of these three (time, accuracy, or cost) will be more important than the others, and you have to pick your tools and the underlying system configuration to match these goals. In some cases, you have the money and skills to build your own; in some cases you do not. Sometimes there is a mix of both. For instance, several years ago, GE Capital put a very early release of the Weka machine learning tool to work, but it was not scalable enough for its workload. So its techies parallelized the tool by marrying it to the Message Passing Interface (MPI) protocol, which is commonly used on supercomputing clusters that do simulation and modeling.
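To make the three metrics concrete, here is a minimal sketch of a benchmark harness in Python. It is not GE Capital's code: the names `benchmark`, `pick_best`, and `BenchmarkResult`, the toy dataset, and the flat `cost_per_second` rate are all assumptions made for illustration. It times each candidate model, scores its accuracy on labeled samples, estimates a cost from the running time, and then ranks the candidates by whichever metric matters most for the application.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class BenchmarkResult:
    name: str
    seconds: float   # wall-clock running time
    accuracy: float  # fraction of samples predicted correctly
    cost: float      # estimated cost, here simply seconds * rate


def benchmark(
    name: str,
    predict: Callable[[float], int],
    samples: Sequence[Tuple[float, int]],
    cost_per_second: float,
) -> BenchmarkResult:
    """Run one candidate over labeled samples and record all three metrics."""
    start = time.perf_counter()
    predictions = [predict(x) for x, _ in samples]
    seconds = time.perf_counter() - start
    correct = sum(1 for p, (_, y) in zip(predictions, samples) if p == y)
    return BenchmarkResult(name, seconds, correct / len(samples),
                           seconds * cost_per_second)


def pick_best(results: List[BenchmarkResult], priority: str) -> BenchmarkResult:
    """Choose the winner by the metric that matters most for this application."""
    if priority == "accuracy":
        return max(results, key=lambda r: r.accuracy)  # higher is better
    if priority in ("seconds", "cost"):
        return min(results, key=lambda r: getattr(r, priority))  # lower is better
    raise ValueError(f"unknown priority: {priority}")


# Hypothetical toy data: (feature, label) pairs.
SAMPLES = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]

results = [
    benchmark("threshold", lambda x: int(x > 0.5), SAMPLES, cost_per_second=0.01),
    benchmark("always_zero", lambda x: 0, SAMPLES, cost_per_second=0.01),
]
best = pick_best(results, priority="accuracy")
```

In a real evaluation the datasets, the candidate tools, and the cost model would be far richer, but the point Lu makes carries over: decide which of the three metrics drives the choice before comparing tools.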
The second lesson that Lu had for Hadoop enthusiasts was to think beyond MapReduce. The reason, she explained, is that many machine learning problems are solved with optimization methods or heuristic search, which normally are iterative. One example is gradient descent or stochastic gradient descent, where “the optimization problem is going to be iterative and so a MapReduce approach is not going to be suitable.” Solving this problem means deploying the MPI technique mentioned above or Bulk Synchronous Parallel (BSP), a programming model that dates from the 1980s and that has been adopted by Google and others for graph analytics in recent years.
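A small sketch shows why iteration is the sticking point. The Python below is an illustration, not anything Lu presented: it fits a line to toy data with batch gradient descent, and the function names, the learning rate, and the data are assumptions. Each pass has a map-like step (a partial gradient per data partition) and a reduce-like step (summing the partials), but the next pass cannot start until the parameters from the current one are known. Expressed as MapReduce, every pass of the outer loop would become a separate job.

```python
from functools import reduce
from typing import List, Tuple

Partition = List[Tuple[float, float]]  # (x, y) pairs held by one worker


def partial_gradient(partition: Partition, w: float, b: float):
    """Map step: gradient contribution of one partition for the model y = w*x + b."""
    gw = gb = 0.0
    for x, y in partition:
        err = (w * x + b) - y
        gw += err * x
        gb += err
    return gw, gb, len(partition)


def gradient_descent(partitions: List[Partition], lr: float = 0.1,
                     iterations: int = 2000) -> Tuple[float, float]:
    w = b = 0.0
    for _ in range(iterations):
        # Map: every partition computes its gradient using the CURRENT w and b.
        partials = [partial_gradient(p, w, b) for p in partitions]
        # Reduce: sum the partial gradients and the sample counts.
        gw, gb, n = reduce(
            lambda acc, cur: (acc[0] + cur[0], acc[1] + cur[1], acc[2] + cur[2]),
            partials,
        )
        # Update: the next iteration depends on these new values, so the
        # passes cannot run in parallel with one another.
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


# Hypothetical toy data drawn from y = 2x + 1, split across two "workers".
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]
w, b = gradient_descent(partitions)
```

On this toy data the loop recovers a slope close to 2 and an intercept close to 1, but only after thousands of dependent passes, which is the pattern that favors MPI or BSP over one MapReduce job per pass.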
“The difference between them is the level of parallelization in the languages,” explains Lu. “MapReduce is easier for those who are not familiar with parallel programming, but with MPI, you have to be familiar with parallel programming and also the fault tolerance is not so good. So when we do the programming, we need to take care of the fault tolerance as well. It is not like Hadoop. The programming complexity is definitely going to increase, but the flexibility will increase a lot because all of the nodes in the cluster can communicate with each other. It is not like BSP or MapReduce, where they have to wait until the reduce stage to communicate their results.”
Lu expects GE Capital to use not only a mix of programming models, but also a mix of systems to run its machine learning tools. This includes on-premises clusters using standard x86 processors, GPU accelerators where appropriate, and cloud capacity where the datasets and running times suit the on-demand pricing model and the specific hardware available for running the ML models and applications. GE Capital expects to deploy MapReduce and its follow-on in Hadoop 2.0, YARN, which allows other computational and data management schemes to run inside the Hadoop framework.
The key is to remain flexible.
“We first used MPI, then we moved to MapReduce, and then later we found that MapReduce is not as suitable for machine learning and we moved back to MPI. But we needed to develop our own scatter system and fault tolerance.”
The final lesson from GE Capital is that some open source machine learning tools are better than others. For Apache Mahout, the machine learning add-on for Hadoop, Lu says that the recommendation and clustering algorithms work fine, but all of the other features have “space for improvement.” GE Capital is doing early tests on GraphLab and Spark, and she says “the results are exciting and inspiring.” The good thing about Mahout, of course, is that GE Capital can grab the open source code and tweak it as it sees fit, and usually when the company does that, it can get a lot more performance out of whatever algorithm it uses.