H2O Users Share Data Science Stories
Data scientists have a plethora of analytics tools available to them, and features that appeal to one data scientist don’t necessarily work for another. Users of H2O.ai’s offerings gave different reasons for their choices.
Last week, Datanami was a guest at H2O.ai’s annual user conference, called H2O World, and had a chance to talk with several customers, including Ruben Diaz, a data scientist with Vision Banco, and Bharath Sudharsan, director of data science and innovation at Armada Health.
Vision Banco is one of Paraguay’s largest banks, with consumer and micro-finance lines of business. An IBM shop, Vision Banco has long relied on Big Blue’s offerings, including SPSS and Cognos tools for BI and analytics. It also relied on an IBM i-based Power Systems server, the modern relative of the venerable AS/400, to run core banking applications, as many banks in Latin America do.
Vision Banco began its foray into applied statistics five years ago, when it used SPSS to develop relatively simple logistic regression models, which were implemented as stored procedures directly in the Db2 for i database that underlies its banking software. The statistical packages helped the company develop predictive applications that are common in the banking industry, such as credit scoring and fraud detection.
Then it brought KNIME and R into the fold to develop more advanced random forest and GBM models, which were exported in Predictive Model Markup Language (PMML) and implemented via REST Web services.
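A logistic regression model reduces to a weighted sum passed through a sigmoid function, which is simple enough to express in a database stored procedure. The following is a minimal Python sketch of that scoring step, not Vision Banco's actual model: the feature names, the coefficients, and the `default_probability` function are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical intercept and coefficients, for illustration only.
# They are NOT taken from Vision Banco or from the article.
INTERCEPT = -2.0
COEFFICIENTS = {
    "late_payments": 0.8,
    "utilization": 1.5,
    "years_as_customer": -0.1,
}


def default_probability(features):
    """Return the logistic-regression probability for one applicant."""
    # Linear part: intercept plus the weighted sum of the inputs.
    z = INTERCEPT + sum(
        COEFFICIENTS[name] * features[name] for name in COEFFICIENTS
    )
    # The logistic (sigmoid) function maps z to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


# Example applicant with made-up values.
applicant = {"late_payments": 2, "utilization": 0.5, "years_as_customer": 5}
score = default_probability(applicant)
print(f"Estimated probability: {score:.3f}")
```

Because the whole model is an intercept, a handful of multipliers, and one exponential, the same arithmetic can be written in SQL or any other language close to the data.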
ML for IBM Power
Vision Banco expanded its toolbox three years ago when it adopted H2O’s open source offerings, which introduced XGBoost, deep learning, and ensemble approaches to developing predictive models that were implemented as portable POJOs and MOJOs. By this time, the company had expanded its analytics bailiwick to include things like propensity to purchase, customer churn prediction, and customer segmentation.
H2O provided an immediate performance boost, Diaz says. “H2O surprised us [with] the speed to train models. Using R in training random forest, it could take hours. But with H2O, that takes just minutes,” he says. “In the data science process, time is money. If you can build a model faster, you can do more experiments.”
The next chapter of Vision Banco’s analytics story began just three months ago, when Diaz took delivery of an IBM Power AC922 server equipped with Power9 CPUs and Nvidia Tesla V100 GPUs. The powerful new rig was paired with H2O’s Driverless AI, the company’s latest data science platform. Not all machine learning libraries run on IBM’s 64-bit RISC processor, but Driverless AI does.
“It’s a great hardware with a great software,” Diaz says. “I modeled a propensity to buy for credit card offers for the call center, and we doubled the response with previous [models]. That was a good result.”
Diaz, who’s part of a team of seven analysts and data scientists at Vision Banco, also entered an Analytics Vidhya data science competition using the Driverless AI software, and finished eighth, behind seven Kaggle Grandmasters. That finish surprised Diaz, who wasn’t expecting to do so well.
“As a data scientist, it [Driverless AI] makes my job easy,” he says. “It’s easy to deploy too. Sometimes people forget the importance of deploying.”
Bharath Sudharsan had an entirely different reason for selecting Driverless AI at Armada Health, a healthcare startup that helps to match patients with the right doctors.
“Because of the feature engineering,” he tells Datanami. “The open source version doesn’t provide the feature engineering. It provides model optimization, so it does parametric optimization across different models and different algorithms. But then you still have to do the feature engineering in the first place. With Driverless AI, it lets us get the feature engineering out of the box.”
Armada Health juggles a couple of hundred variables across its databases, which contain data about one million patients and one million doctors. The company analyzes the needs of each individual, crunches health outcome data from the Centers for Medicare & Medicaid Services (CMS), and then does its best to match patient needs to doctors who are specialists in their field.
“We are in the business of connecting physicians with patients, so it’s very important for us to grasp as rich a profile of a physician as possible so we can make the right recommendation,” Sudharsan says. “We perform a multi-level physician evaluation, and based off of that, we pick physicians who are not just in that area who are available, but who are also really good at what they do.”
If Armada had to do that work manually, it would take years, Sudharsan says. But that doesn’t mean the company lets the algorithms drive all by themselves.
“We use a hybrid approach, not full-on machine learning,” he says. “We use [Driverless AI] as a way of understanding certain aspects, of connecting the dots about a physician and the characteristic which otherwise we may not [be] thinking about or is not obvious.”
Driverless AI is used to enrich the data before it’s fed into a recommendation engine built using other software. The results are then overseen by a human worker, who works with the patient to choose the physician.
Armada conducts its predictive work on Microsoft Azure, and it uses an array of other technologies, including Apache Spark and Scikit-learn. Ultimately, the company opted to license the proprietary Driverless AI offering to power a good portion of its data science work.
“It’s a balance between which one has been tried and tested, but also would make things easier for my team,” he says, “as opposed to starting from scratch and trying to figure out what is the best way to do something.”