DataRobot Delivers an ML Automation Boost for Evariant
Companies in all industries face an acute shortage of data scientists, those digital alchemists who turn raw data into gold. But for the healthcare software company Evariant, the decision to use DataRobot to automate its machine learning operations was a force multiplier for its data science team.
Evariant develops a healthcare-focused CRM (customer relationship management) system that’s used by large hospitals and hospital chains across the country, such as Scripps Health, Dignity Health, and Wake Forest Baptist Health. The Connecticut company helps hospitals optimize marketing and outreach efforts by analyzing a host of medical data to generate predictions about what patients might do next, or what treatments they may need next.
Evariant’s propensity modeling business has increased in recent years. Hospitals found the company a reliable source of actionable information about their clients, their specific conditions, and possible future medical treatments, all sufficiently anonymized for HIPAA compliance. For example, if somebody’s records show he might be a candidate for Type 2 Diabetes screening, or might need help to quit smoking, Evariant can detect that signal using machine learning models that run against the latest data. No data science expertise is required by the hospitals, and marketing professionals can access the system through a Salesforce interface.
At first, the company’s data scientists mostly hand-coded the algorithms and used manual methods to determine what was working well and what needed more tweaking. At runtime, the models would run on a small Hadoop cluster to score fresh data. The process was complicated by the fact that individual models were developed for each customer.
“It can be pretty expensive,” says Kent Mitchell, an engineering partner at Evariant. “To develop and build one model, it might take a developer a few days. If you want to test 10 or 15 different models, it might take you a month to develop these different models and build them and test them.”
Machine Learning Automation
Several years back, the company looked for ways to automate this process. Evariant analyzed the options on the market, and ultimately selected DataRobot, a machine learning automation firm founded five years ago by a group of data scientists who were successful on Kaggle.
The DataRobot product includes hundreds of open source algorithms developed in R, H2O, Vowpal Wabbit, scikit-learn, Spark ML, and others. The software automatically picks which algorithm, out of the hundreds available, works best against a customer’s specific data. After automatically tuning the algorithm, the DataRobot software can automatically deploy the machine learning job as a Python-based Spark job to run on Hadoop, Linux, or other big data clusters.
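The general idea of automated model selection can be illustrated with a minimal sketch. This is not DataRobot's implementation or API; the synthetic data, the three toy candidate models, and every function name below are assumptions made purely for illustration. The sketch trains each candidate on the same training split, scores it on a holdout split, and ranks the results on a leaderboard.

```python
import random
from statistics import mean


def make_data(n=200, seed=0):
    """Synthetic rows of (feature, label); the label is 1 when feature > 6."""
    rng = random.Random(seed)
    rows = [(x, 1 if x > 6 else 0) for x in (rng.uniform(0, 10) for _ in range(n))]
    rng.shuffle(rows)
    return rows


def fit_majority(train):
    """Baseline candidate: always predict the most common training label."""
    majority = 1 if mean(y for _, y in train) > 0.5 else 0
    return lambda x: majority


def fit_threshold(train):
    """Candidate: cut halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    cut = (mean(pos) + mean(neg)) / 2
    return lambda x: 1 if x > cut else 0


def fit_nearest_neighbor(train):
    """Candidate: copy the label of the closest training row."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]


def accuracy(predict, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return mean(1 if predict(x) == y else 0 for x, y in rows)


def leaderboard(candidates, train, holdout):
    """Fit every candidate on train, score on holdout, best first."""
    scored = [(name, accuracy(fit(train), holdout)) for name, fit in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical candidate pool; a real system would search far more algorithms.
CANDIDATES = {
    "majority": fit_majority,
    "threshold": fit_threshold,
    "nearest_neighbor": fit_nearest_neighbor,
}

rows = make_data()
train, holdout = rows[:150], rows[150:]
board = leaderboard(CANDIDATES, train, holdout)
best_name, best_score = board[0]
print(board)
```

The point of the sketch is the loop itself: once candidates share a common fit-and-score interface, comparing dozens or hundreds of them becomes a mechanical step rather than days of hand coding.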
DataRobot does many of the rote tasks that would otherwise be performed by a data scientist, says Razi Raziuddin, vice president of strategic business development for DataRobot. “It’s taking a lot of the steps that really good data scientists go through in order to build a very accurate model,” he tells Datanami. “DataRobot automates a lot of that and simplifies it, even for non-data scientists and business users.”
DataRobot paid immediate dividends for Evariant, according to Mitchell. “DataRobot gives us a very good way to quickly run a bunch of different kinds of models,” he says. “It can do this much, much quicker than a person can.”
Instead of creating and testing dozens of models manually, Evariant’s data scientists let DataRobot do the drudge work while they concentrate on more value-added activities.
“It allows our guys to focus in and spend their time analyzing results and a lot less time on actually sitting down and going through the mechanics of setting up and running different models,” Mitchell says. “Our guys can literally test hundreds of models in a day. They can basically take the data, do a little bit of work with the DataRobot user interface, run across hundreds of different models, get the results back, look at the results, and say ‘This one does statistically better here’ or ‘This one does statistically better there.’”
Runs as Spark on Hadoop
The models generated by DataRobot are exported into Evariant’s production Hadoop cluster, which is Cloudera’s Distribution of Hadoop (CDH) running on bare-metal servers in a SoftLayer data center. Mitchell says the integration between DataRobot and Cloudera Manager makes it easy to upload jobs to CDH (however, once the jobs are loaded, the benefits of integration end). And because the Spark jobs generated by DataRobot are YARN-aware, they play nicely with Evariant’s various other Hadoop jobs, including those using the MapReduce, Hive, Impala, and Spark engines.
Evariant bases its predictions on tens of billions of data points gleaned from billions of medical records. That includes 240 million patient records (covering almost every American), each with upwards of 500 different fields. It also has 1 million physician records, 6 billion medical claims, and about 1 billion medical encounter records. Most of the data is updated monthly, with the exception of HL7 medical encounter records, which are updated at the rate of about 10,000 per day.
Trying to find out which data points are predictive would be a difficult and time-consuming process if Evariant had to do it manually. But DataRobot gives the company a big leg-up on the process through the power of automation.
“It can actually understand which fields might be most statistically significant,” Mitchell says. “Sometimes you’ll put in 20 fields, thinking they’re all important. But it turns out 89% of the value comes from three fields…You may go from 89% to 91% accuracy [by processing all 20 fields], but it takes 10 times longer to process. You have to make an assessment if the speed is more important than the accuracy, depending on what the model is being used for.”
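The speed-versus-accuracy assessment Mitchell describes can be written down as a simple selection rule. The sketch below is an assumption-laden illustration, not Evariant's or DataRobot's procedure: the `Candidate` class and `pick_model` function are hypothetical, and the figures are the round numbers from the quote (89% versus 91% accuracy, with the larger field set taking about 10 times longer).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float          # holdout accuracy, between 0 and 1
    relative_runtime: float  # runtime relative to the fastest option


def pick_model(candidates, min_accuracy, max_runtime):
    """Return the most accurate candidate that meets both limits, or None."""
    eligible = [
        c for c in candidates
        if c.accuracy >= min_accuracy and c.relative_runtime <= max_runtime
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.accuracy)


# Illustrative figures taken from the quote above, not measured results.
CANDIDATES = [
    Candidate("3 fields", accuracy=0.89, relative_runtime=1.0),
    Candidate("20 fields", accuracy=0.91, relative_runtime=10.0),
]

# A job that must finish quickly accepts the smaller field set.
fast_choice = pick_model(CANDIDATES, min_accuracy=0.85, max_runtime=2.0)
# A job that needs at least 90% accuracy pays the runtime cost.
accurate_choice = pick_model(CANDIDATES, min_accuracy=0.90, max_runtime=20.0)
print(fast_choice.name, accurate_choice.name)
```

Which limits to set depends, as Mitchell notes, on what the model is being used for.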
One of the unexpected benefits of using DataRobot was that it allowed Evariant to consolidate its modeling. Instead of running separate models for each customer, it now regularly runs a single set of propensity models that covers all Americans, and then filters out the specific results to meet client needs.
“We’re now scoring 240 million people across 100 propensities and 15 different models,” Mitchell says. “It allowed us to simplify our overall infrastructure.”
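The consolidated approach, scoring everyone once and then filtering per client, can be sketched in a few lines. Everything here is hypothetical: the two toy scoring functions stand in for real propensity models and have no clinical validity, and the record fields, patient IDs, and function names are invented for illustration.

```python
def diabetes_screening(person):
    """Toy propensity score (hypothetical rule, not a clinical model)."""
    score = 0.2
    if person["age"] >= 45:
        score += 0.3
    if person["bmi"] >= 30:
        score += 0.4
    return round(score, 2)


def smoking_cessation(person):
    """Toy propensity score (hypothetical rule, not a clinical model)."""
    return 0.8 if person["smoker"] else 0.05


MODELS = {
    "diabetes_screening": diabetes_screening,
    "smoking_cessation": smoking_cessation,
}

# Made-up population records.
PEOPLE = [
    {"id": "p1", "age": 52, "bmi": 31.0, "smoker": False},
    {"id": "p2", "age": 30, "bmi": 22.0, "smoker": True},
    {"id": "p3", "age": 61, "bmi": 27.0, "smoker": True},
]


def score_population(people, models):
    """Score every person once with every model."""
    return {p["id"]: {name: model(p) for name, model in models.items()} for p in people}


def results_for_client(scores, client_patient_ids, propensity, threshold):
    """Filter the shared scores down to one client's patients."""
    return sorted(
        pid for pid in client_patient_ids
        if pid in scores and scores[pid][propensity] >= threshold
    )


scores = score_population(PEOPLE, MODELS)
hospital_a = results_for_client(scores, {"p1", "p3"}, "diabetes_screening", 0.5)
hospital_b = results_for_client(scores, {"p2", "p3", "p9"}, "smoking_cessation", 0.5)
print(hospital_a, hospital_b)
```

The scoring pass runs once regardless of how many clients there are; adding a client only adds another cheap filter over the shared results.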
Armed with predictions from Evariant, hospitals around the country are able to anticipate their clients’ needs with a greater degree of accuracy, which helps to cut costs while providing better medical care. Thanks to the automated machine learning from DataRobot, Evariant is able to continually improve its offerings.
“At the end of the day, this allows our team of data scientists to be two people,” Mitchell says. “If we didn’t have it, we’d have at least three to four times as many people working in the data science team, and we’d probably have a supporting team of at least four to five programmers. It really is a pretty significant benefit.”