Can You Trust Your Algorithms?
Algorithms are critical to how we interact with data. And as the volume and variety of data increase, so does our reliance on algorithms to give us the answers we seek. But how much faith should you put in those algorithms, and how can you be sure they’re not misleading you? They’re not simple questions, but through the use of algorithmic differentiation techniques, data scientists can get more precise answers.
Algorithmic differentiation, sometimes called automatic differentiation, is a technique for computing the derivatives of a program’s outputs with respect to its inputs. Those derivatives can be used to ascertain the precision of an algorithm on a given data set and to determine the algorithm’s susceptibility to data volatility. The concepts behind AD originated decades ago in the geology and meteorology fields, and were used to help boost the effectiveness of HPC codes used to predict the weather or tell energy companies where to drill for oil.
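To make the idea concrete, here is a minimal sketch of forward-mode AD using dual numbers in plain Python. It is an illustration only, not NAG's tooling: the `Dual` class, the `_lift` helper, and the `model` function are all hypothetical names invented for this example. Each number carries its derivative along with its value, so running the model once yields both the answer and its sensitivity to the input.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative with respect to one chosen input."""
    value: float
    deriv: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants, whose derivative is zero."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def sin(x):
    # Chain rule: sin(u)' = cos(u) * u'
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


def model(x):
    """Hypothetical model, chosen only for illustration: f(x) = x*sin(x) + 3x."""
    return x * sin(x) + 3 * x


# Seed the input with derivative 1.0 to ask: how does the output move with x?
result = model(Dual(2.0, 1.0))
print("f(2)  =", result.value)
print("f'(2) =", result.deriv)
```

The derivative comes out exact to floating-point precision, with no step size to tune, which is the main practical difference from estimating it by nudging the input and re-running the model.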
AD has proved its value in a variety of use cases where the accuracy of data is critical to the achievement of goals. If the results of an AD test show that a data model breaks down when presented with real-life data inputs, then the owner may scrap the model and start over. Conversely, if AD shows that a model works even with dirtier data, then the owner may be able to save money by dialing down the precision of data collection, such as with a sensor on a weather satellite.
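The sensor trade-off can be sketched with first-order error propagation: once the sensitivity of the output to an input is known, an acceptable output error translates into an acceptable input error. The example below is a rough sketch under stated assumptions. The `forecast` model, its coefficients, and the 0.5 tolerance are hypothetical, and the derivative is written out by hand in the forward-mode pattern that an AD tool would generate automatically.

```python
import math


def forecast(temp_c, d_temp=1.0):
    """Hypothetical model returning (value, derivative w.r.t. temp_c).

    The derivative is carried alongside the value, forward-mode style:
    each step applies the chain rule to the running derivative.
    """
    # Step 1: u = 0.1 * temp_c
    u, du = 0.1 * temp_c, 0.1 * d_temp
    # Step 2: v = exp(u)
    v, dv = math.exp(u), math.exp(u) * du
    # Step 3: y = 5 * v + temp_c
    y, dy = 5.0 * v + temp_c, 5.0 * dv + d_temp
    return y, dy


def tolerable_input_error(sensitivity, output_tolerance):
    """First-order estimate: |dy| ~= |dy/dx| * |dx|, solved for |dx|."""
    if sensitivity == 0.0:
        return math.inf  # to first order, the output ignores this input
    return output_tolerance / abs(sensitivity)


value, sensitivity = forecast(20.0)
max_sensor_error = tolerable_input_error(sensitivity, output_tolerance=0.5)
print(f"output={value:.3f}, dy/dx={sensitivity:.3f}, "
      f"tolerable sensor error is about {max_sensor_error:.3f}")
```

The estimate is only valid near the operating point and for small errors; a strongly nonlinear model would need the check repeated across the range of inputs it is expected to see.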
Today, AD is used in a variety of industries where HPC is prevalent, including aerospace and auto manufacturing, where it’s used to optimize the algorithms that determine shapes of wings and car bodies, and in finance, where it’s used to fine-tune the algorithms that compose option pricing models, for example.
But as the big data analytics phenomenon drives forward and smaller outfits start experimenting with data mining, AD proponents are concerned that some of the hard-won lessons of AD are not trickling down as quickly as they might.
People and organizations who don’t understand the mathematical fundamentals at work behind the algorithms are at risk of placing too much faith in the accuracy of the algorithms and the answers they generate, according to Uwe Naumann, a Professor of Computer Science at RWTH Aachen University in Germany and an AD expert for the Numerical Algorithms Group (NAG).
“Our understanding in terms of modeling the world is still reasonably limited,” Naumann tells Datanami. “Just storing a whole lot of data and mining it and getting some sort of answer to whatever question I’m asking out of the data using statistical analysis has a very strong random component to it, I believe.”
A lot depends on the data, including when it was measured, by whom, and with what accuracy. “It also depends on the algorithms you use to mine the data,” he says. “Yes of course we can get patterns and yes of course there are many case studies where the patterns really buy you something. But optimizing and calibrating these models to certain situations is, for the foreseeable future, going to be the central component. Without algorithmic differentiation, it’s going to be a major pain.”
Failure to abide by the laws of mathematics could leave some big data projects susceptible to the dreaded random factor. If there’s one thing that no chief executive wants to hear, it’s that his $5-million data mining factory just turned into an expensive random number generator.
“Big data is not just about storing and processing data and moving it from A to B,” Naumann says. “What I think is still important, no matter how much data you can keep around, you should still have a pretty good understanding of the significance of data, and significance means sensitivities of what you’re trying to do with the data, with respect to the data or potentially with respect to how this data was retained. This is something that still requires sophisticated mathematical modeling and simulation and potentially AD as a technique.”
During the recent ISC conference in Germany, NAG announced an expansion of its AD service. As part of the service, NAG will evaluate a customer’s algorithms and tell the customer what sorts of sensitivities those algorithms have to data, which can be quite useful for parameter calibration. Access to source code is preferable, or at least access to the algorithm or model developers. NAG will also train developers to perform their own AD testing, which is useful as part of ongoing regression testing.
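As a rough illustration of why sensitivities help with parameter calibration, the sketch below fits a single parameter by gradient descent, using the derivative of the fitting error with respect to that parameter. Everything in it is assumed for the example: the decay model, the synthetic observations, the step size, and the function names are hypothetical and do not describe NAG's service. The derivative is propagated by hand, line by line, in the way a forward-mode AD tool would do automatically.

```python
import math

# Synthetic observations from a hypothetical decay model y = exp(-k * t)
# with k = 0.7; illustrative values only, not real measurements.
TRUE_K = 0.7
TIMES = [0.5, 1.0, 2.0]
OBSERVED = [math.exp(-TRUE_K * t) for t in TIMES]


def loss_and_sensitivity(k):
    """Return the squared-error loss and its derivative with respect to k.

    The derivative is propagated next to the value, statement by statement.
    """
    loss, d_loss = 0.0, 0.0
    for t, y in zip(TIMES, OBSERVED):
        pred = math.exp(-k * t)
        d_pred = -t * pred           # d/dk exp(-k * t)
        resid = pred - y
        d_resid = d_pred             # the observation y does not depend on k
        loss += resid * resid
        d_loss += 2.0 * resid * d_resid
    return loss, d_loss


def calibrate(k, step=0.1, iterations=500):
    """Plain gradient descent on k, driven by the sensitivity d_loss/dk."""
    for _ in range(iterations):
        _, d_loss = loss_and_sensitivity(k)
        k -= step * d_loss
    return k


k_fit = calibrate(0.2)
print(f"calibrated k = {k_fit:.6f}")
```

The same `loss_and_sensitivity` values could be stored and compared between releases, which is one simple way sensitivities can feed a regression test.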
NAG has been in the AD business for decades, primarily with HPC clients. Now that we’re in the midst of a big data boom, the potential use cases for AD are expanding, and so is the pool of customers who can benefit from it.
“I see many people who should use it,” Naumann says. “Many of them realize it as we speak and get involved. But there are also plenty of people who have never heard of it. I’m surprised when I talk to people how many of them are surprised by what I’m talking about. It’s not that well-known yet.”
We’re in the midst of a transition phase, where organizations that are building big data mining solutions on newer technologies, like Apache Hadoop and Apache Spark, are re-learning some of the same lessons that HPC experts learned years ago. By using HPC techniques like AD, big data practitioners have the opportunity to jumpstart their projects, and potentially leapfrog their competitors.
But just getting the word out about AD has proved to be a big challenge too. “Being in this transition phase, we see more and more people who have only a vague understanding of what their software does, what the math behind the software is,” Naumann says. “They just want to solve the problem. They don’t care all that much how this is done or what is happening. For those people I would say they don’t understand enough. On the other hand, I would say if you could hide the [AD] methodology and just provide those people a number, or a red flag or green flag, that may work too.”
As big data infiltrates the world around us and the results of algorithms have a bigger impact on our lives, it behooves us to have at least a modicum of understanding of the math at play, or at least to check with somebody who does. Because if there’s one lesson we can learn in big data, it’s that not all algorithms are created equal.