When Big Data Isn’t Enough
The big data paradigm has changed how we make decisions. Armed with sophisticated machine learning and deep learning algorithms that can identify correlations hidden within huge data sets, big data has given us a powerful new tool to predict the future with uncanny accuracy and disrupt entire industries. But what if data alone isn’t enough? What if some decisions can’t be based just on data?
That may sound like a heresy to people who have devoted themselves to the religion of data, to the business leaders who have declared their allegiance to making data-driven decisions. Sure, there may be obstacles to overcome, such as data cleanliness, governance, and security concerns. But armed with enough data, one can solve any challenge, they say. Just look at the results. Right?
While a big data approach does work in many instances, there are some cases where it will fail to deliver solution. That’s the declaration made by Michel Morvan, the co-founder and CEO of The CoSMo Company, a 65-person French-American technology firm with expertise in modeling complex systems for clients in the electric distribution, transportation, and pharmaceutical industries.
“This machine learning approach is extremely efficient for some domains in some situations,” Morvan tells Datanami. “But in other cases, there are some limits to this approach.”
The first limit is that big data is designed to predict phenomena that have happened before. That limits its usefulness for predicting unusual events that we know can happen but which are not expressed in the data.
“The machine learning model is going to look at all the data in the past, and it’s very sophisticated at finding correlations in these data,” Morvan says. “Then you can say, hey I find a correlation between this and that. If I assume this will happen in 10 years, then I can deduce that that will also happen. That’s the way I’m going to predict with a big data approach.”
The second limitation of the big data approach is that it’s largely a black box. The algorithms will find many correlations and use them to make predictions about what will happen in the future. But it’s often not clear how the algorithms came to the conclusions that they did. There is a mathematical basis for the prediction, but the function may be missing key components that impact the actual outcome in the real world.
“You cannot explain the why,” Morvan says. “In some cases, this is OK. But in other cases…you really need to explain why they’re going to happen that way.”
At CoSMo (which stands for “complex systems modeling”), Morvan and his colleagues seek to answer the “why” before attempting to answer the “what.” This involves creating complex models that accurately reproduce the forces and dynamics of the system CoSMo’s clients want to model in the real world. In short, it’s seeking to find the root source of an event—the underlying cause—as opposed to merely exploiting correlation.
This train of thought will be familiar to those in the high performance computing (HPC) world, where scientific phenomenon are broken down into their core physical, biological, or chemical parts, and reproduced (to the best of the programmers’ ability) in a computer model. This model is then used to run simulations on a supercomputer that are the bases for predictions. If done correctly, this technique gets you answers with greater fidelity and accuracy than one can get by using the big data approach that relies on correlations, which may be simultaneously useful and unexplained.
According to Morvan—who was a university professor in France and New Mexico before moving into private sector—finding the root cause is critical for making predictions in complex systems. While in academia, Morvan used his expertise in complex systems to help biologists answer one of the most elusive challenges, which was to figure out where the shape of the human body came from.
“The biologists had a huge amount of data beginning in 2000,” he says. “They wanted to solve all these problems by just looking at data, and it was not possible. It’s impossible to understand the shape [of the human hand] by just looking at data.”
A similar dilemma is faced with CoSMo clients who want to model other complex systems, such as the electrical grid and transportation networks. But finding the root causes of why complex systems behave like they do is also harder, and has a lower margin of error than the big data approach.
“That’s why, when we create the first application for the first given problem we want to solve, we spend a couple of months with the expert to make sure, piece by piece, that what we’re building, and the knowledge we’re extracting form the expert, is good and what we have makes sense,” he says.
Models, Then Data
Of course, CoSMo uses data to make predictions. But before loading any real-world data into its models, the company attempts to replicate the underlying phenomena in code. The company has developed a modeling language, called CoSML, that it uses to model complex systems.
One customer engagement had CoSMo modeling the electrical grid with the goal of limiting power outages. In this situation, the model had to incorporate many physical variables, including the capacity of transmission lines, the longevity of transformers, the availability of skilled workers, and the costs of all of the above.
Getting these variables working in a single model isn’t easy, Morvan says. “What we’re very good at is being able to put all of that together, and run all of these models together,” he says, “and that’s why we have these results that nobody else can have on these projects.”
Only after the model has been developed and the key variables defined—which can take months or more to get right—does CoSMo get to the actual data. The tools used at this point in the game will be familiar to those in the big data world, including ETL tools for transforming and loading the data, and Apache Spark to run the model in a distributed manner.
“One thing that’s very important for us is to be able to run multiple parallelized simulations when we want to do thousands of simulations for knowing how to find an optimal solution,” Morvan says. “So we need to have an environment that can be parallel, and the technology behind it is Spark.”
While big utilities with multi-billion-dollar budgets have a plethora of BI and analytical tools on hand, those tools generally lack the capability to help users make long-range predictions about things like the state of electrical grids.
Complex systems like the Western Interconnection cost many billions of dollars per year to maintain, and strategic planning is critical to ensuring that the right equipment and the right workers are on hand to deal with expected failures, as well as unexpected ones. A failure of planning can leave grid operators in a budgetary bind and increase the odds of cascading failures bringing down the entire system.
CoSMo’s clients regularly simulate what the grid will look like in two, five, or 10 years—even up to 40 years out in some situations. This is not something that a traditional big data/machine learning approaches can tell you.
“You may say ‘OK for this kind of equipment I’m going to wait two years before replacing it, or for this kind of equipment, I’m going to maintain it and only replace them [when they fail],'” he says. “But before you work with us, you have absolutely no way to know what will be the impact of the decision you’re going to make. You don’t want California to be in the dark because there’s no more electricity.”