The Next Data Revolution: Intelligent Real-Time Decisions
Over the past decade, big data analysis and applications have revolutionized practices in business and science. They have enabled new businesses (e.g., Facebook, Netflix), disrupted existing industries (e.g., Airbnb, Uber), and accelerated scientific discovery (e.g., genomics, astronomy, biology).
Today, we are seeing glimpses of the next revolution in data and computation, driven by three trends.
First, there is a rapidly growing segment of the economy (e.g., Apple, Facebook, GE) that collects vast amounts of consumer and industrial information and uses this information to provide new services. This trend is spreading widely via the increasing ubiquity of networked sensors in devices like cell phones, thermostats and cars.
Second, recent advances in deep neural networks, reinforcement learning, and big data machine learning systems have unlocked remarkable AI capabilities, ranging from visual perception and superhuman game playing to reducing power consumption in datacenters and learning complex locomotion tasks.
Third, a growing number of devices such as security systems, drones, and self-driving cars are autonomously or semi-autonomously taking action in the physical world.
These trends point to a future in which computing infrastructure senses the world around us, ingests information, analyzes it, and makes intelligent decisions in real time on live data. These abilities can fundamentally improve how both humans and machines interact with the world and each other, while raising critical new issues in security and privacy.
RISELab Rises to the Future Challenges
These are the challenges we have set out to address at RISELab, a new five-year lab we started at UC Berkeley. RISELab follows AMPLab, the home of many successful open source projects, such as Apache Spark, Apache Mesos, and Alluxio.
There are currently few mature and widely-used examples of real-time decision-making on live data. Two examples stand out, as they helped create enormously successful industries: high-frequency trading (HFT) and real-time targeted advertising.
HFT is now a key component of today’s financial markets, responsible for billion-dollar trading decisions daily. On the advertising side, relatively little public information exists about the performance of these systems, but there has clearly been a continuous drive toward real-time ad targeting and bidding with sub-second latencies. Both examples are custom-built for their virtual environments, and they hint at an even higher-impact future reaching into the physical world.
The combination of real-time intelligent decisions on live data with sensing and actuation will enable new categories of applications, such as real-time defense against Internet attacks, coordinating fleets of airborne vehicles, robot assistants for the home, and many others. These applications are all data-hungry and require real-time, intelligent, secure decision-making systems, with techniques for sharing data that preserve confidentiality and privacy, and provide robustness to attacks and security breaches.
Attributes of Intelligent Real-Time Decisions
Next, we discuss in more detail the desirable attributes of a decision system that enables the above applications:
Intelligent: Decisions that take place in uncertain environments and are capable of adapting to context and feedback are inherently non-trivial. Examples of such decisions are detecting attacks on the Internet, coordinating a fleet of flying vehicles, or protecting the home. One promising approach to implementing intelligent decisions is reinforcement learning, which has recently been used with great success in applications ranging from beating the Go world champion to robotics.
Real-time: Real-time refers not only to how fast decisions are rendered, but also to how fast changes in the environment are incorporated into the decision process. For instance, in the case of intrusion detection, we would like to create an accurate model of the attack in seconds, then identify the offending streams and drop their packets. This is hard, because there is typically a tradeoff between how fast you can train a model on fresh information and how fast you can render decisions.
In general, the more of the decision process we materialize in advance, the faster the decision. At one extreme, one can pre-compute all possible decisions. This minimizes decision latency at the cost of model update latency. At the other extreme, one can simply log the raw input data when it arrives and do all of the necessary computation at the time of decision, which minimizes update latency at the cost of decision latency.
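The two extremes can be sketched in a few lines of Python. This is a minimal illustration, not a description of any RISELab system: the class names `EagerDetector` and `LazyDetector`, the packet-count threshold, and the IP addresses are all hypothetical, and a simple per-source counter stands in for a real attack model.

```python
from collections import defaultdict


class EagerDetector:
    """Materializes the decision in advance, on every update (hypothetical)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = defaultdict(int)
        self.offenders = set()  # the pre-computed decisions

    def update(self, source):
        # More work per update: the decision is refreshed immediately.
        self.counts[source] += 1
        if self.counts[source] > self.threshold:
            self.offenders.add(source)

    def should_drop(self, source):
        # The decision itself is a single set lookup.
        return source in self.offenders


class LazyDetector:
    """Logs raw input and defers all computation to decision time (hypothetical)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.log = []

    def update(self, source):
        # Cheapest possible update: append the raw record.
        self.log.append(source)

    def should_drop(self, source):
        # The decision scans the whole log, so it slows down as the log grows.
        return sum(1 for s in self.log if s == source) > self.threshold


# Made-up traffic: one noisy source and one quiet source.
packets = ["10.0.0.1"] * 5 + ["10.0.0.2"] * 2
eager, lazy = EagerDetector(threshold=3), LazyDetector(threshold=3)
for src in packets:
    eager.update(src)
    lazy.update(src)

for src in ("10.0.0.1", "10.0.0.2"):
    print(src, eager.should_drop(src), lazy.should_drop(src))
```

Both detectors reach the same decisions; they differ only in where the work happens. The eager version pays on each update so that a decision is a lookup, while the lazy version makes updates nearly free and pays at decision time.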
One research challenge is to explore this tradeoff in greater detail and identify mechanisms to trade off these two latencies dynamically. As latencies go below human reaction times, decisions must be automated. Without humans in the decision loop, we need to make sure that these decisions are robust, explainable, and secure.
Robust: Robust decisions work well in the presence of complex noise, unforeseen inputs, and system failures. For example, a system coordinating a fleet of airborne vehicles will have to deal with noisy inputs from the vehicles’ sensors (e.g., a blurry video feed during heavy rain). As another example, consider an application that aims to detect Internet attacks (e.g., viruses, worms). Since these attacks are continuously evolving, such an application will have to deal with previously unseen attacks.
Explainable: When an automated decision is not obvious, people naturally want to know what led to it. For example, why was a mortgage application declined? Similarly, why did an algorithm diagnose a patient with cervical spine instability based on her X-ray? Explainability is so important that today, many organizations choose to trade off accuracy for explainability by deploying simpler algorithms whose outputs are easy to explain (e.g., decision trees) instead of more accurate but less explainable ones. This problem is exacerbated by the popularity and successes of deep learning (DL) in domains such as self-driving vehicles and fraud detection; DL systems are hard to interpret, which makes their decisions even harder to explain. Explainability is related to but different from interpretability: explainability answers “why” a particular decision has been made, while interpretability answers “how” a particular algorithm arrived at a decision.
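To make the appeal of simple models concrete, here is a minimal sketch of a hand-written decision tree for the mortgage example that returns the path it took alongside its decision. Everything in it is an assumption made for illustration: the `Application` fields, the `decide` function, and all thresholds are hypothetical and do not reflect any real lending policy.

```python
from dataclasses import dataclass


@dataclass
class Application:
    credit_score: int
    debt_to_income: float  # monthly debt divided by monthly income
    years_employed: float


def decide(app):
    """Return (approved, reasons) from a tiny hand-written decision tree.

    All thresholds are hypothetical and chosen only for illustration.
    """
    if app.credit_score < 620:
        return False, [f"credit_score {app.credit_score} < 620"]
    reasons = [f"credit_score {app.credit_score} >= 620"]

    if app.debt_to_income > 0.43:
        reasons.append(f"debt_to_income {app.debt_to_income:.2f} > 0.43")
        return False, reasons
    reasons.append(f"debt_to_income {app.debt_to_income:.2f} <= 0.43")

    if app.years_employed < 2:
        reasons.append(f"years_employed {app.years_employed} < 2")
        return False, reasons
    reasons.append(f"years_employed {app.years_employed} >= 2")
    return True, reasons


# A made-up applicant with good credit but a high debt-to-income ratio.
applicant = Application(credit_score=700, debt_to_income=0.5, years_employed=4)
approved, reasons = decide(applicant)
print("approved" if approved else "declined", "because:", "; ".join(reasons))
```

The list of tests along the path is the explanation, and the last entry is the reason the application was declined. A deep network offers no comparably direct account of its output, which is the tradeoff described above.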
Secure: As companies such as Google and Facebook have demonstrated, there is tremendous value in leveraging users’ information to make targeted decisions. Furthermore, it is increasingly desirable to combine data from multiple organizations to provide novel services in financial, insurance, and healthcare markets.
However, leveraging a user’s personal information or an organization’s data is increasingly fraught, even when it should be mutually desirable. People have heightened awareness of the risks of disclosing personal information, and governments have enacted stricter regulatory constraints (e.g., data collected in a country cannot leave its borders). These concerns are further exacerbated by security breaches that have already accessed vast amounts of private or confidential data.
To address these security challenges, we need to develop new algorithms that can provide contextual decisions while guaranteeing users’ privacy and data confidentiality. Such strong security guarantees will lower the barrier for users and organizations to let their data be used in return for better decisions. In addition, many applications and services are being deployed in public clouds, such as Amazon Web Services, Microsoft Azure, or Google Cloud Platform. As such, providing both data and computation integrity is critical to protecting and securing these services from malicious employees of cloud providers, tenants who share the same cloud infrastructure, or external hackers.
While ensuring these security properties is difficult, the real challenge is doing so while preserving the functionality and the performance of these applications.
Goals and Early Results
To address these challenges and enable these applications, we need a new generation of systems, tools, and algorithms that far surpass the capabilities of the existing ones. Furthermore, we need these tools to be open and approachable for a wide range of creative application developers, similar to how big data analytics tools, such as Apache Hadoop and Apache Spark, are available today.
The goal of RISELab is to build such open source tools and platforms to enable virtually any developer to build sophisticated decision-making and predictive analytics applications that can fundamentally change the way we interact with our world, harnessing the increasingly fine-grained and real-time sensory data collected by individuals and organizations alike. While RISELab is just a few months old, we are already working on several projects:
- Drizzle, a low-latency execution engine for Apache Spark targeting stream processing and iterative workloads. Drizzle improves the latency of Spark Streaming by 10x, bringing it on par with specialized streaming frameworks such as Apache Flink. Work to integrate Drizzle into Apache Spark has already started;
- Clipper, a low-latency prediction serving system. Clipper employs a modular architecture to simplify model deployment across various ML frameworks, such as Spark’s MLlib, TensorFlow, and scikit-learn. At the same time, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks;
- Opaque, a new system to securely process Spark SQL workloads. Opaque provides strong security by protecting even against attackers who have compromised the operating system or hypervisor. Furthermore, Opaque provides an “oblivious” mode that protects against access-pattern leakage, in which an attacker extracts information just by observing a query’s access pattern over the network or to memory, even when the traffic is encrypted;
- Ray, a new distributed framework targeting reinforcement learning and other large-scale ML applications. Ray aims to support execution graphs with fine-grained dependencies and to efficiently schedule tasks with millisecond-level latencies.
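To give a feel for what an execution graph with fine-grained dependencies looks like, here is a minimal single-machine sketch built only on Python's standard `concurrent.futures` module. It is not Ray's API and says nothing about how Ray is implemented: the `submit` helper and the functions `read_sensor`, `mean`, and `choose_action` are hypothetical names invented for this illustration, and the sensor readings are made up.

```python
from concurrent.futures import Future, ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)


def submit(fn, *args):
    """Schedule fn on the pool; any Future argument is a dependency (hypothetical helper)."""

    def run():
        # Block until upstream tasks finish, then pass their results along.
        resolved = [a.result() if isinstance(a, Future) else a for a in args]
        return fn(*resolved)

    return pool.submit(run)


def read_sensor(sensor_id):
    # Stand-in for real sensor readings (made-up values).
    return [sensor_id * 10 + i for i in range(3)]


def mean(values):
    return sum(values) / len(values)


def choose_action(left, right):
    return "turn_left" if left > right else "turn_right"


# Build the graph: two independent read -> mean branches feed one decision task.
left = submit(mean, submit(read_sensor, 1))
right = submit(mean, submit(read_sensor, 2))
action = submit(choose_action, left, right)

print(action.result())
pool.shutdown()
```

Each call to `submit` returns immediately with a future, so the two branches can run in parallel and the final task starts as soon as both of its inputs are ready. Tasks here are submitted in dependency order, which keeps a bounded thread pool from deadlocking; a distributed scheduler has to provide the same behavior across many machines and at much lower per-task overhead.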
About the author: Ion Stoica is a computer science professor at UC Berkeley and a director of the RISELab. Formerly, Stoica was a director of the AMPLab. Stoica is also a co-founder and executive chairman of Databricks, and is one of Datanami’s People to Watch for 2017.