People to Watch 2017
The Cal Berkeley computer science professor was a director at the AMPLab when one of his students came up with the revolutionary idea that would become Apache Spark. Now as the director of AMPLab’s successor, RISELab, Ion Stoica is hoping to guide the next generation of breakthroughs and discoveries around real-time analytics and security.
Datanami: Hi Ion. Congratulations on being named a Datanami Person to Watch in 2017! With the launch of RISELab in January, this is already a big year for you. What do you hope to accomplish in the year to come?
Ion Stoica: The goal of RISELab is to build open source platforms, tools and algorithms to support real-time decisions on live data with strong security. Following AMPLab, the “home” of Apache Spark, Apache Mesos, and Alluxio, RISELab has some big shoes to fill. As such, I hope that by the end of this year we can show the first tangible results.
We are already working on some interesting projects, including Drizzle, Opaque, Ray, Clipper, and Ground. Drizzle and Opaque significantly expand Apache Spark capabilities: Drizzle reduces the latency of streaming and ML algorithms by up to 10x, while Opaque gives SparkSQL workloads strong protection against a compromised OS/supervisor and against access-pattern leakage. Ray is a new cluster computing framework that targets reinforcement learning (RL) applications, Clipper is a flexible and general model serving system, and Ground is a metadata manager that captures the context in which data gets used and produced.
We hope that by the end of the year some of these projects will have been released as open source.
Datanami: What are some potential applications for the work RISELab is doing that you’re most excited to see?
While there are many directions I’m really excited about in the context of RISELab, let me list three of them.
- Build systems and tools for RL workloads. While recently there have been several highly successful applications of RL, such as playing games with super-human abilities (e.g., Atari games, Go) and robotics, the development of these applications is rather ad hoc due to the lack of adequate tools. Our work on Ray aims to address this situation by making RL applications much easier to develop and deploy.
- Build tools for model serving and inference. While model training has received considerable attention over the past decade, with a large number of systems and frameworks being built to train ML models (e.g., Apache Spark, TensorFlow, Caffe, MXNet), there has been little work in the area of model inference. With Clipper we aim to build a generic platform for model serving and inference that allows users to plug in models trained with different frameworks, and combine them to meet stringent performance and accuracy targets.
- Build algorithms and tools that can “learn on confidential data”. This is a new learning approach we are working on that would allow one organization to leverage datasets owned by other organizations or users while preserving the confidentiality of their data, or that would incentivize them to share data.
Datanami: What trends do you see as being most important for big data as we look to 2017 and beyond?
There are two major trends I’d like to focus on. The first is the transition from analyzing data to making decisions on the data. Several verticals, such as ad and financial markets, have already demonstrated the economic value of making decisions and taking actions on the data. I believe we will see more and more companies in more and more verticals making this transition. However, this will require new tools and algorithms to make decisions that are secure, robust, and explainable. These are a few challenges that we hope to address as part of RISELab.
The second trend is the transition from data silos to data ecosystems. So far, the same company has provided the service, collected data, analyzed the data, and used this data to implement new features and products (e.g., Google, Facebook, Twitter). While this model will continue to thrive, I expect a rapid growth of alternative businesses in which processing capabilities and data are owned by different organizations. This would require the development of new tools and systems, such as “learning on confidential data”, as noted above. This trend will enable more and more organizations to access high-quality datasets and leverage state-of-the-art decision systems.
Datanami: Outside of the professional sphere, what can you tell us about yourself – personal life, family, background, hobbies, etc.?
Well, there are not many things about me that people don’t know. Maybe that when I was starting high school I was seriously considering becoming a painter? But that was a long time ago. Other than this I like boring and unsurprising stuff, like reading, swimming, and watching F1 races and soccer games whenever I get time.
More about Ion Stoica:
Ion Stoica is a Professor in the EECS Department at the University of California at Berkeley. He does research on cloud computing and networked computer systems. Past work includes Dynamic Packet State, Chord DHT, Internet Indirection Infrastructure (i3), declarative networks, and large-scale systems, including Apache Spark, Apache Mesos, and Alluxio. He is an ACM Fellow and has received numerous awards, including the SIGOPS Hall of Fame (2015), the SIGCOMM Test of Time Award (2011), and the ACM doctoral dissertation award (2001). In 2006, Ion co-founded Conviva, a startup to commercialize technologies for large-scale video distribution, and in 2013, he co-founded Databricks, a startup to commercialize Apache Spark.