People to Watch 2019
Ali Ghodsi is the CEO and co-founder of Databricks, where he is responsible for the growth and international expansion of the company. He previously served as the VP of Engineering and Product Management before taking the role of CEO in January 2016. In addition to his work at Databricks, Ali serves as an adjunct professor at UC Berkeley and is on the board at UC Berkeley’s RISELab. Ali was one of the original creators of the open source project Apache Spark, and ideas from his academic research in the areas of resource management, scheduling, and data caching have been applied to Apache Mesos and Apache Hadoop. Ali received his MBA from Mid-Sweden University in 2003 and his PhD from KTH/Royal Institute of Technology in Sweden in 2006 in the area of distributed computing.
Datanami: Apache Spark emerged about five years ago amid much fanfare, and has largely maintained that excitement level while other frameworks (like Hadoop) have faltered. What do you attribute Spark’s lasting success to?
Ali Ghodsi: Two things: its extreme flexibility and its extreme simplicity. To make it more concrete, for the first time, whenever you have lots and lots of data, more than you can fit on one machine, you can do all sorts of things with it. You can do machine learning, so you can make predictions on that massive amount of data. If you had a big social network in which you wanted to find friends or find fraud, you can do that now. If you had real-time problems and wanted to do real-time detection of things, you can do that. Even if you wanted to do things like SQL warehousing, you can do that. That flexibility to move so easily between these many types of use cases is really the main reason I think it took off. The Spark project called it “unified analytics”: it unified all of these different types of analytics under one framework, whereas Hadoop didn’t have, for instance, the machine learning, SQL, or real-time components. So bringing what we call “unified analytics” under one umbrella is what made it super powerful.
Datanami: You’ve continued teaching at Berkeley while leading Databricks as its CEO. Is it difficult to do both, and if so, what’s the secret to working in academia and industry?
When you’re doing both you really have to work double hours, so that’s the secret. I have to follow a really strict schedule. For example, if I have a lecture on Monday, I use Sunday as my prep day for Berkeley and try to stay away from Databricks. If you have a strict schedule and really adhere to it, you can make it work. Actually, the two help each other. The research can be more foundational, whereas the work we do at Databricks can be a little more pragmatic and applications-oriented. We tend to have students who cross over: a good number of Berkeley students have now joined Databricks, and I find the interplay between the two is super important.
Datanami: In addition to a shortage of data scientists, the industry has coped with a shortage of data engineers. In your opinion, which persona is more critical to succeeding with big data at this point in time?
Okay, I’m actually going to disappoint you there and say that I’m not going to pick a persona. Basically, what has happened is that we’re finding that if you take algorithms from the 1970s and add massive amounts of data to them, you get fantastic results. You need both the data and the AI, so you need both data science and data engineering. That’s really the whole point: if you only have one, you won’t succeed. I would say the combination of the two is what’s really important, and that’s what Spark did with unified analytics: it combined data engineering with data science. If you can find someone who can do both, that’s awesome! If not, you have to find a way to have them collaborate, because if you only have one, you’re in trouble.
Datanami: AMPLab had a string of hits with Spark, Mesos, and Tachyon (now Alluxio). How has that commercial success impacted the successor project at Berkeley, the RISELab?
The history of Berkeley goes back a long time. I came to Berkeley with great respect for the systems that had come out of Berkeley originally. BSD Unix was built at Berkeley, Sendmail came out of Berkeley, so many generations have come to Berkeley with that respect. It was awesome that AMPLab could produce those projects, and because of those projects, there are now new generations of students coming to Berkeley who really understand the lineage and are able to continue that tradition of building systems that will have a massive impact. It’s enabled Berkeley to get better students than before. It’s number one on the planet now if you want people who build systems with that kind of impact. There’s no better place to look than Berkeley. If you think about it, many of these projects take a long time before they actually become popular. Spark was started in 2009, and media and industry only completely fell in love with it in 2015. It took six years, so those projects at RISELab will take six to seven years, and there are many. There’s a project called Ray, which I’m not overly familiar with, but it’s an exciting one that we’ll see in the next decade.
Datanami: Outside of the professional sphere, what can you share about yourself that your colleagues might be surprised to learn – any unique hobbies or stories?
Well, I’m a gym rat, but that’s not much of a surprise to most people. I’m also a closet wannabe economist.