Ben Lorica on What to Expect at Strata Data Conference
Thousands of data-obsessed technologists will descend on Silicon Valley this week to take part in the Strata Data Conference’s annual West Coast swing. Datanami caught up with O’Reilly Media’s Chief Data Scientist Ben Lorica, who’s also Strata’s program chair, to get the lowdown on the show’s high-tech expectations.
One of the first things that attendees will notice is the name change. The last time O’Reilly Media and Cloudera’s show came to the San Jose McEnery Convention Center, it was called Strata + Hadoop World. With the influence of Hadoop waning, the show organizers opted to return the big data conference to its original name.
While Hadoop is still a big driver in the big data world, it’s not the force that it once was, Lorica says. “Hadoop as a storage system is still in a lot of companies, but a lot of companies are also using a variety of other storage systems,” including cloud-based object stores, he says. “I don’t think that Hadoop is as big a piece of the storage in the enterprise as it would have been four years ago.”
People who attend Strata hoping to glean some insight on how to design infrastructure to support large, distributed data-oriented applications will be exposed to a wide range of technologies and techniques. Apache Hadoop is just one of many infrastructure topics that will be explored at the show, Lorica says.
“There’s about the same number of sessions on data engineering and architecture as data science and machine learning, but the data engineering and architecture [sessions] in the past would have been more Hadoop-focused,” he says. “Now there’s a lot of cloud. There are a lot of projects that are not Hadoop. Besides Spark and Kafka, there are a lot of streaming and real-time projects. There are a lot of sessions on…how do you architect data platforms and data pipelines, how do you architect in the cloud and on-premise.”
Momentum is still strong around Apache Spark, which can scratch multiple itches for big data types, including infrastructure engineering and data science and machine learning needs. Lorica mentioned a couple of Spark extensions that he’s been tracking, including a natural language processing library called Spark NLP, and BigDL, the deep learning extension to Spark open sourced by Intel.
“You don’t have to install anything” to use BigDL, Lorica says. “You just go from your data wrangling and data preparation inside Spark directly to deep learning…Having said that, we have sessions and tutorials and trainings on all the major deep learning libraries at the conference.”
There are hundreds of sessions that Strata attendees can choose from over the four-day show. Attending them is a good way to get acquainted with emerging technologies such as graph databases and streaming data systems, which are hot areas, Lorica says.
“There are a lot of new systems that have come out in the last 18 months,” says Lorica, who goes by @BigData on Twitter. “I was surprised when they first came out because I thought that a lot of these systems were already mature. But then when you talk to the people who are dealing with these types of data, there are still things that they want out of their systems. There’s a new generation of systems for temporal data and graph that will be at the conference.”
There has been a stronger focus on data engineering and infrastructure topics at past Strata shows, which isn’t surprising considering the strong influence from Hadoop. That corresponded with a focus on working with structured data. But now, as data science and machine learning are getting hotter, there’s a bigger focus on incorporating unstructured data into the mix, Lorica says.
“The other thing that Strata will be evolving towards is integrating more types of data that normally the Strata community hasn’t used in the past,” he says. “The Strata community has mostly been structured data, maybe logs, and unstructured text. I think as deep learning becomes more accessible, you’ll see more Strata [sessions] working with images, for example. That’s an easy move for them because a lot of the deep learning libraries and frameworks, the working examples are images.”
Lorica will again be joined on Strata’s main stage by Doug Cutting, the Hadoop co-creator and chief architect at Cloudera, and Alistair Croll, an entrepreneur and author at solveforinteresting.com. The trio will introduce the keynote speakers on Wednesday and Thursday morning.
There will be a pronounced focus on privacy and ethics during the keynote addresses this week, Lorica says. The O’Reilly data scientist himself will present a keynote titled “Privacy in the age of machine learning” on Wednesday. On Thursday, Natalie Evans Harris, who is the founder of Harris Data Consulting and COO of BrightHive, will talk about responsible data practices, what Lorica termed “the Hippocratic Oath of data scientists.”
“I think it’s something that people are just much more aware about,” Lorica says about the role of ethics and fairness in data science. “I think external events have raised the issue in the minds of data scientists. For example, the role of algorithms in the election of 2016, among other things.”
There hasn’t been as great a need to confront issues of algorithmic bias in the past because so few organizations were actually running machine learning systems in production. That has changed considerably over the past few years, Lorica says.
“Now I think the platforms for doing data science are much more widely available,” he says. “More people are able to do…machine learning. I think two years ago, a lot of people were still [doing] BI with big data.”
Lorica highlighted one other session that will likely draw a big crowd. Jeff Dean, the Google senior fellow and legendary technologist, will talk about applications of the Google Brain technology in a session titled “Using deep learning to solve challenging problems.”