June 7, 2016

IBM Seeks Data Science Unity with New Spark-Based ‘Experience’


IBM today launched what it’s calling the first enterprise application for data science collaboration. Called the Data Science Experience, the free, cloud-based offering is aimed at enabling data scientists to perform tasks like prepping data and building machine learning models in an open and shared environment.

IBM likens the Data Science Experience, which was developed on Apache Spark, to an integrated development environment (IDE): a single place where data scientists can do their work. Data scientists can work with the software, which is largely based on the open source Jupyter notebook, using a variety of languages, including R, Python, and Scala. It also exposes Spark’s machine learning library, called MLlib, as well as IBM’s SystemML (which IBM contributed to Spark last year) and certain SPSS algorithms.

The offering is intended to bring data scientists together, says IBM’s vice president of product development for analytics, Rob Thomas.

“The biggest problem that we see in organizations today is data science is a very fragmented profession,” Thomas says. “It’s very much an individual sport, where they have one language they like, one tool, they work on their own and you hope they get to meaningful insight. But there’s not a lot of collaboration.”

IBM intends to jump-start that collaborative process with Data Science Experience. “DSE is about bringing your expertise, whatever it is, bringing your tool, whatever it is–whether you like to work in R or Python or Scala or SPSS or anything–and we’ll give you an open environment built on open source where you can collaborate and share those models,” Thomas says. “Basically it’s how you learn, how you make, and how you collaborate around data science all in one environment.”

In addition to hooking into popular data science languages like R, Python, and Scala, IBM is enabling Data Science Experience users to use tools from its partners, such as H2O and RStudio.

Data Science Experience provides functionality to help users ingest and prep data, as well as train and evaluate machine learning models. It also offers access to 250 curated data sets. The product does all this in a collaborative way that fosters openness among a group of data scientists, Thomas says.

“This is about the open application of analytics and data science and trying to get away from the closed wall idea,” he says. “It’s about bringing data science to the masses and enabling machine learning, given that most organizations are struggling with how to get off the ground with it.”

For example, consider the case of a retailer that has hired two data scientists, each of whom works in a different language and with different data types. “The Scala person builds some models for customer data, and then they find something interesting in the customer data,” Thomas says. “They publish that and say ‘Look what I found here. This is what I built. This is what the model says, here’s the regression.’

“The R guy says ‘That’s interesting. I was working off this product data and I hadn’t seen that. What if we put those data sets together? Does that lead to a different set of outcomes?’” Thomas says. “So they might be trading data sets or trading actual models. They might be sharing the output of the models. But the point is we just enabled a discussion that never happens because today they sit in different parts of the building or different parts of the world, so that’s enabling a discussion that doesn’t happen naturally.”

IBM is running Data Science Experience in the cloud. Customers must upload their data to IBM’s cloud to make it work. IBM is currently giving the software away for free, but in the future it may choose to charge for certain features, such as running the machine learning models in real time to score incoming data.

“We’ll see the power of it. It will get bigger over time,” Thomas says. “Where this is going is toward a cloud-based platform of composable data services which can drive all the analytics for an enterprise–whether the data is in Hadoop or a columnar DB or other NoSQL DB–it doesn’t matter.”
