May 4, 2016

The Rise of Data Science Notebooks

Dan Osipov

(Sudowoodo/Shutterstock)

Interactive notebooks are experiencing a rise in popularity. How do we know? They’re replacing PowerPoint in presentations, being shared around organizations, and even taking workload away from BI suites (more on that later).

Even though they’ve become prominent only in the past few years, notebooks have a long history. The first notebooks were available in packages like Mathematica and Matlab, used primarily in academia. More recently they’ve started gaining traction in the Python community with the IPython Notebook. Today there are many notebooks to choose from: Jupyter (successor to the IPython Notebook), R Markdown, Apache Zeppelin, Spark Notebook, Databricks Cloud, and more. There are kernels/backends for multiple languages, such as Python, Julia, Scala, SQL, and others.

Traditionally, notebooks have been used to document research and make results reproducible, simply by rerunning the notebook on source data. But why would one choose a notebook over a favorite IDE or the command line? The current browser-based notebook implementations have many limitations that prevent them from offering a comfortable environment for developing code, but what they do offer is an environment for exploration, collaboration, and visualization.

Notebooks are typically used by data scientists for quick exploration tasks. In that regard they offer a number of advantages over local scripts or tools. When properly set up by the organization, a notebook offers direct connections to all necessary sources of data, without additional effort on the part of the user. While it may seem like a trivial task, connecting to the right data source can be far from simple. Even a medium-sized organization will have multiple analytical systems, operational databases, and object and blob stores, each requiring its own driver or API, permission options, and credentials.

Notebooks also tend to be set up in a cluster environment, allowing the data scientist to take advantage of computational resources beyond what is available on her laptop, and to operate on the full data set without having to downsample and download a local copy. Additional computational power also enables complex processing, even quick machine learning model training, that would be intractable on a local machine.

Another major advantage of notebooks is that they can be shared and collaborated on with the ease of using a Google Doc. For an example of what can be done with a shared notebook, check out this demo of a Databricks notebook running Spark. As organizations try to become more data driven, enabling conversations around data with access to all queries, assumptions, formulas, and models, instead of only final reports, is critical. This collaboration reinforces reproducibility: the final report can be rerun with different assumptions by anyone who has access to the notebook, and doesn’t require sequences of shell command invocations known only to the original author.
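
As a minimal sketch of what rerunning a report with different assumptions can look like, a notebook can keep its assumptions in a single parameters cell and derive the report from them. Everything here (the sample figures, the `ASSUMPTIONS` dictionary, and the `build_report` function) is hypothetical and only illustrates the idea; it is not taken from any particular notebook product.

```python
# Hypothetical "parameters" cell: every assumption lives in one place.
ASSUMPTIONS = {"growth_rate": 0.05, "months": 3}

# Hypothetical source data: monthly revenue (made-up numbers for illustration).
MONTHLY_REVENUE = [100.0, 110.0, 120.0]


def build_report(revenue, growth_rate, months):
    """Project revenue forward from the last observed month."""
    projection = []
    current = revenue[-1]
    for _ in range(months):
        current = current * (1 + growth_rate)
        projection.append(round(current, 2))
    return {"observed_total": sum(revenue), "projection": projection}


# Original run, using the author's assumptions.
report = build_report(MONTHLY_REVENUE, **ASSUMPTIONS)

# A colleague reruns the same code with a different growth assumption.
alternative = build_report(MONTHLY_REVENUE, **{**ASSUMPTIONS, "growth_rate": 0.10})

print(report)
print(alternative)
```

Because the assumptions are explicit values rather than steps remembered by the author, anyone with access to the notebook can change one number and regenerate the result.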

Finally, notebooks are starting to offer advanced interactive visualizations. These range from simple line charts and bar graphs to maps and custom D3.js visualizations. This capability is expected by data scientists used to matplotlib or ggplot. Notebooks are now starting to be used to power dashboards, and are taking on some of the workload traditionally done in a BI tool.

At Strata + Hadoop World in March, Ian Andrews asked: have we reached peak BI? Has the business intelligence tool market been saturated? Many organizations have BI tools to give them access to the data, but these tools are typically limited to data sets queryable with SQL. A significant amount of information is available in schemaless formats or requires complex processing that is not trivial to perform in SQL. Notebooks are becoming a viable alternative to a BI suite, simplifying the steps needed to process raw event logs, thereby democratizing access to data. There is definitely a lot of room for improvement, especially in the area of user experience, but one could argue interactive notebooks are already offering a very compelling alternative.
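
To make the point about schemaless data concrete, here is a small sketch of the kind of cell a notebook user might run against raw event logs. The log lines, field names, and the `summarize_events` helper are invented for illustration; the code assumes one JSON object per line and uses only the Python standard library.

```python
import json
from collections import Counter

# Hypothetical raw event log: one JSON object per line, with fields that
# vary from event to event (no fixed schema).
RAW_LOG = """\
{"user": "alice", "event": "click", "target": "buy-button"}
{"user": "bob", "event": "view", "page": "/pricing", "referrer": "search"}
{"user": "alice", "event": "view", "page": "/home"}
not valid json
{"user": "bob", "event": "click", "target": "signup", "meta": {"ab_test": "B"}}
"""


def summarize_events(raw_text):
    """Count events by type and collect the fields seen, skipping bad lines."""
    counts = Counter()
    fields = set()
    skipped = 0
    for line in raw_text.splitlines():
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # tolerate malformed lines instead of failing the run
            continue
        counts[record.get("event", "unknown")] += 1
        fields.update(record.keys())
    return counts, fields, skipped


counts, fields, skipped = summarize_events(RAW_LOG)
print(dict(counts))
print(sorted(fields))
print("skipped lines:", skipped)
```

Records with different sets of fields, nested values, and the occasional malformed line are handled in a few lines here, whereas a SQL-only tool would typically need the data loaded into a fixed schema first.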

About the author: Dan Osipov is a principal consultant for Applicative LLC focused on helping companies tackle data engineering challenges. His expertise includes building pipelines and streaming systems. He sometimes tweets at danosipov

