The Rise of Data Science Notebooks
Interactive notebooks are experiencing a rise in popularity. How do we know? They’re replacing PowerPoint in presentations, they’re being shared around organizations, and they’re even taking workload away from BI suites (more on that later).
Even though they’ve become prominent in the past few years, they have a long history. The first notebooks were available in packages like Mathematica and MATLAB, used primarily in academia. More recently they’ve started getting traction in the Python community with the IPython Notebook. Today there are many notebooks to choose from: Jupyter (successor to the IPython Notebook), R Markdown, Apache Zeppelin, Spark Notebook, Databricks Cloud, and more. There are kernels/backends for multiple languages, such as Python, Julia, Scala, SQL, and others.
Traditionally, notebooks have been used to document research and make results reproducible, simply by rerunning the notebook on source data. But why would one choose a notebook over a favorite IDE or the command line? There are many limitations in the current browser-based notebook implementations that prevent them from offering a comfortable environment for developing code, but what they do offer is an environment for exploration, collaboration, and visualization.
Notebooks are typically used by data scientists for quick exploration tasks. In that regard they offer a number of advantages over local scripts or tools. When properly set up by the organization, a notebook offers direct connections to all necessary sources of data, without additional effort on the part of the user. While it may seem like a trivial task, connecting to the right data source can be far from simple. Even a medium-sized organization will have multiple analytical systems, operational databases, and object and blob stores, each requiring its own driver or API, permission options, and credentials.
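To make the idea of pre-configured data sources concrete, here is a minimal sketch in Python. It is not taken from any particular notebook product: the `register_source` and `connect` helpers and the "orders" source are hypothetical, and an in-memory SQLite database stands in for a real warehouse so the example runs anywhere.

```python
import sqlite3
from typing import Callable, Dict

# Hypothetical registry: in a real deployment an administrator would register
# warehouse, operational-database, and object-store connections here, keeping
# drivers, hosts, and credentials out of the notebook itself.
_SOURCES: Dict[str, Callable[[], sqlite3.Connection]] = {}


def register_source(name: str, factory: Callable[[], sqlite3.Connection]) -> None:
    """Register a connection factory under a short, memorable name."""
    _SOURCES[name] = factory


def connect(name: str) -> sqlite3.Connection:
    """Return a ready-to-use connection for a named source."""
    if name not in _SOURCES:
        raise KeyError(f"unknown data source {name!r}; known: {sorted(_SOURCES)}")
    return _SOURCES[name]()


def _demo_orders_db() -> sqlite3.Connection:
    # Stand-in for a real system: an in-memory SQLite database with sample rows.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany(
        "INSERT INTO orders (amount) VALUES (?)", [(10.0,), (25.5,), (4.5,)]
    )
    conn.commit()
    return conn


register_source("orders", _demo_orders_db)

# What the notebook user would type: a name, not a driver or a password.
conn = connect("orders")
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)
```

The point of the sketch is the split in responsibilities: whoever maintains the environment writes the factories once, and everyone else only needs to know the name of the source.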
Notebooks also tend to be set up in a cluster environment, allowing the data scientist to take advantage of computational resources beyond what is available on her laptop, and to operate on the full data set without having to downsample and download a local copy. Additional computational power also enables complex processing, even quick machine learning model training that would be intractable on a local machine.
Another major advantage of notebooks is that they can be shared and collaborated on with the ease of using a Google Doc. For an example of what can be done with a shared notebook, check out this demo of a Databricks notebook running Spark. As organizations try to become more data driven, it is critical to enable conversations around data with access to all queries, assumptions, formulas, and models, instead of only final reports. This collaboration reinforces reproducibility: the final report can be rerun with different assumptions by anyone who has access to the notebook, and doesn’t require sequences of shell command invocations known only to the original author.
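Rerunning a report under different assumptions is easiest when those assumptions are explicit parameters rather than values buried in the code. The following Python sketch illustrates that idea only; the `Assumptions` class, the `project_revenue` function, and all of the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Assumptions:
    """Inputs a reader might want to challenge (all values are hypothetical)."""

    monthly_growth: float  # 0.10 means 10% growth per month
    months: int  # how many months to project forward


def project_revenue(current: float, a: Assumptions) -> List[float]:
    """Compound the current monthly revenue forward under the given assumptions."""
    projection = []
    value = current
    for _ in range(a.months):
        value = value * (1 + a.monthly_growth)
        projection.append(round(value, 2))
    return projection


# The original author's run, and a colleague's rerun with a more cautious view.
baseline = project_revenue(1000.0, Assumptions(monthly_growth=0.10, months=3))
cautious = project_revenue(1000.0, Assumptions(monthly_growth=0.0, months=3))
print(baseline)
print(cautious)
```

Anyone with access to the notebook can change one value in the last cell and rerun it, which is the kind of conversation around assumptions described above.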
Finally, notebooks are starting to offer advanced interactive visualizations. These range from simple line charts and bar graphs to maps and custom D3.js visualizations. This capability is expected by data scientists used to matplotlib or ggplot. Notebooks are now starting to be used to power dashboards, and are taking over some of the workload traditionally done in a BI tool.
At Strata + Hadoop World in March, Ian Andrews asked: have we reached peak BI? Has the business intelligence tool market been saturated? Many organizations have BI tools to give them access to the data, but these tools are typically limited to data sets queryable with SQL. A significant amount of information is available in schemaless formats or requires complex processing that is not trivial to perform in SQL. Notebooks are becoming a viable alternative to a BI suite, simplifying the steps needed to process raw event logs, thereby democratizing access to data. There is definitely a lot of room for improvement, especially in the area of user experience, but one could argue that interactive notebooks already offer a very compelling alternative.
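As a rough illustration of the kind of processing that is awkward in SQL but short in a notebook cell, the sketch below parses a raw event log in which each line is a JSON object, fields vary from event to event, and some lines are malformed. The log contents and field names are invented for the example, and it uses only the Python standard library.

```python
import json
from collections import Counter

# Hypothetical raw log: one JSON object per line, with fields that vary by event.
RAW_LOG = """\
{"event": "page_view", "user": "u1", "path": "/home"}
{"event": "purchase", "user": "u2", "item": {"sku": "A-1", "price": 19.99}}
not valid json, e.g. a truncated line
{"event": "page_view", "user": "u2", "path": "/pricing", "referrer": "search"}
{"user": "u3"}
"""


def parse_events(raw: str):
    """Yield parsed events, skipping blank or malformed lines."""
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate bad lines instead of failing the whole run
        if isinstance(record, dict):
            yield record


events = list(parse_events(RAW_LOG))

# Count events by type; records without an "event" field are grouped as "unknown".
counts = Counter(e.get("event", "unknown") for e in events)

# Reach into a nested field that only some events carry.
revenue = sum(
    e["item"]["price"]
    for e in events
    if e.get("event") == "purchase" and "item" in e
)
print(counts)
print(revenue)
```

Nothing here needed a schema declared up front: missing fields, nested values, and bad lines are handled in a few lines of ordinary code, and the same cell could feed a chart directly.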
About the author: Dan Osipov is a principal consultant for Applicative LLC focused on helping companies tackle data engineering challenges. His expertise includes building pipelines and streaming systems. He sometimes tweets at danosipov