Data Analysis Using Jupyter Notebook

Share And Visualize Important Data Findings: Jupyter Notebook

In a world where data is king, the tool stack used for processing and analyzing it is of growing importance. Turning the contents of, say, a thousand custom-formatted files into insights that drive concrete business actions requires a tool that can both manipulate the data and visualize trends in an intuitive fashion.

Having finally discovered something worth sharing, you face another set of challenges: how do you share it in practice? And, more importantly, how do you make people trust your findings?

One take on this is the Jupyter Notebook. Jupyter Notebook is a spin-off of the IPython notebook project, which emerged in response to the limited capabilities of working interactively in a console. It captures the REPL experience of getting instant feedback on code, while also adding documentation capabilities by rendering Markdown and plots inline.

This way of working is related to Donald Knuth's thoughts on literate programming, where the logic of a program is written to be read by humans rather than only by machines.

One of the reasons that Jupyter became an independent project is that it supports multiple interpreters, such as Julia, Python, and R. This lets you use the same computing environment for different languages.
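
As a minimal sketch of this multi-interpreter support, the snippet below lists the kernels installed on a machine. It assumes the jupyter_client package, which ships with any Jupyter installation:

```python
# List the installed kernels, one entry per available language/interpreter.
from jupyter_client.kernelspec import KernelSpecManager

for name, resource_dir in KernelSpecManager().find_kernel_specs().items():
    print(name, "->", resource_dir)
```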

The interpreter acts as a processing kernel for Jupyter Notebook: computing requests are sent from Jupyter's interface to the kernel, which processes them and sends the results back to the interface. For more details on this architecture, see the Jupyter project's documentation.
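
To make the round trip concrete, here is a hedged sketch of that request/reply cycle, again using the jupyter_client package that Jupyter itself builds on:

```python
# Start a kernel, send it an execute_request (as the notebook front end
# does), and read the result back from the IOPub channel.
from jupyter_client.manager import start_new_kernel

km, kc = start_new_kernel(kernel_name="python3")  # kernel manager + client

msg_id = kc.execute("1 + 1")  # the "computing request"

# Read messages until the reply matching our request arrives.
while True:
    msg = kc.get_iopub_msg(timeout=10)
    if msg["parent_header"].get("msg_id") != msg_id:
        continue
    if msg["msg_type"] == "execute_result":
        print(msg["content"]["data"]["text/plain"])  # prints: 2
        break

kc.stop_channels()
km.shutdown_kernel()
```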

Basics

The interactive environment runs in the web browser. It consists of input cells (which you execute) and output cells containing the results of the executed code.

The content of an input cell may either be code executed against the chosen kernel, or Markdown-formatted text (even LaTeX), allowing you to embed the description of the work process right next to the actual code.
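
The snippet below is a minimal sketch of this pairing, using the nbformat package (part of Jupyter) to build a two-cell notebook: a Markdown cell, with a bit of LaTeX, documenting the code cell that follows it:

```python
import nbformat as nbf

nb = nbf.v4.new_notebook()
nb.cells = [
    nbf.v4.new_markdown_cell("## Draw samples\nWe draw $n = 100$ samples."),
    nbf.v4.new_code_cell("import numpy as np\nx = np.random.rand(100)"),
]
nbf.write(nb, "example.ipynb")  # open this file in Jupyter Notebook
```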

In this sense, Jupyter Notebook is the data science equivalent of a lab notebook as used by researchers performing experiments. It is the recipe for reproducing your analysis not only for yourself, but also for anyone interested.

To see this in action we have prepared a notebook on IBM’s Data Science Experience (DSX) platform – go take a look.

Mature Tool in a New Environment

Jupyter Notebook is not in itself sufficient for performing data science tasks; it depends on the ecosystem around the kernel being used. For Python, the packages numpy, pandas, and matplotlib have emerged as the de facto standard for analyzing time series.
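
As an illustration of what a typical analysis cell might look like, the sketch below assumes a hypothetical file measurements.csv with timestamp and value columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("measurements.csv",
                 parse_dates=["timestamp"], index_col="timestamp")

# Resample the raw series to daily means to expose the trend.
daily = df["value"].resample("D").mean()

# In a notebook, the plot is rendered inline below the cell.
daily.plot(title="Daily mean value")
plt.show()
```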

Jupyter allows data scientists to leverage these packages in a shareable environment that is directly accessible from a web interface. This may also be the reason why large organizations like IBM and Microsoft are adopting the technology as a service.

When Jupyter is exposed as a cloud service, the user can integrate it with other cloud services, such as an Object Storage service or an Apache Spark service. With these combinations of service integrations, a developer can quickly prototype against all sorts of data sources.
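
A hedged sketch of such prototyping against a Spark service from a notebook, assuming pyspark is available in the kernel's environment and that a hypothetical events.csv sits in the attached storage:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notebook-prototype").getOrCreate()

# Read the file into a distributed DataFrame and aggregate in Spark.
events = spark.read.csv("events.csv", header=True, inferSchema=True)
counts = events.groupBy("category").count()

# Pull the small aggregate back into pandas for inline plotting.
counts.toPandas().plot.bar(x="category", y="count")
```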

The Future

Currently, the Jupyter team is striving to deliver a more complete IDE experience and has therefore created JupyterLab, the next generation of the Jupyter Notebook.

JupyterLab includes a file browser, a terminal, and a code/notebook editor. Furthermore, it will bundle the most popular Jupyter extensions, such as auto-save and code prettifying, features also found in common IDEs.

Large organizations also see great potential in integrating the technology into their existing services. For instance, IBM has built its Data Science Experience with Jupyter Notebook at its core, and has also published an example showing how to integrate it with the SPSS Modeler. This creates a playground for interacting with IBM services in a fast and reproducible manner.

How the core of Jupyter will develop is uncertain, but the community around it is both large and active.