
Data Platform/Systems/Jupyter

From Wikitech
An example of the Jupyter interface for data analysis

Jupyter notebooks are a friendly and powerful interface for programming; they work particularly well for data analysis.

In the Data Platform, Jupyter is installed on the stat hosts and works well with either R or Python.

Getting started

Prerequisites

To access Jupyter, you need:

Opening an SSH tunnel

Once you have this access, open an SSH tunnel to one of the stat hosts. There are two main ways to do this. We'll assume you want to connect to stat1008, but you can connect to another host instead by changing the name in the terminal command.

The first option is using the standard SSH command:

$ ssh -N stat1008.eqiad.wmnet -L 8880:127.0.0.1:8880

The second option is to modify your SSH configuration file to automatically open a tunnel whenever you connect to one of the stat hosts:

Match host=!*.*,stat10*
        SessionType none
        HostName %h.eqiad.wmnet
        LocalForward 8880 127.0.0.1:8880

With that added, you can simply use the following command:[1]

$ ssh stat1008
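If the browser cannot reach Jupyter, it can help to confirm that the tunnel's local port is accepting connections at all. The following is a minimal sketch, assuming the tunnel uses local port 8880 as in the commands above; the function name `tunnel_is_up` is made up for this example. It only shows that something is listening locally, not that the Jupyter service on the stat host is healthy.

```python
import socket


def tunnel_is_up(host: str = "127.0.0.1", port: int = 8880, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


state = "open" if tunnel_is_up() else "closed"
print(f"Local port 8880 is {state}")
```

Run this on your own computer, not on the stat host, since the forwarded port lives on your side of the tunnel.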

Note that your Jupyter notebooks and files will be stored on the chosen host only. If you want to move to another server, you will have to copy your files using Rsync. If you need shared access to files, consider putting those files in HDFS.

Logging in

Open localhost:8880 in your browser and log in with your developer account. Use your shell username rather than your wiki username (e.g. nshahquinn-wmf, not Neil Shah-Quinn (WMF)).

Starting your server

You now have to choose which Conda environment to start your server with. If you've never used this stat host before, the only option will be to "Create and use a new cloned conda environment...".

Otherwise, the environment you last used will be preselected. If it's been a long time since that environment was created and you don't have packages or configuration you want to keep, it's best to create a new environment. You'll be prompted to select an existing Conda environment or create a new environment. See the section on Conda environments below.

Once your Jupyter server has started up, you will see the JupyterLab interface and are ready to start working:

The interface you'll see once you sign into our Jupyter service

Using the Jupyter interface

We use a particular Jupyter interface called JupyterLab. See the JupyterLab documentation for help using it.

Managing your server

Restarting

You'll sometimes need to restart your Jupyter server (for example, to fix problems or to use a different Jupyter environment).

Here's how:

  1. Navigate to the server control panel by selecting File → Hub Control Panel
  2. Select "Stop My Server"
  3. Select "Start My Server"

You will now see the dropdown allowing you to choose an existing environment or create a new one:

Installing packages

Many Python packages will be preinstalled in your environment.

If you need a different package, run conda install {{package}} in a terminal. If you need to upgrade to a newer version, run conda update {{package}}. For more information, see Data Platform/Systems/Conda#Installing packages.
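Before installing or upgrading, you may want to check from a notebook cell whether a package is already present and at which version. This is a small sketch using the standard library's `importlib.metadata` (Python 3.8 or later); the helper name `installed_version` and the package names are only illustrations. It looks up the distribution name, which can differ from the name you import.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("pandas", "a-package-that-does-not-exist"):
    version = installed_version(name)
    print(f"{name}: {version or 'not installed'}")
```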

Using R

See Data Platform/Systems/R.

Querying data

To make it easier for you to access data from the stat hosts, use the following software packages, which hard-code much of the setup and configuration.

In Python

For Python, there is Wmfdata-Python. It can access data through MariaDB, Hive, Presto, and Spark and has a number of other useful functions, like creating Spark sessions. For details, see the repository and particularly the quickstart notebook.

Authenticating via Kerberos

If you want to access data from Hadoop (whether through Python, R, or the command line) you will need to authenticate with Kerberos. Open a new terminal in JupyterLab (not a separate SSH session). Type kinit on the command line and enter your Kerberos password at the prompt.
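A notebook can check for a valid ticket before launching a long Hadoop query. The sketch below assumes the stat hosts use MIT Kerberos, where `klist -s` prints nothing and exits with a non-zero status if the credentials cache is missing or expired; the function name `has_valid_ticket` is hypothetical.

```python
import subprocess
from typing import Sequence


def has_valid_ticket(command: Sequence[str] = ("klist", "-s")) -> bool:
    """Return True if the given command exits with status 0."""
    try:
        return subprocess.run(list(command), check=False).returncode == 0
    except OSError:
        # The command is not installed or cannot be executed
        return False


if not has_valid_ticket():
    print("No valid Kerberos ticket found: run kinit in a JupyterLab terminal.")
```

The command is a parameter only so the check can be exercised without Kerberos installed; in normal use, call the function with no arguments.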

Conda environments

Jupyter is set up to use isolated environments managed by Conda. You can create as many Conda environments as you need, but you can only run one at a time in Jupyter. To switch, you'll need to restart your server.
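If you are unsure which environment a running notebook is using, you can inspect the interpreter from a code cell. This is a minimal sketch; `current_environment` is a made-up helper name, and `CONDA_PREFIX` is only present when the environment was activated by Conda.

```python
import os
import sys


def current_environment() -> dict:
    """Describe the Python environment running this kernel."""
    return {
        # Root directory of the Python installation in use
        "prefix": sys.prefix,
        # Set by `conda activate`; None if the variable is absent
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
    }


for key, value in current_environment().items():
    print(f"{key}: {value}")
```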

The choice between different Conda environments when starting your Jupyter server

Sharing Notebooks

Copying files on the stat hosts

It's possible to copy notebooks and files directly on the server by clicking 'New' → 'Terminal' (in the root folder in the browser window) and using the cp command. Note that you may have to change the file permissions using the chmod command to give the other user read access to the files.
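The same copy-and-chmod step can be done from a notebook cell. The following is a sketch using only the standard library; the function name `share_copy` and any paths you pass in are hypothetical. It adds read permission for group and others on the copy only.

```python
import shutil
import stat
from pathlib import Path


def share_copy(source: str, destination: str) -> Path:
    """Copy a file and add read permission for group and others on the copy."""
    # copy2 returns the path of the new file, also when destination is a directory
    target = Path(shutil.copy2(source, destination))
    mode = target.stat().st_mode
    target.chmod(mode | stat.S_IRGRP | stat.S_IROTH)
    return target
```

The other user also needs execute (search) permission on every directory leading to the file, which this sketch does not change.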

GitHub or GitLab

It's also possible to track your notebooks in Git and push them to either GitHub or our GitLab, both of which will display them fully rendered on their websites. Generally, this requires making the notebook public, but it's also possible to request a private GitLab repo if necessary.

In either case, you will need to connect using HTTPS rather than SSH (SRE considers SSH from the stat hosts a security risk because of the possibility that other users could access your SSH keys). To do this, you will need to set up a personal access token (GitLab docs, GitHub docs), which you will use in place of a password when using Git on the command line.

By default, you'll have to enter your username and password every time you push. You can avoid this by adding the following to ~/.gitconfig:

# Automatically add username to GitHub URLs
[url "https://{{username}}@github.com"]
    insteadOf = https://github.com

# Automatically add username to Wikimedia GitLab URLs
[url "https://{{username}}@gitlab.wikimedia.org"]
    insteadOf = https://gitlab.wikimedia.org

# Cache access tokens for 8 hours after entry
[credential]
    helper = cache --timeout=28800
"Open raw" button

Often, notebooks with interactive or HTML elements don't display well in GitLab or GitHub. You can solve this by sharing the Nbviewer link to the notebook.

For GitHub, the tool works fine with the blob version of a notebook; for gitlab.wikimedia.org, however, it can only read the raw version. You can either change blob to raw in the URL, or open the raw version (from the top bar) as shown in the image and copy the URL.
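The blob-to-raw change can be scripted. This sketch assumes the URL contains the `/-/blob/` segment that current GitLab versions use; the function name `gitlab_raw_url` and the example repository path are placeholders.

```python
def gitlab_raw_url(blob_url: str) -> str:
    """Turn a GitLab 'blob' URL into the matching 'raw' URL."""
    marker = "/-/blob/"
    if marker not in blob_url:
        raise ValueError(f"Not a GitLab blob URL: {blob_url}")
    # Replace only the first occurrence so the file path itself is untouched
    return blob_url.replace(marker, "/-/raw/", 1)


url = "https://gitlab.wikimedia.org/me/my-repo/-/blob/main/report.ipynb"
print(gitlab_raw_url(url))
# https://gitlab.wikimedia.org/me/my-repo/-/raw/main/report.ipynb
```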

Quarto reports

Quarto is a tool for generating rich reports and presentations from Jupyter notebooks or plain Markdown files. It can generate a wide variety of output formats, including HTML, PDF, and MediaWiki markup.

Installing Quarto

  1. In a Jupyter terminal, run conda install quarto.
  2. In a Jupyter terminal, run python3 -m pip install jupyterlab-quarto==0.1.45.[2]
  3. Restart your server.[3]

Writing a notebook for Quarto

Any Jupyter notebook works as input for Quarto: you write code in code cells and Markdown text in Markdown cells.

However, there are lots of things you can add that aren't understood by Jupyter but will make cool things happen when you render the notebook. For full details, see the Quarto guide. Good sections to start with include:

One particularly useful thing is adding "front matter" in a Markdown cell to set Quarto options. Here's a good starting point for outputting an HTML file:

---
title: "My report"
author:
  # Multiple authors can be added; each entry starts with `-`
  - name: Me
    url: https://meta.wikimedia.org/wiki/User:Me-WMF
    affiliation:
      - name: My Team, Wikimedia Foundation
        url: https://meta.wikimedia.org/wiki/My_Team
# Publication date
date: 2025-01-01
# Adds the date the file was last modified beside the publication date
date-modified: last-modified
license: "CC BY"
format:
  # Tells Quarto that the notebook should be rendered to HTML, so you don't need
  # to specify every time you run `quarto render`
  html:
    # Embeds Quarto's CSS and JS in the HTML file, which makes it easier to share the
    # output
    embed-resources: true
    # Adds a table of contents from the Markdown headings
    toc: true
    # Code cells start out collapsed
    code-fold: true
    code-links:
      - text: Source code
        icon: file-code
        href: https://gitlab.wikimedia.org/me/my-repo
---

For more HTML format options, see the HTML section of the guide.

Rendering a notebook

To produce the output file from your notebook, run the terminal command quarto render some_notebook.ipynb. If you didn't specify the output format in the notebook's front matter, specify it in the command by adding, for example, --to html.

This will output an HTML file named some_notebook.html. If you didn't set embed-resources: true in the front matter, there will also be a folder named some_notebook_files containing JavaScript and CSS files. This folder will need to stay alongside the HTML file for it to display and function correctly.
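If you render several notebooks from a script or a notebook cell, the command can be built programmatically. The sketch below is an illustration, not part of Quarto: the helper names `render_command`, `html_output_path`, and `render` are made up, and only the `quarto render` command and the `--to` option come from the text above.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def render_command(notebook: str, to: Optional[str] = None) -> List[str]:
    """Build the `quarto render` command line for a notebook."""
    command = ["quarto", "render", notebook]
    if to is not None:
        command += ["--to", to]
    return command


def html_output_path(notebook: str) -> Path:
    """Path of the HTML file written next to the notebook."""
    return Path(notebook).with_suffix(".html")


def render(notebook: str, to: Optional[str] = None) -> None:
    """Run Quarto; raises CalledProcessError if rendering fails."""
    subprocess.run(render_command(notebook, to), check=True)


print(" ".join(render_command("some_notebook.ipynb", to="html")))
print(html_output_path("some_notebook.ipynb"))
```

Calling `render("some_notebook.ipynb", to="html")` only works in an environment where Quarto is installed.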

Previewing a notebook

If you have a lot of writing to do, you can set up a live preview of your output.

  1. In a Jupyter terminal, run: quarto preview {your_notebook}.ipynb --port 6513[4]
  2. In a terminal on your computer, run ssh -N {your_stat_host} -L 6513:127.0.0.1:6513
  3. Open http://localhost:6513 and you will see the HTML preview of your notebook.
  4. Work on your notebook. Every time you save, the preview will update.
  5. When you're done using the preview, shut it down by pressing Control + C in the Jupyter terminal.

Note that previewing involves Quarto essentially running quarto render over and over, so you don't need to re-render later unless you made changes to the notebook after shutting down the preview.

Troubleshooting

Server fails to start after creating a new environment

If you attempt to start your Jupyter server using the "create and use a new Conda environment" option, but the start-up fails, delete your .conda/pkgs/cache directory and try again (task T380477).

Server fails to start with error "No such file or directory: 'jupyterhub-singleuser'"

If you attempt to start your Jupyter server using an existing Conda environment, it may fail with the message Error: HTTP 500: Internal Server Error (Error in Authenticator.pre_spawn_start: FileNotFoundError [Errno 2] No such file or directory: 'jupyterhub-singleuser').

In this case, the environment has probably been deleted, but JupyterHub has not updated the list of environments. Try using a different environment or creating a new one. Once you start an environment successfully, JupyterHub will update the list and the deleted environment will disappear.
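To see which environments still exist on disk, and whether each one contains the jupyterhub-singleuser executable named in the error, you could run a check like the one below from a terminal or from a notebook in a working environment. This is a sketch under one loud assumption: that your environments live in ~/.conda/envs. The function name `check_environments` is hypothetical.

```python
from pathlib import Path
from typing import Dict


def check_environments(envs_dir: Path) -> Dict[str, bool]:
    """Map each environment name to whether it has bin/jupyterhub-singleuser."""
    if not envs_dir.is_dir():
        return {}
    return {
        env.name: (env / "bin" / "jupyterhub-singleuser").exists()
        for env in sorted(envs_dir.iterdir())
        if env.is_dir()
    }


# Assumed location of per-user environments; adjust if yours live elsewhere.
for name, usable in check_environments(Path.home() / ".conda" / "envs").items():
    print(f"{name}: {'ok' if usable else 'missing jupyterhub-singleuser'}")
```

An environment that JupyterHub offers but that does not appear in this listing has most likely been deleted.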

Trouble installing R packages

See Data Platform/Systems/R#Troubleshooting.

Browser disconnects

If your browser session disconnects from the kernel on the server (for example, because your SSH connection times out), any work the kernel is doing will continue, and you'll be able to access the results the next time you connect to the kernel. However, no further display output for that work (like print() commands to log progress) will accumulate, even if you reopen the notebook (JupyterLab issue 4237).
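One way to keep progress messages across a disconnect is to write them to a file instead of relying on cell output. The following sketch uses the standard `logging` module; the helper name `make_progress_logger` and the file name `progress.log` are placeholders.

```python
import logging


def make_progress_logger(path: str) -> logging.Logger:
    """Return a logger that appends timestamped messages to the given file."""
    logger = logging.getLogger(f"progress:{path}")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers when the cell is re-run
        handler = logging.FileHandler(path)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger


log = make_progress_logger("progress.log")
for step in range(3):
    # ... long-running work would go here ...
    log.info("finished step %d of 3", step + 1)
```

You can then follow the file from a JupyterLab terminal (for example with tail -f progress.log) after reconnecting.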

Notebook is unresponsive, or kernel restarts when running a large query

It may be that your Jupyter Notebook Server ran out of memory and the operating system's out-of-memory killer killed your kernel to cope with the situation. You won't get any notification that this has happened other than the notebook being unresponsive or restarting, but you can assess the state of the memory on the notebook server by checking the stat hosts dashboard.
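For a quick look from inside a notebook, you can also read the host's memory figures directly. This is a sketch that assumes a Linux host exposing /proc/meminfo with a MemAvailable field; the function names are made up. The numbers describe the whole host, which is shared with other users, not just your own server.

```python
from pathlib import Path
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo content into {field: number} (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def available_gib(text: str) -> float:
    """Return MemAvailable converted from kibibytes to gibibytes."""
    return parse_meminfo(text)["MemAvailable"] / (1024 ** 2)


meminfo = Path("/proc/meminfo")
if meminfo.exists():
    print(f"{available_gib(meminfo.read_text()):.1f} GiB available on this host")
```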

Sometimes trying to access the main interface at http://localhost:8880 will throw an HTTP 500 error. In these cases it may be possible to visit http://localhost:8880/hub/home and stop the server.

Viewing Jupyter Notebook Server logs

JupyterHub logs are viewable by normal users in Kibana.

A dashboard named JupyterHub has been created; it is also linked from the Home Dashboard.

At present the logs are not split per user, but we are working to make this possible.

They are no longer written by default to /var/log/syslog but they are retained on the host in the systemd journal.

You might need to see JupyterHub logs to troubleshoot login issues or resource issues affecting the cluster.

An individual user's notebook server log can be examined with the following command:

sudo journalctl -f -u jupyter-$USERNAME-singleuser.service

Viewing JupyterHub logs

TODO: Make this work for regular users!

You might need to see JupyterHub logs to troubleshoot login issues:

sudo journalctl -f -u jupyterhub

"An error occurred while trying to connect to the Java server"

If you see an error like this:

ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:43881)
Traceback (most recent call last):
  File "/usr/lib/spark3/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 977, in _get_connection
    connection = self.deque.pop()
IndexError: pop from an empty deque

Try each of these options, one at a time:

  1. Restart your notebook kernel (Menu -> Kernel -> Restart kernel).
  2. Restart your JupyterHub server (follow the steps for changing environments, but use the same environment).
  3. Create and use a brand new environment (follow the steps given previously, but select "Create and use a new cloned conda environment...").

Tips

Sending emails from within a notebook

To send out an email from a Python notebook (e.g. as a notification that a long-running query or calculation has completed), you can use the following code:

from email.message import EmailMessage
import smtplib

# IPython shell capture: each call returns a list of output lines
hostname = !hostname
server = hostname[0] + '.eqiad.wmnet'

whoami = !whoami
user = whoami[0]

def send_email(
    subject,
    body,
    to_email=user+'@wikimedia.org',
    from_email=user+'@'+server
):
    message = EmailMessage()
    message.set_content(body)

    message['From'] = from_email
    message['To'] = to_email
    message['Subject'] = subject

    # Send through the host's local mail server; the connection is closed
    # automatically when the with block exits
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(message)

(Invoking the standard mail client via the shell, i.e. !mailx, fails for some reason. See phab:T168103.)
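To send the notification automatically when a long-running function ends, whether it succeeds or raises, you could wrap the function in a small decorator. This is a sketch: `notify_when_done` is a made-up name, and it accepts any callable taking a subject and a body, such as the send_email function above. The example uses print so it runs anywhere.

```python
import functools
import time


def notify_when_done(notify):
    """Decorator: call notify(subject, body) after the wrapped function ends."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            status = "failed"
            try:
                result = func(*args, **kwargs)
                status = "finished"
                return result
            finally:
                # Runs on success and on error, so a crash is also reported
                elapsed = time.monotonic() - start
                notify(f"{func.__name__} {status}", f"Elapsed: {elapsed:.0f} s")
        return wrapper
    return decorator


# In a notebook, pass send_email instead of print.
@notify_when_done(print)
def long_query():
    return sum(range(1000))


long_query()
```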

Shell shortcuts

To make it easier to access the stat hosts you can add entries like

Host stat11
    HostName stat1011.eqiad.wmnet

to ~/.ssh/config so that connecting to stat1011.eqiad.wmnet is as easy as ssh stat11. You can also make it easier to open SSH tunnels without remembering the full command. For example, if you are using Z shell, you can add a tunnel function

tunnel() {
    ssh -N "$1" -L 8880:127.0.0.1:8880
}

to ~/.zshrc so that opening a tunnel to stat1011.eqiad.wmnet is as easy as tunnel stat11 (assuming you added the appropriate entries to your SSH config).

Administration

See Data Platform/Systems/Jupyter/Administration.

References

  1. If you later want to open a plain interactive SSH session with one of the analytics clients, you can still do this by using its full name: ssh stat1008.eqiad.wmnet.
  2. This installs Quarto's JupyterLab extension, which makes Quarto-specific markup display more nicely in notebook editing mode. The version specification is required because we are running JupyterLab 3, which is not the latest version.
  3. During the start-up, various scripts will be run which are necessary for Quarto to run correctly. You can run the scripts manually using eval "$(conda shell.bash activate)", but that will only apply to that individual terminal window.
  4. The specific port number doesn't matter; if you don't provide it, Quarto will pick one randomly. These instructions use a fixed number for simplicity.