Machine Learning/ML-Lab
This page collects documentation on how to use the ML-team's lab machines (ml-lab1001.eqiad.wmnet and ml-lab1002.eqiad.wmnet).
NOTE: If you have feedback regarding the process, documentation and how well things work, don't hesitate to contact us on IRC (#wikimedia-ml), Slack (#ml-lab-user-group) or via a Phabricator task.
Machine overview
The two machines are identical in hardware configuration:
- 1x AMD EPYC 7643P 48-Core Processor (96 threads)
- 348GiB of RAM
- 2x AMD Instinct MI210 (Aldebaran)
- ~275GiB of SSD for storage (this will be expanded soon)
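To inspect the GPUs and their current utilization, you can use the rocm-smi tool (assuming the ROCm CLI utilities are installed on the hosts, which the shared PyTorch-ROCm setup suggests):
$ rocm-smi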
Access to the machines
Machine access is via the ml-lab-users group in Puppet, which usually contains ML-team members, researchers, and similar people. If you need access, talk to Chris Albon.
SSH logins work along the same lines as the statboxes, but these machines have no access to the Data Lake, Hadoop, etc., and there is no Kerberos.
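For convenience, you can add the hosts to your ~/.ssh/config. A minimal sketch, assuming your production access already goes through a Wikimedia bastion (the bastion hostname and username below are illustrative; use the ones from your own setup):
Host ml-lab*.eqiad.wmnet
    User your-shell-username
    ProxyJump bast1003.wikimedia.org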
Outside access/downloads/proxies
See HTTP proxy#How-to? for how to enable/disable outside web access for pip/Huggingface/... downloads and the like.
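There is typically a set_proxy shell helper (used later on this page) that exports the relevant variables for you. If you need to set them by hand, a minimal sketch, assuming the standard webproxy endpoint documented on the HTTP proxy page (verify the host/port there):
$ export http_proxy=http://webproxy.eqiad.wmnet:8080
$ export https_proxy=http://webproxy.eqiad.wmnet:8080
$ export no_proxy=127.0.0.1,localhost,.wmnet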
PyTorch and ROCm
Since the machines have AMD GPUs, the way to use them for acceleration is via the ROCm set of libraries. Typically, people will use PyTorch's ROCm variant to do so. Note: if you need a different library for some reason, please contact ML-Team first, as there are disk usage considerations.
To get a Python venv with PyTorch-ROCm, you create a new venv and use PYTHONPATH to pick up the ROCm-enabled PyTorch libraries in /srv:
$ mkdir myproject
$ cd myproject/
$ python3 -m venv venv
$ source venv/bin/activate
$ export "PYTHONPATH=/srv/pytorch-rocm/venv/lib/python3.11/site-packages/:$PYTHONPATH"
$ python3
Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
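To confirm that both MI210 GPUs are visible (and not just that the ROCm runtime loaded), you can also query the device count and name in the same session; the output shown here is illustrative:
>>> torch.cuda.device_count()
2
>>> torch.cuda.get_device_name(0)
'AMD Instinct MI210'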
Huggingface Cache
It is strongly recommended to use the shared Huggingface cache, since disk space is still limited (it also saves on download time, even with the caching proxy).
To use the cache automatically when dealing with Huggingface code, set the corresponding environment variable:
export HF_HOME="/srv/hf-cache/"
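To avoid setting this by hand in every shell, you can append it to your ~/.profile (a minimal sketch):
$ echo 'export HF_HOME="/srv/hf-cache/"' >> ~/.profile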
Jupyter
With the information above, we can make a script to run Jupyter notebooks. First, we set up the environment as above:
$ python3 -m venv venv
$ source venv/bin/activate
$ export "PYTHONPATH=/srv/pytorch-rocm/venv/lib/python3.11/site-packages/:$PYTHONPATH"
Then we create a requirements.txt file that has the dependencies for JupyterLab notebooks:
accelerate==1.1.0
bitsandbytes==0.44.1
huggingface-hub==0.26.2
jupyterlab==4.3.0
ipywidgets==8.1.5
optimum==1.23.3
pandas==2.2.3
transformers==4.46.1
We install these the usual way:
$ set_proxy
$ pip install -r requirements.txt
NOTE: Without PYTHONPATH set up correctly, this will download a version of PyTorch that bundles the wrong (NVIDIA) libraries, which will not only fail to work, but also waste several gigabytes of disk space.
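A quick way to verify that the shared ROCm build is the one being picked up is to print where torch resolves from; with PYTHONPATH set as above, the path should point into /srv/pytorch-rocm/ (version string shown is illustrative):
$ python3 -c 'import torch; print(torch.__version__, torch.__file__)'
2.x.x+rocmX.Y /srv/pytorch-rocm/venv/lib/python3.11/site-packages/torch/__init__.py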
We can then run a script like this to launch the JupyterLab server:
#!/bin/bash
# Some variables, should you need to tweak stuff
PYTORCHPATH="/srv/pytorch-rocm/venv/lib/python3.11/site-packages/"
SHARED_HFHOME="/srv/hf-cache/"
JUPLABPORT=8889
VENV=venv

set -e

if [ ! -r "$VENV"/bin/activate ]; then
    echo "$VENV/bin/activate does not exist, exiting" >&2
    exit 1
fi
source "$VENV/bin/activate"

if [ -z "$PYTHONPATH" ]; then
    echo "Setting PYTHONPATH to $PYTORCHPATH"
    export PYTHONPATH="$PYTORCHPATH"
else
    echo "Prefixing PYTHONPATH with $PYTORCHPATH"
    export PYTHONPATH="$PYTORCHPATH:$PYTHONPATH"
fi

echo "Setting HF_HOME to $SHARED_HFHOME"
export HF_HOME="$SHARED_HFHOME"

# Restrict PyTorch to the second GPU (device index 1); adjust or unset to use both
export CUDA_VISIBLE_DEVICES=1

# Prompt for Hugging Face login
read -rp "Do you want to log in to Hugging Face? (y/n, default=n): " hf_login
hf_login=${hf_login:-n} # Default to 'n' if no input
if [[ "$hf_login" == "y" ]]; then
    huggingface-cli login
else
    echo "Skipping Hugging Face login"
fi

# Start Jupyter Lab
echo "Starting Jupyter Lab on port $JUPLABPORT"
jupyter lab --no-browser --port="$JUPLABPORT"
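Save the script in your project directory (the filename run-jupyter.sh below is just an example), make it executable, and run it:
$ chmod +x run-jupyter.sh
$ ./run-jupyter.sh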
When the Jupyter Server is up, you will see something like:
[C 2024-11-14 10:46:41.272 ServerApp]
To access the server, open this file in a browser:
file:///srv/home/aikochou/.local/share/jupyter/runtime/jpserver-633556-open.html
Or copy and paste one of these URLs:
http://localhost:8889/lab?token=3cc2922e89194d0b42a03225da447a78e5dda7b5d528fa45
http://127.0.0.1:8889/lab?token=3cc2922e89194d0b42a03225da447a78e5dda7b5d528fa45
Remember your token; you will need it in the next step.
SSH tunnel
Open another terminal and use the following command to open an SSH tunnel to ml-lab:
$ ssh -L <LOCAL_PORT_YOU_WANT>:localhost:<JUPLAB_PORT> ml-lab1001.eqiad.wmnet
and open the following URL in your browser:
http://localhost:<LOCAL_PORT_YOU_WANT>/lab?token=<TOKEN>
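For example, with local port 8888 and the default JUPLABPORT of 8889 from the script above:
$ ssh -L 8888:localhost:8889 ml-lab1001.eqiad.wmnet
and then browse to:
http://localhost:8888/lab?token=<TOKEN>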