Machine Learning/ML-Lab
This page collects documentation on how to use the ML-team's lab machines (ml-lab1001.eqiad.wmnet and ml-lab1002.eqiad.wmnet).
NOTE: If you have feedback regarding the process, documentation and how well things work, don't hesitate to contact us on IRC (#wikimedia-ml), Slack (#ml-lab-user-group) or via a Phabricator task.
Machine overview
The two machines are identical in hardware configuration:
- 1x AMD EPYC 7643P 48-Core Processor (96 threads)
- 348GiB of RAM
- 2x AMD Instinct MI210 (Aldebaran)
- ~275GiB of SSD for storage (this will be expanded soon)
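To inspect the GPUs and their current utilization, you can use the rocm-smi tool (assuming the ROCm CLI utilities are installed on the hosts, which the shared PyTorch-ROCm setup suggests):
$ rocm-smi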
Access to the machines
Machine access is via the ml-lab-users group in Puppet, which usually contains ML-team members, researchers, and similar people. If you need access, talk to Chris Albon.
SSH logins work along the same lines as the statboxes, but these machines have no access to the Data Lake, Hadoop, etc., and there is no Kerberos.
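For convenience, you can add the hosts to your ~/.ssh/config. A minimal sketch, assuming your production access already goes through a Wikimedia bastion (the bastion hostname and username below are illustrative; use the ones from your own setup):
Host ml-lab*.eqiad.wmnet
    User your-shell-username
    ProxyJump bast1003.wikimedia.org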
Outside access/downloads/proxies
See HTTP proxy#How-to? for how to enable/disable outside web access for pip/Huggingface/... downloads and the like.
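There is typically a set_proxy shell helper (used later on this page) that exports the relevant variables for you. If you need to set them by hand, a minimal sketch, assuming the standard webproxy endpoint documented on the HTTP proxy page (verify the host/port there):
$ export http_proxy=http://webproxy.eqiad.wmnet:8080
$ export https_proxy=http://webproxy.eqiad.wmnet:8080
$ export no_proxy=127.0.0.1,localhost,.wmnet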
PyTorch and ROCm
Since the machines have AMD GPUs, the way to use them for acceleration is via the ROCm set of libraries. Typically, people will use PyTorch's ROCm variant to do so. Note: if you need a different library for some reason, please contact ML-Team first, as there are disk usage considerations.
To get a Python venv with PyTorch-ROCm, you create a new venv and use PYTHONPATH to pick up the ROCm-enabled PyTorch libraries in /srv:
$ mkdir myproject
$ cd myproject/
$ python3 -m venv venv
$ source venv/bin/activate
$ export "PYTHONPATH=/srv/pytorch-rocm/venv/lib/python3.11/site-packages/:$PYTHONPATH"
$ python3
Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
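To confirm that both MI210 GPUs are visible (and not just that the ROCm runtime loaded), you can also query the device count and name in the same session; the output shown here is illustrative:
>>> torch.cuda.device_count()
2
>>> torch.cuda.get_device_name(0)
'AMD Instinct MI210'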
Huggingface Cache
It is strongly recommended to use the shared Huggingface cache, since disk space is still limited (it also saves on download time, even with the caching proxy).
To use the cache automatically when dealing with Huggingface code, set the corresponding environment variable:
export HF_HOME="/srv/hf-cache/"
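To avoid setting this by hand in every shell, you can append it to your ~/.profile (a minimal sketch):
$ echo 'export HF_HOME="/srv/hf-cache/"' >> ~/.profile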
Jupyter
With the information above, we can make a script to run Jupyter notebooks. First, we set up the environment as above:
$ python3 -m venv venv
$ source venv/bin/activate
$ export "PYTHONPATH=/srv/pytorch-rocm/venv/lib/python3.11/site-packages/:$PYTHONPATH"
Then we create a requirements.txt file that has the dependencies for JupyterLab notebooks:
accelerate==1.1.0
bitsandbytes==0.44.1
huggingface-hub==0.26.2
jupyterlab==4.3.0
ipywidgets==8.1.5
optimum==1.23.3
pandas==2.2.3
transformers==4.46.1
We install these the usual way:
$ set_proxy
$ pip install -r requirements.txt
NOTE: Without PYTHONPATH set up correctly, this will download a version of PyTorch that bundles the wrong (NVIDIA) libraries, which will not only fail to work, but also waste several gigabytes of disk space.
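A quick way to verify that the shared ROCm build is the one being picked up is to print where torch resolves from; with PYTHONPATH set as above, the path should point into /srv/pytorch-rocm/ (version string shown is illustrative):
$ python3 -c 'import torch; print(torch.__version__, torch.__file__)'
2.x.x+rocmX.Y /srv/pytorch-rocm/venv/lib/python3.11/site-packages/torch/__init__.py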
We can then run a script like this to launch the JupyterLab server:
#!/bin/bash
# Some variables, should you need to tweak stuff
PYTORCHPATH="/srv/pytorch-rocm/venv/lib/python3.11/site-packages/"
SHARED_HFHOME="/srv/hf-cache/"
JUPLABPORT=8889
VENV=venv

set -e

if [ ! -r "$VENV"/bin/activate ]; then
    echo "$VENV/bin/activate does not exist, exiting" >&2
    exit 1
fi
source "$VENV/bin/activate"

if [ -z "$PYTHONPATH" ]; then
    echo "Setting PYTHONPATH to $PYTORCHPATH"
    export PYTHONPATH="$PYTORCHPATH"
else
    echo "Prefixing PYTHONPATH with $PYTORCHPATH"
    export PYTHONPATH="$PYTORCHPATH:$PYTHONPATH"
fi

echo "Setting HF_HOME to $SHARED_HFHOME"
export HF_HOME="$SHARED_HFHOME"

# Restrict PyTorch to the second GPU (device index 1); adjust or unset to use both
export CUDA_VISIBLE_DEVICES=1

# Prompt for Hugging Face login
read -rp "Do you want to log in to Hugging Face? (y/n, default=n): " hf_login
hf_login=${hf_login:-n} # Default to 'n' if no input
if [[ "$hf_login" == "y" ]]; then
    huggingface-cli login
else
    echo "Skipping Hugging Face login"
fi

# Start Jupyter Lab
echo "Starting Jupyter Lab on port $JUPLABPORT"
jupyter lab --no-browser --port="$JUPLABPORT"
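Save the script in your project directory (the filename run-jupyter.sh below is just an example), make it executable, and run it:
$ chmod +x run-jupyter.sh
$ ./run-jupyter.sh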
When the Jupyter Server is up, you will see something like:
[C 2024-11-14 10:46:41.272 ServerApp]
To access the server, open this file in a browser:
file:///srv/home/aikochou/.local/share/jupyter/runtime/jpserver-633556-open.html
Or copy and paste one of these URLs:
http://localhost:8889/lab?token=3cc2922e89194d0b42a03225da447a78e5dda7b5d528fa45
http://127.0.0.1:8889/lab?token=3cc2922e89194d0b42a03225da447a78e5dda7b5d528fa45
Remember your token; you will need it in the next step.
SSH tunnel
Open another terminal and use the following command to open an SSH tunnel to ml-lab:
$ ssh -L <LOCAL_PORT_YOU_WANT>:localhost:<JUPLAB_PORT> ml-lab1001.eqiad.wmnet
and open the following URL in your browser:
http://localhost:<LOCAL_PORT_YOU_WANT>/lab?token=<TOKEN>
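For example, with local port 8888 and the default JUPLABPORT of 8889 from the script above:
$ ssh -L 8888:localhost:8889 ml-lab1001.eqiad.wmnet
and then browse to:
http://localhost:8888/lab?token=<TOKEN>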