Data Platform/Systems/Conda

We use Conda to manage packages and virtual environments on the stat hosts.

Environments are created by cloning Conda-Analytics, a custom Conda distribution maintained by the Data Platform Engineering team.

Use with Jupyter

For instructions on using Conda within our Jupyter environment, see Data Platform/Systems/Jupyter#Conda environments.

Use outside Jupyter

This section applies to Conda use outside of Jupyter (that is, when you connect to one of the analytics clients with a plain SSH terminal session).

In most cases, you can use the standard Conda commands (e.g. conda install, conda remove, conda list, conda deactivate). This section covers the exceptions where we have custom commands to support our cloning-based workflow.

Creating a new environment

In the terminal, run conda-analytics-clone. This creates a new clone of conda-analytics for you in ~/.conda/envs.

By default, the environment is named with the creation time and your username (for example, 2022-11-04T19.32.00_xcollazo). If you prefer, you can give it a custom name: conda-analytics-clone my-cool-env.

Listing environments

$ conda-analytics-list
# conda environments:
#
2022-11-04T19.32.00_xcollazo     /home/xcollazo/.conda/envs/2022-11-04T19.32.00_xcollazo
2022-11-08T15.39.32_xcollazo     /home/xcollazo/.conda/envs/2022-11-08T15.39.32_xcollazo
2022-11-09T20.10.01_xcollazo     /home/xcollazo/.conda/envs/2022-11-09T20.10.01_xcollazo
base                  *  /opt/conda-analytics
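
For scripting, a listing in this format can be reduced to bare environment names. The following is a minimal sketch, assuming the output keeps the layout shown above (comment lines starting with #, then one environment per line with its name in the first column). list_env_names is a hypothetical helper, not part of conda-analytics.

```shell
# list_env_names: read a Conda environment listing on stdin and print only
# the environment names. Hypothetical helper, not a conda-analytics command.
list_env_names() {
    # Skip comment lines and blank lines; the name is assumed to be the
    # first whitespace-separated field on each remaining line.
    awk '!/^#/ && NF > 0 { print $1 }'
}

# Usage on a stat host (assumes conda-analytics-list is on your PATH):
#   conda-analytics-list | list_env_names
```

The helper only reads from standard input, so it can also be tried against a saved copy of the listing.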

Activating an environment

Run source conda-analytics-activate my-cool-env.

You can achieve the same thing with vanilla commands:

$ source /opt/conda-analytics/etc/profile.d/conda.sh
$ conda activate my-cool-env

You can also activate the read-only base environment by running source conda-analytics-activate base.

Installing packages

With a Conda environment activated, you can install packages by running conda install {{package}} in the terminal. If you are using Conda outside of Jupyter, you will first have to set your environment to use the HTTP proxy.

Conda will install packages from the Conda Forge channel by default. You can manually select a different channel by adding --channel {{channel}} to the command. The easiest way to search Conda Forge for a specific package is to do a regular web search with the qualifier "site:anaconda.org/conda-forge/".

If a Python package you need is not available from Conda Forge, you can install it with pip instead.

Pinned package management

Each cloned environment comes with a pinned file whose main purpose is to prevent core packages from being automatically upgraded.

The pinned file is located in the conda-meta directory of each environment and can be customised to your liking.
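
As an illustration of customising the pinned file, the sketch below appends a pin only if it is not already listed. It assumes the file is named pinned inside conda-meta (the standard Conda location) and that it holds one package spec per line. The add_pin helper and the numpy version are hypothetical examples, not part of conda-analytics.

```shell
# add_pin ENV_DIR SPEC: append SPEC to ENV_DIR/conda-meta/pinned unless an
# identical line is already there. Hypothetical helper for illustration.
add_pin() {
    pin_env_dir="$1"
    pin_spec="$2"
    pin_file="$pin_env_dir/conda-meta/pinned"

    # Make sure the directory and file exist before reading them.
    mkdir -p "$pin_env_dir/conda-meta"
    touch "$pin_file"

    # -x matches the whole line, -F treats the spec as a literal string.
    if ! grep -qxF -- "$pin_spec" "$pin_file"; then
        printf '%s\n' "$pin_spec" >> "$pin_file"
    fi
}

# Usage with an activated environment (conda activate sets CONDA_PREFIX);
# the version below is only an example:
#   add_pin "$CONDA_PREFIX" 'numpy 1.26.*'
```

Checking for an existing line first keeps the file free of duplicate pins if the command is run more than once.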

Troubleshooting

Spark 3 insert statement requirements

Using an INSERT statement in Spark 3 SQL, or write.insertInto() in PySpark 3, causes the environment's Python executable to be called. If the code runs from a cron job that loads a custom Python environment, this can fail because that executable isn't available on the cluster. One way to solve this is to create the Spark session with wmfdata.spark.create_session(ship_python_env=True), which ships the Python environment to the cluster nodes.

R support

R is not included by default, but can easily be installed. See Data Platform/Systems/R for details.

Administration

The Conda-Analytics base environment is based on Miniconda and includes extra packages specific to our needs, as well as scripts for cloning the environment. On the stat hosts, it is available in /opt/conda-analytics.

The code used to build new releases of conda-analytics lives in gitlab:repos/data-engineering/conda-analytics/. The actual releases live in the associated package registry.