Data Platform/Systems/R

From Wikitech

R can be used for data analysis on the stat hosts, both inside and outside Jupyter.

Installing

In Jupyter

Open a Jupyter terminal and run the following commands:

  1. conda install r-base r-irkernel
  2. R --no-echo -e "IRkernel::installspec()"

Outside Jupyter

Activate a Conda environment, then follow the instructions above.

Installing packages

When you want to install a package, first try installing it using Conda.

Conda prefixes R package names with r-. For example, for the Tidyverse, you run conda install r-tidyverse and for BRMS, you run conda install r-brms.

If installing with Conda doesn't work, you can use R's package manager by running install.packages() during an R session.
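As a minimal sketch of that fallback, the snippet below calls install.packages() only when a package can't already be loaded. The helper name install_if_missing is hypothetical and used purely for illustration; the default repository URL is an assumption and can be any CRAN mirror.

```r
# Hypothetical helper (not part of any package): install a package with
# R's package manager only if it is not already available, for example
# because Conda already provided it.
install_if_missing <- function(pkg, repos = "https://cloud.r-project.org") {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # already installed, nothing to do
  }
  install.packages(pkg, repos = repos)
  invisible(TRUE)  # an install was attempted
}

# "stats" ships with R, so this call installs nothing.
install_if_missing("stats")
```

Checking first avoids replacing a Conda-managed copy of a package with a second copy built by R's package manager.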

If you often need to install packages with R's package manager, it's helpful to create an ~/.Rprofile file with the following:

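options(
  repos = c(
    # CRAN mirror with automatic server redirection
    CRAN = "https://cloud.r-project.org",
    # Additional repo for Stan packages
    STAN = "https://stan-dev.r-universe.dev"
  )
)
# Compile packages from source using up to 4 parallel make jobs
Sys.setenv(MAKEFLAGS = "-j4")
# Have the V8 package download a static libv8 build when it is installed
Sys.setenv(DOWNLOAD_STATIC_LIBV8 = 1)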

Querying the data lake and MariaDB in R

The Reticulate R package lets you call Python from R, so you can use it to query the data lake with Wmfdata-Python.

First, install Reticulate:

conda install r-reticulate

Wmfdata-Python is installed by default in your Conda environment, so you don't need to install it. You should then be able to run the following R code to connect Reticulate to your active Conda environment and import Wmfdata:

library(reticulate)

Sys.getenv("CONDA_PREFIX") |> use_condaenv()
wmf <- import('wmfdata')

You should then be able to use the various Wmfdata backends for queries. For example, you can use wmf$spark$run() to run a Spark query.
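Put together, a Spark query from R might look like the sketch below. It assumes that wmf$spark$run() accepts the SQL text as its first argument and that Reticulate converts the returned pandas data frame into an R data frame; the query itself is a placeholder, not a real data lake table.

```r
library(reticulate)

# Connect Reticulate to the active Conda environment and import Wmfdata.
Sys.getenv("CONDA_PREFIX") |> use_condaenv()
wmf <- import("wmfdata")

# Placeholder query: replace it with a query against a table you can access.
query <- "SELECT 1 AS example_column"

# Assumption: the SQL string is the first argument, and the result comes
# back from Python as a pandas data frame that Reticulate converts.
result <- wmf$spark$run(query)
str(result)
```

This only runs on a stat host with a working Conda environment and access to the cluster.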

Tips

New features

The version of R from Conda-Forge is 4.2 or newer, so newer language features are available, such as raw string syntax (r"(...)") and the built-in pipe operator (|>), which replaces magrittr's %>% in most cases.
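A short illustration of both features, using only base R. The variable names and sample values are made up for the example.

```r
# Raw string: backslashes need no escaping between r"( and )".
pattern <- r"(\d+)"

# The same pattern as an ordinary string needs a doubled backslash.
same_pattern <- "\\d+"

# The pattern matches any run of digits in the sample text.
has_digits <- grepl(pattern, "sample123")

# Native pipe: the left-hand side becomes the first argument of the
# call on the right, so this is sum(sqrt(c(1, 4, 9))).
total <- c(1, 4, 9) |> sqrt() |> sum()
```

Note that the right-hand side of |> must be a function call with parentheses, such as sqrt(), not a bare function name.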

Style guide

Consider following the Product Analytics R style guide in your code to promote standardization and easy code review and sharing.

Troubleshooting

Conda encounters a "LibMambaUnsatisfiableError" when installing Arrow or Stringi

For reasons that are not yet understood, Conda encounters an unsolvable dependency conflict when trying to install both r-arrow and r-stringi in the same environment (phab:T391911). Many packages depend on Stringi (including Tidyr, a core Tidyverse package), so this can come up in many different situations.

The best workaround is not to install Arrow or, if it's already installed, to uninstall it. You can use Nanoparquet (conda install r-nanoparquet) instead to read Parquet files.
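As a rough sketch of the Nanoparquet route, the snippet below writes a small data frame to a temporary Parquet file and reads it back. The data frame is made up for illustration; in practice you would pass the path of an existing Parquet file to read_parquet().

```r
library(nanoparquet)

# Made-up example data, written to a temporary Parquet file.
example <- data.frame(id = 1:3, label = c("a", "b", "c"))
path <- tempfile(fileext = ".parquet")
write_parquet(example, path)

# Read the file back into a data frame.
roundtrip <- read_parquet(path)
```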