Data Platform/Systems/R
R can be used for data analysis on the stat hosts, both inside and outside Jupyter.
Installing
In Jupyter
Open a Jupyter terminal and run the following commands:
conda install r-base r-irkernel
R --no-echo -e "IRkernel::installspec()"
Outside Jupyter
Activate a Conda environment, then follow the instructions above.
Installing packages
When you want to install a package, first try installing it using Conda.
Conda prefixes R package names with r-. For example, for the Tidyverse, you run conda install r-tidyverse and for BRMS, you run conda install r-brms.
If installing with Conda doesn't work, you can use R's package manager by running install.packages() during an R session.
If you often need to install packages with R's package manager, it's helpful to create an ~/.Rprofile file with the following:
options(
repos = c(
# CRAN mirror with automatic server redirection
CRAN = "https://cloud.r-project.org",
# Additional repo for STAN packages
STAN = "https://stan-dev.r-universe.dev"
)
)
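# Let package compilation use up to 4 parallel make jobs
Sys.setenv(MAKEFLAGS = "-j4")
# Have the V8 package download a static libv8 build when it is installed
Sys.setenv(DOWNLOAD_STATIC_LIBV8 = 1)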
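As a rough illustration of falling back to R's package manager, the sketch below installs a package only when it is not already available. The helper name install_if_missing is hypothetical and not part of any package. It assumes the repos option has been set, for example by a profile like the one above.

```r
# Hypothetical helper (not from any package): install a package with R's
# package manager only if it cannot already be loaded.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # Uses the repositories configured in getOption("repos")
    install.packages(pkg)
  }
  # Report whether the package is available afterwards
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this returns TRUE without installing anything
install_if_missing("stats")
```

Trying Conda first is still the recommended route. A helper like this is only meant for packages that Conda cannot provide.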
Querying the data lake and MariaDB in R
The Reticulate R package lets R call Python, so you can use it to query the data lake with Wmfdata-Python.
First, install Reticulate:
conda install r-reticulate
Wmfdata-Python is installed by default in your Conda environment, so you don't need to install it. You should then be able to run the following R code to connect Reticulate to your active Conda environment and import Wmfdata:
library(reticulate)
Sys.getenv("CONDA_PREFIX") |> use_condaenv()
wmf <- import('wmfdata')
You should then be able to use the various Wmfdata backends for queries. For example, you can use wmf$spark$run() to run a Spark query.
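A minimal sketch of such a Spark query is shown below. The wrapper function run_spark_query and the sample SQL are hypothetical and only for illustration. The sketch assumes it runs on a stat host inside an active Conda environment where Reticulate and Wmfdata-Python are available, and that the first argument of wmf$spark$run() is the SQL text.

```r
# Hypothetical wrapper (not part of Wmfdata): connect Reticulate to the
# active Conda environment and run one Spark SQL query through Wmfdata.
run_spark_query <- function(sql) {
  # Point Reticulate at the Conda environment that is currently active
  reticulate::use_condaenv(Sys.getenv("CONDA_PREFIX"))
  wmf <- reticulate::import("wmfdata")
  # Assumption: the SQL text is the first argument of spark.run()
  wmf$spark$run(sql)
}

# Usage on a stat host (the query is a placeholder that needs no table):
# result <- run_spark_query("SELECT 1 AS one")
```

Wmfdata returns query results to Python, so what R receives depends on how Reticulate converts the Python object. Inspect the result with str() before relying on its structure.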
Tips
New features
Because the version of R that comes from Conda-Forge is 4.2 or newer, you can use newer language features such as the raw string syntax and the built-in pipe operator (|>), which replaces the need for magrittr's %>%.
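The short sketch below illustrates both features in base R. It assumes the string syntax referred to above is the raw string syntax introduced in R 4.0. The sample data and variable names are made up for the example.

```r
# Raw string syntax (R >= 4.0): backslashes do not need to be escaped
pattern <- r"(^\d{4}-\d{2}-\d{2}$)"
# The same pattern written as a regular string needs doubled backslashes
same_pattern <- "^\\d{4}-\\d{2}-\\d{2}$"
identical(pattern, same_pattern)

# Native pipe (R >= 4.1): the left-hand side becomes the first unnamed
# argument of the call on the right-hand side
dates <- c("2024-01-15", "not a date", "2023-12-31")
n_valid <- dates |>
  grepl(pattern = pattern) |>
  sum()
n_valid
```

Unlike magrittr's %>%, the native pipe needs no package to be loaded, but the right-hand side must be written as a function call with parentheses.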
Style guide
Consider following the Product Analytics R style guide in your code to promote standardization and easy code review and sharing.
Troubleshooting
Conda encounters a "LibMambaUnsatisfiableError" when installing Arrow or Stringi
For some reason, Conda encounters an unsolvable dependency conflict when trying to install both r-arrow and r-stringi (phab:T391911). Many packages depend on Stringi (including Tidyr, a core Tidyverse library), so this can come up in many different situations.
The best workaround is not to install Arrow or, if it's already installed, to uninstall it. You can use Nanoparquet (conda install r-nanoparquet) instead to read Parquet files.
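As a minimal sketch of the workaround, the example below writes a small data frame to a temporary Parquet file and reads it back with Nanoparquet. The data and the file path are made up for the example, and it assumes r-nanoparquet is already installed.

```r
library(nanoparquet)

# Made-up sample data, written to a temporary Parquet file
sample_data <- data.frame(id = 1:3, name = c("a", "b", "c"))
path <- tempfile(fileext = ".parquet")
write_parquet(sample_data, path)

# Read the file back into a data frame
df <- read_parquet(path)
nrow(df)
```

For an existing file, only the read_parquet() call is needed, with path pointing at that file.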