Data Platform/Internal API requests
This page documents how to query the MediaWiki Action API, the MediaWiki REST API, and the Wikimedia REST API internally from R and Python, rather than sending requests over the Internet. The code examples here were tested on stat1010.eqiad.wmnet.
Both the R and Python approaches assume that the HTTPS_PROXY, https_proxy, NO_PROXY, and no_proxy environment variables are already set. Refer to HTTP proxy for setting them manually if they become unset.
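For example, in Python you can check that the variables are set and restore them for the current process if needed. This is a minimal sketch: the proxy URL and no_proxy list below are assumptions, so verify them against the HTTP proxy page before relying on them.
import os
# Assumed values based on the HTTP proxy page -- verify them there:
proxy_vars = {
    'https_proxy': 'http://webproxy.eqiad.wmnet:8080',
    'HTTPS_PROXY': 'http://webproxy.eqiad.wmnet:8080',
    'no_proxy': 'wikimedia.org,wmnet,localhost,127.0.0.1',
    'NO_PROXY': 'wikimedia.org,wmnet,localhost,127.0.0.1',
}
# Only set a variable if it is not already present in the environment:
for var, value in proxy_vars.items():
    os.environ.setdefault(var, value)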
In some cases, it is preferable to use the Service Proxy rather than connecting to a discovery.wmnet hostname directly. This requires the envoy proxy to be installed, with a listener configured for each API service that you wish to use.
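With a listener in place, requests go to a port on localhost instead of the discovery.wmnet hostname. A minimal sketch, assuming a hypothetical listener for the MediaWiki API on localhost port 6500 (the real port depends on the listener configuration, so check the Service Proxy page):
import requests
# Hypothetical envoy listener port -- replace with the port actually
# configured for the API service on your host:
resp = requests.get(
    'http://localhost:6500/w/api.php',
    headers={'Host': 'en.wikipedia.org'},
    params={'action': 'query', 'meta': 'siteinfo', 'format': 'json'},
)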
Python
Fixing SSL certificate verification error
To avoid
SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate
when running code from a virtual environment (e.g. the conda-analytics environment in JupyterHub on stat hosts), use:
import os
# Point the requests library at the system CA certificate bundle:
os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'
Thank you to Ben Tullis for figuring this out.[1]
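Alternatively, the CA bundle can be passed per request through the verify parameter of requests, instead of setting the environment variable:
import requests
# Use the system CA bundle for this request only:
resp = requests.get(
    'https://mw-api-int-ro.discovery.wmnet:4446/w/api.php',
    headers={'Host': 'en.wikipedia.org'},
    params={'action': 'query', 'meta': 'siteinfo', 'format': 'json'},
    verify='/etc/ssl/certs/ca-certificates.crt',
)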
Using requests
With the requests library:
import requests
# Internal read-only endpoint for the MediaWiki Action API; the Host
# header selects which wiki answers the request:
url = 'https://mw-api-int-ro.discovery.wmnet:4446/w/api.php'
headers = {'Host': 'en.wikipedia.org'}
payload = {
    'action': 'query',
    'prop': 'info',
    'titles': 'R_(programming_language)|Python_(programming_language)',
    'format': 'json'
}
resp = requests.get(url, headers=headers, params=payload).json()
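The MediaWiki REST API is served from /w/rest.php on the same hosts, so the same endpoint and Host header should work for it as well. A sketch, assuming the standard v1 page endpoint is reachable this way:
# Fetch page metadata from the MediaWiki REST API over the same
# internal endpoint:
rest_url = ('https://mw-api-int-ro.discovery.wmnet:4446'
            '/w/rest.php/v1/page/R_(programming_language)/bare')
page = requests.get(rest_url, headers={'Host': 'en.wikipedia.org'}).json()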
Using mwapi
With the mwapi library (which also requires the REQUESTS_CA_BUNDLE environment variable to be set):
import mwapi
# As with requests above, the Host header selects the wiki:
session = mwapi.Session(host='https://mw-api-int-ro.discovery.wmnet:4446')
session.headers['Host'] = 'en.wikipedia.org'
resp = session.get(
    action='query',
    prop='info',
    titles='R_(programming_language)|Python_(programming_language)'
)
Thank you to Lucas Werkmeister for figuring this out.[2]
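mwapi can also follow API continuation for you: with continuation=True, get() returns a generator that yields one portion of results per underlying request. A sketch, assuming the default formatversion, where pages is keyed by page ID:
# Follow query continuation automatically; each portion is one response:
for portion in session.get(
    action='query',
    prop='info',
    titles='R_(programming_language)|Python_(programming_language)',
    continuation=True
):
    for page in portion['query']['pages'].values():
        print(page['title'], page['lastrevid'])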
DataFrame from API response
To convert the response into a nice data frame, we can use from_dict from pandas:
import pandas as pd
# orient='index' makes the page IDs (the keys of the pages dict) the index:
page_info = pd.DataFrame.from_dict(resp['query']['pages'], orient='index')
| | pageid | ns | title | contentmodel | pagelanguage | pagelanguagehtmlcode | pagelanguagedir | touched | lastrevid | length |
|---|---|---|---|---|---|---|---|---|---|---|
| 23862 | 23862 | 0 | Python (programming language) | wikitext | en | en | ltr | 2022-05-17T15:11:33Z | 1088356878 | 146500 |
| 376707 | 376707 | 0 | R (programming language) | wikitext | en | en | ltr | 2022-05-16T17:53:29Z | 1087609113 | 59925 |
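Because orient='index' puts the page IDs in both the index and the pageid column, the index can be dropped if it is not needed:
page_info = page_info.reset_index(drop=True)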
R
Using the httr2 package:
library(httr2)
req <- request("https://mw-api-int-ro.discovery.wmnet:4446/w/api.php") %>%
  req_headers("Host" = "en.wikipedia.org")
req <- req %>%
  req_url_query(
    action = "query",
    prop = "info",
    titles = "R_(programming_language)|Python_(programming_language)",
    format = "json"
  )
# Fix the error "SSL certificate problem: unable to get local issuer
# certificate" by disabling peer verification:
req <- req %>%
  req_options(ssl_verifypeer = 0)
# Perform the request and parse the JSON body:
resp <- req %>%
  req_perform() %>%
  resp_body_json()
To convert the response into a nice data frame, we can use map_dfr from purrr and as_tibble from tibble:
library(tidyverse)
page_info <- resp$query$pages %>%
  map_dfr(as_tibble)
| pageid | ns | title | contentmodel | pagelanguage | pagelanguagehtmlcode | pagelanguagedir | touched | lastrevid | length |
|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <int> | <int> |
| 23862 | 0 | Python (programming language) | wikitext | en | en | ltr | 2022-05-17T15:11:33Z | 1088356878 | 146500 |
| 376707 | 0 | R (programming language) | wikitext | en | en | ltr | 2022-05-16T17:53:29Z | 1087609113 | 59925 |
See Also
Service Proxy - The standard framework for inter-service communication.
Noc.wikimedia.org - A service that can provide authoritative configuration references.