Data Platform/Internal API requests
This page documents how to query the MediaWiki Action API, the MediaWiki REST API, and the Wikimedia REST API internally from R and Python, rather than sending requests over the Internet. The code examples here were tested on stat1010.eqiad.wmnet.
Both the R and Python approaches assume that the HTTPS_PROXY, https_proxy, NO_PROXY, and no_proxy environment variables are already set. Refer to HTTP proxy for setting them manually if they become unset.
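For example, in Python you can check that the variables are set and restore them for the current process if needed. This is a minimal sketch: the proxy URL and no_proxy list below are assumptions, so verify them against the HTTP proxy page before relying on them.
import os
# Assumed values based on the HTTP proxy page -- verify them there:
proxy_vars = {
    'https_proxy': 'http://webproxy.eqiad.wmnet:8080',
    'HTTPS_PROXY': 'http://webproxy.eqiad.wmnet:8080',
    'no_proxy': 'wikimedia.org,wmnet,localhost,127.0.0.1',
    'NO_PROXY': 'wikimedia.org,wmnet,localhost,127.0.0.1',
}
# Only set a variable if it is not already present in the environment:
for var, value in proxy_vars.items():
    os.environ.setdefault(var, value)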
In some cases, it is preferable to use the Service Proxy rather than connecting to a discovery.wmnet hostname directly. This requires the envoy proxy to be installed, with a listener configured for each API service that you wish to use.
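With a listener in place, requests go to a port on localhost instead of the discovery.wmnet hostname. A minimal sketch, assuming a hypothetical listener for the MediaWiki API on localhost port 6500 (the real port depends on the listener configuration, so check the Service Proxy page):
import requests
# Hypothetical envoy listener port -- replace with the port actually
# configured for the API service on your host:
resp = requests.get(
    'http://localhost:6500/w/api.php',
    headers={'Host': 'en.wikipedia.org'},
    params={'action': 'query', 'meta': 'siteinfo', 'format': 'json'},
)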
Python
Fixing SSL certificate verification error
To avoid
SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate
when running code from a virtual environment (e.g. the conda-analytics environment in JupyterHub on stat hosts), use:
import os
# Point the requests library at the system CA certificate bundle:
os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'
Thank you to Ben Tullis for figuring this out.[1]
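Alternatively, the CA bundle can be passed per request through the verify parameter of requests, instead of setting the environment variable:
import requests
# Use the system CA bundle for this request only:
resp = requests.get(
    'https://mw-api-int-ro.discovery.wmnet:4446/w/api.php',
    headers={'Host': 'en.wikipedia.org'},
    params={'action': 'query', 'meta': 'siteinfo', 'format': 'json'},
    verify='/etc/ssl/certs/ca-certificates.crt',
)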
Using requests
With the requests library:
import requests
# Internal read-only endpoint for the MediaWiki Action API; the Host
# header selects which wiki answers the request:
url = 'https://mw-api-int-ro.discovery.wmnet:4446/w/api.php'
headers = {'Host': 'en.wikipedia.org'}
payload = {
    'action': 'query',
    'prop': 'info',
    'titles': 'R_(programming_language)|Python_(programming_language)',
    'format': 'json'
}
resp = requests.get(url, headers=headers, params=payload).json()
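The MediaWiki REST API is served from /w/rest.php on the same hosts, so the same endpoint and Host header should work for it as well. A sketch, assuming the standard v1 page endpoint is reachable this way:
# Fetch page metadata from the MediaWiki REST API over the same
# internal endpoint:
rest_url = ('https://mw-api-int-ro.discovery.wmnet:4446'
            '/w/rest.php/v1/page/R_(programming_language)/bare')
page = requests.get(rest_url, headers={'Host': 'en.wikipedia.org'}).json()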
Using mwapi
With the mwapi library (which also requires the REQUESTS_CA_BUNDLE environment variable to be set):
import mwapi
# As with requests above, the Host header selects the wiki:
session = mwapi.Session(host='https://mw-api-int-ro.discovery.wmnet:4446')
session.headers['Host'] = 'en.wikipedia.org'
resp = session.get(
    action='query',
    prop='info',
    titles='R_(programming_language)|Python_(programming_language)'
)
Thank you to Lucas Werkmeister for figuring this out.[2]
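mwapi can also follow API continuation for you: with continuation=True, get() returns a generator that yields one portion of results per underlying request. A sketch, assuming the default formatversion, where pages is keyed by page ID:
# Follow query continuation automatically; each portion is one response:
for portion in session.get(
    action='query',
    prop='info',
    titles='R_(programming_language)|Python_(programming_language)',
    continuation=True
):
    for page in portion['query']['pages'].values():
        print(page['title'], page['lastrevid'])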
DataFrame from API response
To convert the response into a nice data frame, we can use from_dict from pandas:
import pandas as pd
# orient='index' makes the page IDs (the keys of the pages dict) the index:
page_info = pd.DataFrame.from_dict(resp['query']['pages'], orient='index')
| | pageid | ns | title | contentmodel | pagelanguage | pagelanguagehtmlcode | pagelanguagedir | touched | lastrevid | length |
|---|---|---|---|---|---|---|---|---|---|---|
| 23862 | 23862 | 0 | Python (programming language) | wikitext | en | en | ltr | 2022-05-17T15:11:33Z | 1088356878 | 146500 |
| 376707 | 376707 | 0 | R (programming language) | wikitext | en | en | ltr | 2022-05-16T17:53:29Z | 1087609113 | 59925 |
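Because orient='index' puts the page IDs in both the index and the pageid column, the index can be dropped if it is not needed:
page_info = page_info.reset_index(drop=True)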
R
Using the httr2 package:
library(httr2)
req <- request("https://mw-api-int-ro.discovery.wmnet:4446/w/api.php") %>%
  req_headers("Host" = "en.wikipedia.org")
req <- req %>%
  req_url_query(
    action = "query",
    prop = "info",
    titles = "R_(programming_language)|Python_(programming_language)",
    format = "json"
  )
# Fix the error "SSL certificate problem: unable to get local issuer
# certificate" by disabling peer verification:
req <- req %>%
  req_options(ssl_verifypeer = 0)
# Perform the request and parse the JSON body:
resp <- req %>%
  req_perform() %>%
  resp_body_json()
To convert the response into a nice data frame, we can use map_dfr from purrr and as_tibble from tibble:
library(tidyverse)
page_info <- resp$query$pages %>%
  map_dfr(as_tibble)
| pageid | ns | title | contentmodel | pagelanguage | pagelanguagehtmlcode | pagelanguagedir | touched | lastrevid | length |
|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <int> | <int> |
| 23862 | 0 | Python (programming language) | wikitext | en | en | ltr | 2022-05-17T15:11:33Z | 1088356878 | 146500 |
| 376707 | 0 | R (programming language) | wikitext | en | en | ltr | 2022-05-16T17:53:29Z | 1087609113 | 59925 |
See Also
Service Proxy - The standard framework for inter-service communication.
Noc.wikimedia.org - A service that can provide authoritative configuration references.