Data Platform/Internal API requests

From Wikitech

This page documents how to query the MediaWiki Action API, the MediaWiki REST API, and the Wikimedia REST API internally from R and Python, rather than sending requests over the Internet. The code examples here were tested on stat1010.eqiad.wmnet.

Both the R and Python approaches assume that the HTTPS_PROXY, https_proxy, NO_PROXY, and no_proxy environment variables are already set. If they get unset, refer to HTTP proxy for how to set them manually.
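A quick way to confirm that assumption before sending any requests is to check the four variables directly. The following is a minimal sketch; the names PROXY_VARS and missing_proxy_vars are illustrative and not part of any library.

```python
import os

# The four variables this page assumes are set.
PROXY_VARS = ("HTTPS_PROXY", "https_proxy", "NO_PROXY", "no_proxy")


def missing_proxy_vars(environ=os.environ):
    """Return the names of the proxy variables that are unset or empty."""
    return [name for name in PROXY_VARS if not environ.get(name)]


missing = missing_proxy_vars()
if missing:
    print("Unset proxy variables:", ", ".join(missing))
else:
    print("All proxy variables are set.")
```

If any names are reported, set them as described on the HTTP proxy page before continuing.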

In some cases, it is preferable to use the Service Proxy rather than connecting to a discovery.wmnet hostname. Doing so requires the envoy proxy to be installed and a listener to be configured for each API service that you wish to use.

Python

Fixing SSL certificate verification error

Running the code from a virtual environment (e.g. the conda-analytics environment in JupyterHub on stat hosts) can fail with:

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate

To avoid this error, point REQUESTS_CA_BUNDLE at the system CA bundle:

import os

os.environ['REQUESTS_CA_BUNDLE'] = '/etc/ssl/certs/ca-certificates.crt'

Thank you to Ben Tullis for figuring this out.[1]
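In shared notebooks it can be useful to apply this setting only when the bundle file actually exists on the host. The following is a minimal sketch under that assumption; the helper name use_system_ca_bundle is hypothetical, and the path is the one given above.

```python
import os

# Path taken from the example above.
SYSTEM_CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt"


def use_system_ca_bundle(path=SYSTEM_CA_BUNDLE, environ=os.environ):
    """Set REQUESTS_CA_BUNDLE to `path` if that file exists.

    Returns True when the variable was set and False when the file is missing.
    """
    if not os.path.isfile(path):
        return False
    environ["REQUESTS_CA_BUNDLE"] = path
    return True


if not use_system_ca_bundle():
    print("CA bundle not found at", SYSTEM_CA_BUNDLE)
```

Call the helper before making any requests, in the same place where the plain os.environ assignment would otherwise go.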

Using requests

With the requests library:

import requests

url = 'https://mw-api-int-ro.discovery.wmnet:4446/w/api.php'

headers = {'Host': 'en.wikipedia.org'}

payload = {
    'action': 'query',
    'prop': 'info',
    'titles': 'R_(programming_language)|Python_(programming_language)',
    'format': 'json'
}

resp = requests.get(url, headers=headers, params=payload).json()
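When the same pattern is needed for several wikis, the URL, Host header, and query parameters can be assembled in one place. The following is a sketch only: build_request_args and query_internal_api are hypothetical helper names, the timeout value is an arbitrary choice, and the sketch assumes that the Host header selects the target wiki as in the example above.

```python
# Internal endpoint taken from the example above.
INTERNAL_API_URL = "https://mw-api-int-ro.discovery.wmnet:4446/w/api.php"


def build_request_args(wiki_host, **params):
    """Assemble keyword arguments for requests.get() for one wiki."""
    params.setdefault("format", "json")
    return {
        "url": INTERNAL_API_URL,
        "headers": {"Host": wiki_host},
        "params": params,
    }


def query_internal_api(wiki_host, **params):
    """Send the request and return the decoded JSON body."""
    import requests  # imported here so build_request_args works without it

    resp = requests.get(**build_request_args(wiki_host, **params), timeout=30)
    resp.raise_for_status()  # surface HTTP errors instead of decoding them
    return resp.json()


args = build_request_args(
    "en.wikipedia.org",
    action="query",
    prop="info",
    titles="R_(programming_language)|Python_(programming_language)",
)
print(args["headers"], sorted(args["params"]))
```

On a stat host, query_internal_api("en.wikipedia.org", action="query", prop="info", titles="...") should be equivalent to the requests.get call shown above, with an added check of the HTTP status.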

Using mwapi

With the mwapi library (which also requires the REQUESTS_CA_BUNDLE environment variable to be set):

import mwapi

session = mwapi.Session(host='https://mw-api-int-ro.discovery.wmnet:4446')
session.headers['Host'] = 'en.wikipedia.org'

resp = session.get(
    action='query',
    prop='info',
    titles='R_(programming_language)|Python_(programming_language)'
)

Thank you to Lucas Werkmeister for figuring this out.[2]

DataFrame from API response

To convert the response into a nice data frame we can use DataFrame.from_dict from pandas, which turns each key of resp['query']['pages'] into a row:

import pandas as pd

page_info = pd.DataFrame.from_dict(resp['query']['pages'], orient='index')
        pageid  ns                          title  contentmodel  pagelanguage  pagelanguagehtmlcode  pagelanguagedir               touched   lastrevid  length
23862    23862   0  Python (programming language)      wikitext            en                    en              ltr  2022-05-17T15:11:33Z  1088356878  146500
376707  376707   0       R (programming language)      wikitext            en                    en              ltr  2022-05-16T17:53:29Z  1087609113   59925
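To make the shape of resp['query']['pages'] concrete, the following sketch works on a small sample whose values are copied from the output above and trimmed to a few fields. It flattens the mapping into a list of records and parses the touched timestamp. The helper name pages_to_records is illustrative.

```python
from datetime import datetime

# Sample shaped like resp['query']['pages']: page IDs (as strings) map to
# per-page dictionaries. Values are copied from the output above.
pages = {
    "23862": {
        "pageid": 23862,
        "ns": 0,
        "title": "Python (programming language)",
        "touched": "2022-05-17T15:11:33Z",
        "length": 146500,
    },
    "376707": {
        "pageid": 376707,
        "ns": 0,
        "title": "R (programming language)",
        "touched": "2022-05-16T17:53:29Z",
        "length": 59925,
    },
}


def pages_to_records(pages):
    """Flatten the pages mapping into a list of dicts, one per page."""
    records = []
    for page in pages.values():
        record = dict(page)  # copy so the original response is not modified
        record["touched"] = datetime.strptime(
            page["touched"], "%Y-%m-%dT%H:%M:%SZ"
        )
        records.append(record)
    return records


records = pages_to_records(pages)
longest = max(records, key=lambda record: record["length"])
print(longest["title"])
```

Passing such a list of records to pd.DataFrame gives a similar data frame with a default integer index instead of the page IDs.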

R

Using the httr2 package:

library(httr2)

req <- request("https://mw-api-int-ro.discovery.wmnet:4446/w/api.php") %>%
    req_headers("Host" = "en.wikipedia.org")

req <- req %>%
    req_url_query(
        action = "query",
        prop = "info",
        titles = "R_(programming_language)|Python_(programming_language)",
        format = "json"
    )

# Work around the error "SSL certificate problem: unable to get local issuer certificate".
# Note: ssl_verifypeer = 0 turns off certificate verification for this request.
req <- req %>%
    req_options(ssl_verifypeer = 0)

# Perform the request:
resp <- req %>%
    req_perform() %>%
    resp_body_json()

To convert the response into a nice data frame we can use map_dfr from purrr and as_tibble from tibble:

library(tidyverse)

page_info <- resp$query$pages %>%
    map_dfr(as_tibble)
# A tibble: 2 × 10
pageid    ns title                         contentmodel pagelanguage pagelanguagehtmlcode pagelanguagedir touched               lastrevid length
 <int> <int> <chr>                         <chr>        <chr>        <chr>                <chr>           <chr>                     <int>  <int>
 23862     0 Python (programming language) wikitext     en           en                   ltr             2022-05-17T15:11:33Z 1088356878 146500
376707     0 R (programming language)      wikitext     en           en                   ltr             2022-05-16T17:53:29Z 1087609113  59925

See Also

Service Proxy - The standard framework for inter-service communication.

Noc.wikimedia.org - A service that can provide authoritative configuration references.

References

  1. T361024#9662135
  2. https://github.com/mediawiki-utilities/python-mwapi/issues/45