Wikidata Query Service/Technical interactions

From Wikitech

This page describes technical interactions with query services such as Wikidata Query Service (WDQS) and Wikimedia Commons Query Service (WCQS).

Context

Wikidata Query Service (WDQS) and Wikimedia Commons Query Service (WCQS) are SPARQL endpoints exposing the data of Wikidata and Wikimedia Commons respectively. They are publicly available at https://query.wikidata.org/ and https://commons-query.wikimedia.org/ respectively.

Both are currently backed by Blazegraph as an RDF Store, with plans to move to a different backend in the future (T206560).

WCQS is considered a Beta service at the moment, and has no SLO and no guarantee that interfaces will be stable (interfaces and endpoints might change in the future). WCQS requires authentication to access the service.

WDQS is considered a production service, but with an SLO that is by design much lower than that of the services usually exposed by the Wikimedia Foundation. The current SLO is that the service should be available 95% of the time with an update lag < 10 minutes.
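To make the 95% figure concrete, the following sketch converts an availability target into a downtime budget. The 30-day measurement window is an assumption for illustration only; this page does not state the window over which the SLO is evaluated.

```python
def allowed_downtime_hours(availability_percent: float, window_days: int) -> float:
    """Hours of unavailability permitted by an availability target over a window."""
    unavailable_fraction = (100 - availability_percent) / 100
    return unavailable_fraction * window_days * 24


# Assumed 30-day window: a 95% target leaves roughly 36 hours of downtime.
budget = allowed_downtime_hours(95, 30)
print(f"{budget:.1f} hours")
```

Under that assumption, a client should expect outages on the order of a day or more per month and still consider the service to be within its SLO.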

WDQS exposes both a public endpoint accessible by anyone on the internet and an internal endpoint accessible only from inside the Wikimedia datacenters. This internal endpoint is meant for integration with other Wikimedia components (mainly MediaWiki at this time).

Constraints on external endpoints

The usual mw:API:Etiquette applies. In particular, bots should use an identifiable user agent and provide contact information.
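As an illustration of an identifiable user agent, here is a minimal sketch that builds (but does not send) a request using only the Python standard library. The bot name, info URL, and email address are placeholders, and the `/sparql` path on the public host is an assumption not stated on this page.

```python
import urllib.parse
import urllib.request

# Hypothetical tool name and contact details; replace with your own.
USER_AGENT = "ExampleWdqsBot/0.1 (https://example.org/bot-info; bot-owner@example.org)"
# Assumed SPARQL path on the public WDQS host.
ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(sparql: str) -> urllib.request.Request:
    """Build a GET request for a SPARQL query with an identifiable User-Agent."""
    params = urllib.parse.urlencode({"query": sparql})
    return urllib.request.Request(
        f"{ENDPOINT}?{params}",
        headers={
            "User-Agent": USER_AGENT,
            "Accept": "application/sparql-results+json",
        },
    )
```

The request can then be passed to `urllib.request.urlopen` with a timeout. The point of the sketch is only that the header names the tool and gives operators a way to reach its owner.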

To prevent unintentional abuse of the service and help to provide fair access to our limited resources to all users, the queries are rate limited.
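Clients should be prepared to back off when they hit the rate limit. This page does not say how throttling is signalled, so the sketch below assumes the conventional HTTP 429 status with a `Retry-After` header given in seconds. The function name and the 60-second fallback are hypothetical.

```python
from typing import Mapping, Optional


def throttle_delay(
    status: int, headers: Mapping[str, str], default: float = 60.0
) -> Optional[float]:
    """Return how many seconds to wait before retrying, or None if not throttled.

    Assumes throttling is signalled with HTTP 429 and that `headers` is a
    mapping keyed by the exact name "Retry-After" (a plain dict is
    case-sensitive).
    """
    if status != 429:
        return None
    raw = headers.get("Retry-After", "")
    try:
        # Retry-After as a number of seconds.
        return max(0.0, float(raw))
    except ValueError:
        # Missing header or HTTP-date form: fall back to a conservative delay.
        return default
```

A caller would sleep for the returned number of seconds before sending the next query, rather than retrying immediately.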

Users should be aware of the low availability of the service (see the 95% availability SLO above) and should design their clients to tolerate outages.

Queries are cached for a limited period of time (currently 5 minutes) to help absorb high spikes in traffic. As HTTP does not provide meaningful ways to negotiate cache duration, clients that might benefit from a longer caching time, or that do not require up-to-date data, should implement a caching mechanism on the client side.
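A client-side cache of the kind suggested above can be very small. The sketch below is one possible shape, not a prescribed implementation: `TtlCache` and its parameters are hypothetical names, and `fetch` stands for whatever function actually sends the query.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Cache query results in memory for a fixed time-to-live."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the cache can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, query: str) -> str:
        now = self._clock()
        entry = self._entries.get(query)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no request is sent
        result = self._fetch(query)
        self._entries[query] = (now, result)
        return result
```

For example, `TtlCache(run_query, ttl_seconds=3600)` would reuse a result for an hour, where `run_query` is the caller's own function. A long-running client would also want to evict old entries, which this sketch omits.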

Constraints on using WDQS/WCQS as part of the Wikimedia Foundation use cases

This section describes the constraints on using WDQS/WCQS as part of the Wikimedia Foundation offering. They apply to integrations with both the public and internal endpoints. The constraints below are put in place to protect the user experience and the overall stability of the Wikimedia ecosystem. The same constraints make sense for use cases external to the Wikimedia Foundation, but it is left to the reuser to decide what does or does not make sense in their specific context.

Query constraints on internal WDQS endpoint

The internal WDQS cluster should only be used for queries that have a known and low complexity. This helps ensure that response times have less variability and that load on the cluster is more predictable. User-supplied queries should never be run against that cluster. Queries constructed from user input are allowed if the user input does not have a major impact on the complexity of the query.
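One way to keep user input from affecting query complexity is to accept only a strictly validated value and substitute it into a fixed template. The sketch below is an illustration under assumptions: the template query, the function name, and the reliance on a predefined `wd:` prefix are not taken from this page.

```python
import re

# Only plain Wikidata item identifiers such as "Q42" are accepted.
_ITEM_ID = re.compile(r"Q[1-9][0-9]*")

# Hypothetical fixed template: the shape and the LIMIT never depend on user input.
# Assumes the endpoint predefines the "wd:" prefix.
_TEMPLATE = "SELECT ?p ?o WHERE { wd:%s ?p ?o } LIMIT 50"


def build_item_query(item_id: str) -> str:
    """Insert a validated item ID into the fixed query template."""
    if not _ITEM_ID.fullmatch(item_id):
        raise ValueError(f"not a valid item ID: {item_id!r}")
    return _TEMPLATE % item_id
```

Because the only variable part is an identifier, the user cannot add joins, property paths, or unbounded result sets, so the cost of the query stays predictable.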

Asynchronous operations only

Given their low availability and high variability in response times, WDQS/WCQS should not be used for synchronous user experiences. In this context, loading additional data from WDQS/WCQS on an HTML page, which is done asynchronously by the browser, is still considered a synchronous operation from the user's point of view: the user will need to wait for WDQS/WCQS to respond before having a complete user experience. Adding a caching layer does not transform a synchronous interaction into an asynchronous operation.

Valid interactions are fully asynchronous: a batch process that precomputes results and stores them before they are offered to users, or stream processing or post-processing hooks with appropriate error management. Clients need to assume that the service will be unavailable, degraded, or slow for extended periods. Either a processing failure must not impact the user experience, or appropriate retry mechanisms need to be implemented.
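For a batch job, a retry mechanism can be as simple as a bounded loop with exponential backoff that reports failure to the caller instead of raising. The sketch below is one possible approach; the function name, attempt count, and delays are hypothetical.

```python
import time
from typing import Callable, Optional


def fetch_with_retry(
    fetch: Callable[[], str],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call `fetch` up to `attempts` times, doubling the delay after each failure.

    Returns None when every attempt fails, so the batch job can keep the
    previously stored result and try again on its next run.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:  # real code should catch only network and timeout errors
            if attempt == attempts - 1:
                return None
            sleep(base_delay * 2 ** attempt)
    return None
```

Since the job runs outside any user request, a `None` result only means that users keep seeing the last precomputed data until a later run succeeds.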

Limit impact on other components

Appropriate care must be taken so that WDQS failures do not affect the components that depend on it. For example, if MediaWiki depends on WDQS, the high variability of response times from WDQS could lead to a starvation of workers on the MediaWiki side. This needs to be mitigated by the use of pool counters, circuit breakers, or other similar mechanisms.
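To illustrate the circuit breaker idea, here is a minimal single-threaded sketch. It is not the mechanism MediaWiki uses; the class names, thresholds, and cool-down are hypothetical. After a number of consecutive failures, calls are rejected immediately for a cool-down period, so workers stop waiting on a slow service.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_failures = max_failures
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                # Fail fast instead of tying up a worker on a slow backend.
                raise CircuitOpenError("circuit is open; call skipped")
            # Cool-down elapsed: allow one trial call; a single failure reopens.
            self._opened_at = None
            self._failures = self._max_failures - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        return result
```

The wrapped function should itself enforce a short timeout; the breaker only limits how often a failing call is attempted. A multi-process deployment would need shared state, which this sketch does not cover.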