WMDE/Wikidata/SSR Service
This page provides a brief overview of the Server-side Rendering Service[1].
Observability
- Grafana dashboard for termbox service
- Grafana dashboard for envoy proxy, filtered for termbox
- Grafana dashboard for Termbox SSR Service Level Objective (SLO)
- Grafana dashboard for Wikidata alerts with a panel showing Termbox request errors (requests from MediaWiki to Termbox)
- Logstash, Logstash 2 (Todo: create some gadgets to see at a glance whether events are spiking, maybe consolidate this)
Details
Overview
The service was introduced in 2019, initially to serve server-side rendered content of the Wikidata/Wikibase "termbox", i.e. the part of the item/property page UI where labels, descriptions and aliases are shown and can be edited.
The service is used as part of generating the HTML output sent from MediaWiki to the user's browser.
The HTML generated server-side is optionally "enhanced" by client-side JavaScript.
There are server-side and client-side variants of the code, which are distributions of the same implementation.
The client-side variant is deployed into Wikibase at the file-system level through git submodules.
If no server-side rendering service is configured, or the service malfunctions, the client-side code acts as a fallback.[2]
Technology
The SSR service is a Node.js service written in TypeScript. The code is "compiled" to JavaScript using webpack. The "compiled" code and CSS can be found in the dist folder of the git repository.
The service uses Vue.js as the UI framework.
The service is deployed on the WMF services Kubernetes cluster using Helm. This means the service is packaged as a Docker image, which is built by the Deployment pipeline.
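To reproduce the build locally, a minimal sketch (the repository URL and npm script name are assumptions; check the repository's package.json and README for the actual commands):
# Clone the termbox repository and build the distributable bundles into dist/
# (repository URL and script name assumed).
git clone https://gerrit.wikimedia.org/r/wikibase/termbox
cd termbox
npm install    # install dependencies, including TypeScript and webpack
npm run build  # assumed script name; compiles TypeScript and CSS into dist/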
Deployment
The images used in production can be found on the WMF Docker registry. New images are built automatically by the deployment pipeline after code is merged to the master branch.
On Beta, the image is simply run by Docker. The configuration for this can be found in the infrastructure folder of the git repo, along with the instructions for applying those changes.
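For illustration, a minimal sketch of running such an image by hand (the image path, tag and required environment configuration are assumptions; consult the registry and the infrastructure folder for the actual values):
# Pull and run a termbox image from the WMF registry (image path and tag assumed).
# Required environment variables are omitted here; see the infrastructure folder.
docker pull docker-registry.wikimedia.org/wikimedia/wikibase-termbox:latest
docker run --rm -p 3030:3030 docker-registry.wikimedia.org/wikimedia/wikibase-termbox:latest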
In Wikimedia production, the service is managed using Kubernetes and Helm. Kubernetes deployments are configured in the operations/deployment-charts repo. There are four releases in total:
- 2 production releases, one for the eqiad cluster and one for codfw. These talk to Wikidata (wikidata.org, wikidatawiki) and are used by Wikidata as well.
- 1 staging release, in the staging cluster. This one also talks to Wikidata, but is not used by anything.
- 1 test release, also in the staging cluster. This one talks to Test Wikidata (test.wikidata.org, testwikidatawiki) and is used by Test Wikidata as well.
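The releases are configured through per-release values files in operations/deployment-charts; roughly the following layout (the paths are an assumption based on the standard helmfile.d service structure):
# Assumed layout of the termbox service in operations/deployment-charts.
helmfile.d/services/termbox/helmfile.yaml        # release and environment definitions
helmfile.d/services/termbox/values.yaml          # shared values, incl. the production version
helmfile.d/services/termbox/values-staging.yaml  # staging overrides (does not override the version)
helmfile.d/services/termbox/values-test.yaml     # test release, pointing at Test Wikidata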
When deploying a new version of the Termbox, you should usually first update the test release (values-test.yaml) and deploy that to the staging cluster, then test that it works on Test Wikidata (check that a newly created item has an SSR termbox). Then, update the version in the production release (values.yaml; this will also update the staging release, because values-staging.yaml does not override the version). If you want to test the staging release before deploying the production release, you will have to do so using curl, because the staging release is not used by any wiki:
curl 'https://staging.svc.eqiad.wmnet:4004/termbox?entity=Q42&revision=1841500264&language=en&editLink=%2Fw%2Findex.php%2FSpecial%3ASetLabelDescriptionAliases%2FQ42&preferredLanguages=en%7Cde'; echo
# should return some HTML starting with <section class="wikibase-entitytermsview"
If this works, then deploy the production release to the eqiad and codfw clusters and check that new Wikidata items have an SSR termbox on mobile.
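A sketch of the corresponding deployment commands, run from the deployment server (the directory path and helmfile invocation follow the standard WMF service workflow and are assumptions; verify against the deployment-charts documentation):
# From the deployment server (service directory path assumed).
cd /srv/deployment-charts/helmfile.d/services/termbox
helmfile -e staging -i apply   # deploy to the staging cluster (staging and test releases)
helmfile -e eqiad -i apply     # deploy production to eqiad
helmfile -e codfw -i apply     # deploy production to codfw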
Some useful metrics for monitoring the deployment can be found in Grafana.
Architecture
Sequence diagram ("source code").
Initial deployment & load details
The initial responsibility of this service was rendering the termbox for Wikidata items and properties for mobile web views.
Currently wikidata.org gets no more than 80k[3] mobile web requests per day (including cached pages, and non item/property pages).
If we were to assume all of these requests went to item and property pages that were not cached, this SSR service would be hit about 55 times per minute.
In reality some of these page views are not to item or property pages, and some will be cached, so we are looking at no more than one call per second.
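For reference, the arithmetic behind these figures:
80,000 requests/day ÷ 1,440 minutes/day ≈ 55.6 requests/minute ≈ 0.93 requests/second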
Availability objectives and accepted operational errors
The Service Level Objective (SLO) for the Termbox SSR is an error rate of less than 0.1%. The current error rate and numbers of errors can be seen at the Grafana Termbox SSR SLO dashboard.
That availability is impacted by errors triggered inside Termbox SSR (i.e. the Node.js app living in Kubernetes) that are caused by operational or performance issues in MediaWiki. They are unavoidable to a degree and acceptable as long as their overall frequency stays low, see the SLO above. The bulk of those errors is constituted by the following three error messages:
timeout of 3000ms exceeded
- Some of these timeout errors seem to happen surprisingly often during the health checks that are run periodically (config, docs). This is judged to be strange but probably harmless.
- Disregarding the health checks that go to the unused datacenter above, these errors also seem to correspond almost perfectly to the errors logged in MediaWiki PHP logstash with the message
Wikibase\View\Termbox\Renderer\TermboxRemoteRenderer: Problem requesting from the remote server
and contentRequest failed with status 0. Usually this means network failure or timeout
- The timeout for this connection going out from MediaWiki/PHP to the Termbox SSR is currently based on the Wikibase default configuration.
Request failed with status code 500
- i.e., the MediaWiki API having some server problem.
Request failed with status code 503
- These seem to be triggered by the Envoy Proxy that sits between the Termbox SSR and the MediaWiki API. More detailed information about that is available in another Phabricator comment.
These errors are discussed in more detail in a Phabricator comment. Detailed descriptions of them are visible in Logstash. Note that there seems to be a bug in how Prometheus calculates the numbers shown in Grafana, so they can diverge from what is shown in Logstash.
Debugging and Testing Production
To connect to the production services for testing, use SSH port forwarding as follows:
ssh -4 -L 3030:termbox.svc.codfw.wmnet:3030 <username>@bast1002.wikimedia.org
You can alter the bastion host as needed. You can also alter the service endpoint, e.g. eqiad vs. codfw.
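With the tunnel open, the service can be queried through its local end; a sketch reusing the query parameters from the staging example above (whether the forwarded endpoint expects http or https depends on the service's TLS termination, so the scheme here is an assumption):
# Query the forwarded termbox service via the local end of the SSH tunnel.
curl 'http://localhost:3030/termbox?entity=Q42&revision=1841500264&language=en&editLink=%2Fw%2Findex.php%2FSpecial%3ASetLabelDescriptionAliases%2FQ42&preferredLanguages=en%7Cde'; echo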