Performance/Synthetic testing/OnDemandTesting/Runbook

From Wikitech

This is the runbook for investigating and fixing problems with the on demand testing.

Meta

Debug missing metrics direct tests

The direct tests against Wikipedia go through the direct test infrastructure. If metrics from the direct tests are missing in Grafana/Graphite, the problem could be on the server that runs the tests, in Graphite, or in the infrastructure setup for the on demand testing.

All services are set up to restart on failure, so the systems should be down only in a worst-case scenario.

The most important thing for everything to work is the queue system. If the queue is down, no tests can be added and no tests can be picked up and run.

Is the queue system working?

Log into the machine that runs the queue: ssh queue.webperformancetest.eqiad1.wikimedia.cloud

Then change user: sudo su - queue

Verify that the container (a keydb container) is up and running with docker ps

If the system is not up you need to look into the logs to try to understand what's wrong.

Logs from the systemd service: sudo journalctl -u keydb-queue.service -n 100

Logs from the docker container: docker logs queue-keydb-1

If you need to stop/start/restart the service you should use systemctl: sudo systemctl restart keydb-queue
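The queue check above can be sketched as a small script. This is only an illustration: the function names container_listed and check_queue are made up for this example, and it assumes the current user may run docker and sudo journalctl as described in the steps above.

```shell
# Reads container names (one per line) on stdin.
# Succeeds only if the name given as $1 is one of them (exact, whole-line match).
container_listed() {
  grep -Fxq -- "$1"
}

# Prints a status line; if the container is missing, prints the recent service logs.
# Container and unit names are the ones used in this runbook.
check_queue() {
  if docker ps --format '{{.Names}}' | container_listed queue-keydb-1; then
    echo "queue-keydb-1 is running"
  else
    echo "queue-keydb-1 is NOT running; recent service logs follow"
    sudo journalctl -u keydb-queue.service -n 100 --no-pager
  fi
}
```

Run check_queue on the queue machine. It only reads state and logs; it does not restart anything.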

Is the database up and running?

Log into the machine that runs the database: ssh db.webperformancetest.eqiad1.wikimedia.cloud

Then change user: sudo su - database

Verify that the container (a PostgreSQL container) is up and running with docker ps

If it's not running you need to look in the logs and try to understand what's gone wrong:

Logs from the service: sudo journalctl -u database.service -n 100

Logs from the container: docker logs database-postgresql-1

If you need to stop/start/restart the service you should use systemctl: sudo systemctl restart database.service
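Because the services restart on failure, a container can be stuck in a restart loop rather than simply missing. A minimal sketch of a check that looks at the container status text is shown here. The function names status_is_up and check_database are hypothetical, and the sketch assumes the status strings Docker normally prints, such as "Up 3 hours" or "Restarting (1) 5 seconds ago".

```shell
# Succeeds only when a `docker ps` status string starts with "Up".
# Note: a paused container also reports "Up ... (Paused)" and is treated as up here.
status_is_up() {
  case "$1" in
    Up*) return 0 ;;
    *)   return 1 ;;
  esac
}

# Looks up the status of the PostgreSQL container named in this runbook.
# -a includes stopped containers, so an exited container is reported too.
check_database() {
  status=$(docker ps -a --filter name=database-postgresql-1 --format '{{.Status}}')
  if status_is_up "$status"; then
    echo "database-postgresql-1: $status"
  else
    echo "database-postgresql-1 is not up (status: ${status:-not found}); recent container logs follow"
    docker logs --tail 100 database-postgresql-1
  fi
}
```

Run check_database as the database user on the database machine.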

Is the testrunner/API that takes on jobs running?

Start by verifying that the tests actually run. Log into the server with ssh USERNAME@138.201.135.103 and switch to the correct user: su - sitespeedio

Then check the logs to see what is going on. The stdout log should continuously get new entries as tests keep running. Tail the log for a while and see what is happening: tail -f /var/log/sitespeedio/testrunner-stdout.log

You can also look for errors in the stderr log: tail -f /var/log/sitespeedio/testrunner-stderr.log
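When the logs are long, it can help to filter the most recent lines instead of tailing them live. The helper below is a sketch only: recent_errors is a made-up name, and matching on the word "error" is an assumption about the log format, not something this runbook guarantees.

```shell
# Print the lines that mention "error" (case-insensitive) among the last N lines of a log.
# Usage: recent_errors FILE [N]   (N defaults to 200)
# Returns a non-zero status when nothing matches, as grep does.
recent_errors() {
  tail -n "${2:-200}" "$1" | grep -i 'error'
}
```

For example: recent_errors /var/log/sitespeedio/testrunner-stderr.log 500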

Debug ondemand testing GUI

You can check if the GUI is down by accessing https://wikiperformance.wmcloud.org/ in your web browser. You can also see if the database is working by accessing https://wikiperformance.wmcloud.org/search/ and scrolling down to see tests that have finished. If you see test data there, you know that the database is working.
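The browser check can also be approximated from a terminal with curl. This is a rough sketch under assumptions: the function names http_verdict and gui_status are illustrative, and a 2xx or 3xx response is taken to mean the GUI is up, which says nothing about whether the page content is correct.

```shell
# Map an HTTP status code to a coarse verdict.
http_verdict() {
  case "$1" in
    2??|3??) echo "up" ;;
    000)     echo "unreachable" ;;
    *)       echo "error ($1)" ;;
  esac
}

# Fetch only the status code for a URL, giving up after 10 seconds.
# curl reports 000 when it received no HTTP response at all.
gui_status() {
  code=$(curl -s -o /dev/null -m 10 -w '%{http_code}' "$1")
  http_verdict "$code"
}
```

For example: gui_status https://wikiperformance.wmcloud.org/ and gui_status https://wikiperformance.wmcloud.org/search/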

The GUI frontend runs on the database server (the frontend is lightweight). If the GUI is down or not working, the cause is often that the queue or the database is down. If you have checked those two and they are working, you can investigate the GUI.

You can check the log using tail -f /var/log/sitespeed.io/server-stdout.log

If the GUI service needs to be restarted you should use: sudo systemctl restart sitespeedserver.service

Update to a new version of the server/testrunner

The server and the testrunner run Dockerised versions and auto-update for all minor versions. To update to a new major version, you need to stop the service, update the start script, reload the systemd configuration and then restart the service.

Update the server

Make sure to read the changelog if you update a major version so you know if something needs to be changed.

Edit the service file:

sudo nano /etc/systemd/system/sitespeedserver.service

Change the major version number of the Docker container. In the start file it looks like this: sitespeedio/server:1

Then reload the configuration: sudo systemctl daemon-reload

And then restart the service: sudo systemctl restart sitespeedserver.service

Update the testrunner

Make sure to read the changelog if you update a major version so you know if something needs to be changed.

Edit the service file:

sudo nano /etc/systemd/system/sitespeedtestrunner.service

Change the major version number of the Docker container. In the start file it looks like this: sitespeedio/testrunner:1

Then reload the configuration: sudo systemctl daemon-reload

And then restart the service: sudo systemctl restart sitespeedtestrunner.service
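Before editing a service file by hand, the version change can be previewed without touching the file. The sketch below assumes the image tag is a bare major number, as in sitespeedio/server:1. The function name preview_major_bump is hypothetical, and the new major version 2 in the usage example is only an illustration.

```shell
# Print FILE with the major version tag of IMAGE replaced by NEW_MAJOR.
# The file itself is not modified; the result goes to stdout for review.
# Usage: preview_major_bump FILE IMAGE NEW_MAJOR
preview_major_bump() {
  sed -E "s#$2:[0-9]+#$2:$3#g" "$1"
}
```

For example, to see what would change for the server: preview_major_bump /etc/systemd/system/sitespeedserver.service sitespeedio/server 2 | diff /etc/systemd/system/sitespeedserver.service -

The same call with sitespeedtestrunner.service and sitespeedio/testrunner previews the testrunner change. The actual edit, daemon-reload and restart are still done as described above.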