Performance/Synthetic testing/OnDemandTesting/Runbook
This is the runbook for investigating and fixing problems with on-demand testing.
Meta
- Issue tracker (Phabricator): synthetic-performance-testing
- Documentation: OnDemandTesting
Debug missing metrics from direct tests
The direct tests against Wikipedia go through the direct test infrastructure. If metrics from the direct tests are missing in Grafana/Graphite, the cause could be a problem on the server that runs the tests, in Graphite, or in the infrastructure set up for on-demand testing.
All services are set up to restart on failure, so the systems should be down only in a worst-case scenario.
The most important thing for everything to work is the queue system. If the queue is down, no tests can be added and no tests can be picked up and run.
Is the queue system working?
Log into the machine that runs the queue: ssh queue.webperformancetest.eqiad1.wikimedia.cloud
Then change user: sudo su - queue
Verify that the container (a keydb container) is up and running with docker ps
If the system is not up you need to look into the logs to try to understand what's wrong.
Logs from systemctl: sudo journalctl -u keydb-queue.service -n 100
Logs from the docker container: docker logs queue-keydb-1
If you need to stop/start/restart the service, use systemctl: sudo systemctl restart keydb-queue
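The docker ps check above can be wrapped in a small shell function so it gives a yes/no answer. This is a minimal sketch: the function name container_is_running is hypothetical, the container name queue-keydb-1 is the one used in the docker logs command above, and it assumes the current user is allowed to run docker.

```shell
# Sketch: succeed (exit 0) only if a container with exactly this name
# is listed by `docker ps`. The function name is hypothetical.
container_is_running() {
  # $1 = container name, for example queue-keydb-1
  # --format '{{.Names}}' prints one running container name per line;
  # grep -Fx matches the whole line literally, -q suppresses output.
  docker ps --format '{{.Names}}' | grep -Fxq "$1"
}
```

On the queue host it could be used as: container_is_running queue-keydb-1 && echo "queue is up" || echo "queue is down, check the logs"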
Is the database up and running?
Log into the machine that runs the database: ssh db.webperformancetest.eqiad1.wikimedia.cloud
Then change user: sudo su - database
Verify that the container (a PostgreSQL container) is up and running with docker ps
If it's not running you need to look in the logs and try to understand what's gone wrong:
Logs from the service: sudo journalctl -u database.service -n 100
Logs from the container: docker logs database-postgresql-1
If you need to stop/start/restart the service, use systemctl: sudo systemctl restart database.service
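Before reading the logs, it can help to ask systemd directly whether the unit is active. The following is a minimal sketch: the function name report_unit is hypothetical, the unit name database.service comes from the commands above, and it assumes systemctl is-active is available on the host.

```shell
# Sketch: print the state of a systemd unit and succeed only when it is active.
# The function name is hypothetical.
report_unit() {
  unit="$1"
  # `systemctl is-active` prints the state (active, inactive, failed, ...)
  state="$(systemctl is-active "$unit" 2>/dev/null)"
  echo "$unit: ${state:-unknown}"
  [ "$state" = "active" ]
}
```

For example, report_unit database.service || sudo journalctl -u database.service -n 100 would show the recent log entries only when the unit is not active.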
Is the testrunner/API that takes on jobs running?
Start by verifying that the tests actually run. Log into the server: ssh USERNAME@138.201.135.103
and switch to the correct user: su - sitespeedio
Then check the logs to see what is going on. The stdout log should continuously get new entries as tests run. Tail the log for a while and see what is happening: tail -f /var/log/sitespeedio/testrunner-stdout.log
You can also look for errors in the stderr log: tail -f /var/log/sitespeedio/testrunner-stderr.log
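Since the stdout log should keep growing while tests run, a quick way to spot a stalled testrunner is to check when the log file was last modified. This is a minimal sketch: the function name log_is_fresh and the 10 minute threshold are assumptions, and it relies on find supporting -mmin (GNU and BSD find do).

```shell
# Sketch: succeed if the file was modified less than N minutes ago.
# The function name and any threshold you pass are assumptions.
log_is_fresh() {
  # $1 = log file, $2 = maximum age in minutes
  # find prints the path only when the modification time is recent enough;
  # a missing file produces no output, so the check fails.
  [ -n "$(find "$1" -mmin "-$2" 2>/dev/null)" ]
}
```

For example: log_is_fresh /var/log/sitespeedio/testrunner-stdout.log 10 || echo "no new log entries in the last 10 minutes"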
Debug the on-demand testing GUI
You can check if the GUI is down by accessing https://wikiperformance.wmcloud.org/ in your web browser. You can also check that the database is working by accessing https://wikiperformance.wmcloud.org/search/ and scrolling down to see tests that have finished. If you see test data there, the database is working.
The GUI frontend runs on the database server (the frontend is lightweight). If the GUI is down or not working, the cause is often that the queue or the database is down. If you have checked those two and they are working, you can investigate the GUI.
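The same availability check can be done from a terminal. This is a minimal sketch: the function names http_status and gui_is_up are hypothetical, and treating only HTTP 200 as "up" is an assumption. It does not inspect the page content, so it cannot replace looking at the search page for finished tests.

```shell
# Sketch: print the final HTTP status code for a URL (after redirects).
# Function names are hypothetical.
http_status() {
  # -s silent, -L follow redirects, -o /dev/null discard the body,
  # -w '%{http_code}' print only the status code, --max-time limit in seconds
  curl -s -L -o /dev/null -w '%{http_code}' --max-time 10 "$1"
}

# Succeed only when the URL answers with HTTP 200 (an assumption about
# what counts as "up").
gui_is_up() {
  [ "$(http_status "$1")" = "200" ]
}
```

For example: gui_is_up https://wikiperformance.wmcloud.org/ || echo "GUI is not responding with 200"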
You can check the log using tail -f /var/log/sitespeed.io/server-stdout.log
If the GUI service needs to be restarted you should use: sudo systemctl restart sitespeedserver.service
Update to a new version of the server/testrunner
The server and the testrunner run as Dockerised versions and auto-update for all minor versions. If you need to update to a new major version, you need to stop the service, update the start script, reload the configuration, and then restart.
Update the server
Make sure to read the changelog if you update a major version so you know if something needs to be changed.
Edit the service file:
sudo nano /etc/systemd/system/sitespeedserver.service
Change the major version number of the Docker container. In the start file it looks like this: sitespeedio/server:1
Then reload the configuration: sudo systemctl daemon-reload
And then restart the service: sudo systemctl restart sitespeedserver.service
Update the testrunner
Make sure to read the changelog if you update a major version so you know if something needs to be changed.
Edit the service file:
sudo nano /etc/systemd/system/sitespeedtestrunner.service
Change the major version number of the Docker container. In the start file it looks like this: sitespeedio/testrunner:1
Then reload the configuration: sudo systemctl daemon-reload
And then restart the service: sudo systemctl restart sitespeedtestrunner.service
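As an alternative to editing the file by hand in nano, the version change can be sketched with sed. This is a minimal sketch, not a tested procedure for these hosts: the function name set_image_major is hypothetical, and it assumes the image reference appears in the service file exactly in the form shown above (for example sitespeedio/server:1). It keeps a backup copy with a .bak suffix next to the file.

```shell
# Sketch: replace the major version tag of a Docker image reference in a file.
# The function name is hypothetical.
set_image_major() {
  file="$1"    # for example /etc/systemd/system/sitespeedserver.service
  image="$2"   # for example sitespeedio/server
  major="$3"   # the new major version number
  # Rewrite "<image>:<digits>" to "<image>:<major>"; '#' is the sed delimiter
  # because the image name contains '/'. -i.bak keeps the original as <file>.bak.
  sed -i.bak -E "s#(${image}):[0-9]+#\1:${major}#" "$file"
}
```

Files under /etc/systemd/system need root permissions, so the function would have to run in a root shell. After the change, reload the configuration and restart the service as described in the steps above.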