Performance/Synthetic testing/OnDemandTesting/Runbook
This is the runbook for investigating and fixing problems with on-demand testing.
Meta
- Issue tracker (Phabricator): synthetic-performance-testing
- Documentation: OnDemandTesting
Debug missing metrics direct tests
The direct tests against Wikipedia go through the direct test infrastructure. If metrics from the direct tests are missing in Grafana/Graphite, the cause could be a problem on the server that runs the tests, in Graphite, or in the infrastructure setup for on-demand testing.
All services are set up to restart on failure, so the systems should only be down in a worst-case scenario.
The most important thing for everything to work is the queue system. If the queue is down, no tests can be added and no tests can be picked up and run.
Is the queue system working?
Log into the machine that runs the queue: ssh queue.webperformancetest.eqiad1.wikimedia.cloud
Then change user: sudo su - queue
Verify that the container (a keydb container) is up and running with docker ps
If the system is not up you need to look into the logs to try to understand what's wrong.
Logs from systemctl: sudo journalctl -u keydb-queue.service -n 100
Logs from the docker container: docker logs queue-keydb-1
If you need to stop/start/restart the service you should use systemctl: sudo systemctl restart keydb-queue
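The container check above can be scripted. The following is a minimal sketch, not part of the actual setup: the helper names container_is_up and report_container are hypothetical, and the container name queue-keydb-1 is taken from the docker logs command above. The helpers read container names on stdin so that they can be fed from docker ps.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of the deployed setup.

# container_is_up NAME
# Reads running container names on stdin (one per line) and succeeds
# only if NAME matches a whole line exactly.
container_is_up() {
  local name="$1"
  grep -Fxq -- "$name"
}

# report_container NAME
# Prints "NAME: up" or "NAME: DOWN" based on the names read from stdin.
# Returns non-zero when the container is not listed.
report_container() {
  local name="$1"
  if container_is_up "$name"; then
    echo "$name: up"
  else
    echo "$name: DOWN"
    return 1
  fi
}
```

On the queue machine, as the queue user, this could be used as: docker ps --format '{{.Names}}' | report_container queue-keydb-1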
Is the database up and running?
Log into the machine that runs the database: ssh db.webperformancetest.eqiad1.wikimedia.cloud
Then change user: sudo su - database
Verify that the container (a PostgreSQL container) is up and running with docker ps
If it's not running you need to look in the logs and try to understand what's gone wrong:
Logs from the service: sudo journalctl -u database.service -n 100
Logs from the container: docker logs database-postgresql-1
If you need to stop/start/restart the service you should use systemctl: sudo systemctl restart database.service
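When the container log is long, it can help to filter it down to the serious entries. The sketch below assumes the container uses PostgreSQL's usual severity labels (ERROR, FATAL, PANIC) followed by a colon in each log line; the helper names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes log lines carry PostgreSQL severity labels such as
# "FATAL:" or "ERROR:". Helper names are hypothetical.

# pg_serious_lines
# Reads log lines on stdin and prints only those with severity
# ERROR, FATAL or PANIC. Always succeeds, even with no matches.
pg_serious_lines() {
  grep -E '(ERROR|FATAL|PANIC):' || true
}

# count_pg_serious_lines
# Reads log lines on stdin and prints how many of them are serious.
count_pg_serious_lines() {
  pg_serious_lines | wc -l | tr -d ' '
}
```

For example, on the database machine: docker logs --tail 500 database-postgresql-1 2>&1 | pg_serious_lines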
Is the testrunner/API that takes on jobs running?
Start by verifying that the tests actually run. Log into the server: ssh USERNAME@138.201.135.103
and switch to the correct user: su - sitespeedio
Then you can check the logs to see what's going on. The stdout log should continuously get new entries about tests as they run. Tail the log for a while and see what is happening: tail -f /var/log/sitespeedio/testrunner-stdout.log
You can also look for errors in the stderr log: tail -f /var/log/sitespeedio/testrunner-stderr.log
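Since the stdout log should grow continuously while tests run, its modification time is a quick signal of whether the testrunner is still working. The sketch below assumes GNU stat (as found on most Linux servers); the helper names and the threshold are hypothetical and should be adjusted to how often tests are scheduled.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU coreutils (stat -c). Helper names are hypothetical.

# log_age_seconds FILE
# Prints the number of seconds since FILE was last modified.
# Returns 2 if FILE cannot be read.
log_age_seconds() {
  local file="$1" now mtime
  now=$(date +%s)
  mtime=$(stat -c %Y -- "$file") || return 2
  echo $(( now - mtime ))
}

# log_is_fresh FILE MAX_AGE
# Succeeds if FILE was modified within the last MAX_AGE seconds.
log_is_fresh() {
  local file="$1" max_age="$2" age
  age=$(log_age_seconds "$file") || return 2
  [ "$age" -le "$max_age" ]
}
```

For example, with an assumed threshold of 15 minutes: log_is_fresh /var/log/sitespeedio/testrunner-stdout.log 900 || echo "no new log entries for 15 minutes"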
Debug ondemand testing GUI
You can check if the GUI is down by accessing https://wikiperformance.wmcloud.org/ in your web browser. You can also see if the database is working by accessing https://wikiperformance.wmcloud.org/search/ and scrolling down to see tests that have finished. If you see test data here, you know that the database is working.
The GUI frontend runs on the database server (the frontend is lightweight). If the GUI is down or not working, the cause is commonly that the queue or the database is down. If you have checked those two and they are working, you can investigate the GUI.
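The first of these checks can also be done from a terminal. The following is a minimal sketch using curl; the helper names and the 10 second timeout are assumptions. Note that an HTTP status code only shows that the page is served. It does not confirm that finished tests are listed on the search page, so the browser check is still needed for the database.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the timeout value are hypothetical.

# status_is_ok CODE
# Succeeds for 2xx and 3xx HTTP status codes, fails for anything else
# (including "000", which curl reports when the connection fails).
status_is_ok() {
  case "$1" in
    2[0-9][0-9]|3[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# check_url URL
# Requests URL and prints "URL: up (CODE)" or "URL: DOWN (CODE)".
check_url() {
  local url="$1" code
  code=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url" || true)
  if status_is_ok "$code"; then
    echo "$url: up ($code)"
  else
    echo "$url: DOWN ($code)"
    return 1
  fi
}
```

For example: check_url https://wikiperformance.wmcloud.org/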
You can check the log using tail -f /var/log/sitespeed.io/server-stdout.log
If the GUI service needs to be restarted you should use: sudo systemctl restart sitespeedserver.service