Performance/Synthetic testing/OnDemandTesting/Runbook

From Wikitech

This is the runbook for investigating and fixing problems with on-demand testing.

Meta

Debug missing metrics from direct tests

The direct tests against Wikipedia go through the direct test infrastructure. If metrics from the direct tests are missing in Grafana/Graphite, the cause could be a problem on the server that runs the tests, in Graphite, or in the infrastructure setup for the on-demand testing.

All services are set up to restart on failure, so the systems should be down only in a worst-case scenario.

The most important thing for everything to work is the queue system. If the queue is down, no tests can be added and no tests can be picked up and run.

Is the queue system working?

Log into the machine that runs the queue: ssh queue.webperformancetest.eqiad1.wikimedia.cloud

Then change user: sudo su - queue

Verify that the container (a keydb container) is up and running with docker ps

If the system is not up you need to look into the logs to try to understand what's wrong.

Logs from systemctl: sudo journalctl -u keydb-queue.service -n 100

Logs from the docker container: docker logs queue-keydb-1

If you need to stop/start/restart the service you should use systemctl: sudo systemctl restart keydb-queue
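The container check above can be scripted. The following is a minimal sketch, not part of the existing setup: the helper name container_listed is hypothetical, and it assumes the container is named queue-keydb-1, as in the docker logs command above.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a named container is among the running containers.
# container_listed is a hypothetical helper. It reads container names
# (one per line) on stdin and succeeds only on an exact, whole-line match.
container_listed() {
  local name="$1"
  grep -Fxq -- "$name"
}

# Example, run on the queue host as the queue user:
#   docker ps --format '{{.Names}}' | container_listed queue-keydb-1 \
#     && echo "queue container is running" \
#     || echo "queue container is NOT running, check the logs"
```

Matching the whole line avoids a false positive when another container name merely contains the same prefix.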

Is the database up and running?

Log into the machine that runs the database: ssh db.webperformancetest.eqiad1.wikimedia.cloud

Then change user: sudo su - database

Verify that the container (a PostgreSQL container) is up and running with docker ps

If it's not running you need to look in the logs and try to understand what's gone wrong:

Logs from the service: sudo journalctl -u database.service -n 100

Logs from the container: docker logs database-postgresql-1

If you need to stop/start/restart the service you should use systemctl: sudo systemctl restart database.service
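When the logs are long, it can help to narrow them down to the lines that matter. The sketch below is only a suggestion: the helper name postgres_problems is hypothetical, and it assumes the container logs use the standard PostgreSQL severity labels ERROR, FATAL and PANIC.

```shell
#!/usr/bin/env bash
# Sketch: keep only log lines that carry a PostgreSQL severity level
# which usually points at a real problem.
# postgres_problems is a hypothetical helper. It reads log lines on stdin.
postgres_problems() {
  # "|| true" keeps the exit status at 0 when nothing matches.
  grep -E 'ERROR|FATAL|PANIC' || true
}

# Example, run on the database host as the database user
# (docker logs may write to stderr, hence the 2>&1):
#   docker logs database-postgresql-1 2>&1 | postgres_problems | tail -n 20
```

Empty output means that no line with one of those labels was found. It does not prove that the database is healthy.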

Is the testrunner/API that takes on jobs running?

Start by verifying that the tests actually run. Log into the server with ssh USERNAME@138.201.135.103 and switch to the correct user with su - sitespeedio

Then you can check the logs to see what's going on. The stdout log should continuously get new entries as tests keep running. Tail the log for a while and see what is happening: tail -f /var/log/sitespeedio/testrunner-stdout.log

You can also look for errors in the stderr log: tail -f /var/log/sitespeedio/testrunner-stderr.log
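Instead of watching the tail output, a quick check is whether the stdout log has been written to recently. This is a minimal sketch: the helper name log_is_fresh and the default threshold of 10 minutes are assumptions. Pick a threshold that matches how often tests are expected to run.

```shell
#!/usr/bin/env bash
# Sketch: succeed if FILE was modified within the last MINUTES minutes.
# log_is_fresh is a hypothetical helper. The default of 10 minutes is an
# assumption, not a documented property of the testrunner.
log_is_fresh() {
  local file="$1" minutes="${2:-10}"
  # find prints the path only if it was modified less than $minutes ago.
  # A missing file yields no output, so the check fails.
  [ -n "$(find "$file" -mmin "-$minutes" 2>/dev/null)" ]
}

# Example, run on the testrunner server:
#   log_is_fresh /var/log/sitespeedio/testrunner-stdout.log 10 \
#     || echo "no new log entries in the last 10 minutes"
```

A stale log suggests that the testrunner is not picking up jobs, so the next step is to check the queue and then the stderr log.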

Debug the on-demand testing GUI

You can check if the GUI is down by accessing https://wikiperformance.wmcloud.org/ in your web browser. You can also check that the database is working by accessing https://wikiperformance.wmcloud.org/search/ and scrolling down to see tests that have finished. If you see test data there, you know that the database is working.

The GUI frontend runs on the database server (the frontend is lightweight). If the GUI is down or not working, the cause is commonly that the queue or the database is down. If you have checked those two and they are working, you can investigate the GUI.

You can check the log using tail -f /var/log/sitespeed.io/server-stdout.log

If the GUI service needs to be restarted you should use: sudo systemctl restart sitespeedserver.service
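The browser check of the GUI can also be done from a terminal. The following is a sketch under stated assumptions: the helper names classify_status and check_url are hypothetical, and treating any 2xx or 3xx response as "up" is a simplification. It only shows that the web server answers, not that test data is listed on the search page.

```shell
#!/usr/bin/env bash
# Sketch: probe a URL and report whether it answers with a success status.
# classify_status and check_url are hypothetical helpers.

# Treat 2xx and 3xx as "up" and everything else as "down".
# curl reports 000 when it gets no response.
classify_status() {
  case "$1" in
    2??|3??) echo up ;;
    *) echo down ;;
  esac
}

# Prints "<url> <http code> <up|down>". Gives up after 15 seconds.
check_url() {
  local url="$1" code
  code=$(curl -s -o /dev/null -m 15 -w '%{http_code}' "$url") || true
  echo "$url $code $(classify_status "$code")"
}

# Example:
#   check_url https://wikiperformance.wmcloud.org/
#   check_url https://wikiperformance.wmcloud.org/search/
```

If both URLs report "down" while the queue and the database are running, look at the server log and restart the GUI service as described above.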