Portal:Toolforge/Admin/Runbooks/ToolsNFSDown

The ToolsNFSDown alert fires when the nfs-server service is not running or cannot be found in the stats.

The procedures in this runbook require admin permissions to complete.

Error / Incident

If the value is 0, the service is down. If the value is -1, Prometheus is not gathering the stats correctly (the service might be down, but we can't tell).
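The two values can be read as a small decision table. The following is only a sketch: `interpret_alert_value` is a hypothetical helper name, and treating any other value as "not alerting" is an assumption, not something the alert definition is known to guarantee.

```shell
#!/bin/sh
# Hypothetical helper: map the alert's value to what it tells us.
# The meanings of 0 and -1 come from this runbook; the fallback
# branch is an assumption.
interpret_alert_value() {
    case "$1" in
        0)  echo "service is down" ;;
        -1) echo "stats are missing, service state unknown" ;;
        *)  echo "not alerting" ;;
    esac
}
```

For example, `interpret_alert_value -1` points you at the "no stats" section rather than at the service itself.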

As of 2024-07-30, the NFS server is tools-nfs-2.tools.eqiad1.wikimedia.cloud.

Debugging

Check the service status

SSH to the server and check the service status:

dcaro@tools-nfs-2:~$ sudo systemctl status nfs-server.service 
● nfs-server.service - NFS server and services
     Loaded: loaded (/lib/systemd/system/nfs-server.service; enabled; vendor preset: enabled)
     Active: active (exited) since Mon 2024-06-24 14:25:57 UTC; 1 months 5 days ago
   Main PID: 721 (code=exited, status=0/SUCCESS)
      Tasks: 0 (limit: 77152)
     Memory: 0B
        CPU: 0
     CGroup: /system.slice/nfs-server.service

Jun 24 14:25:55 tools-nfs-2 systemd[1]: Starting NFS server and services...
Jun 24 14:25:57 tools-nfs-2 systemd[1]: Finished NFS server and services.
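If you prefer a one-word answer over the full status output, `systemctl is-active` prints only the unit's active state. A healthy unit shown as "active (exited)" above is reported as `active`. The sketch below is not part of the established procedure; `classify_unit_state` is a hypothetical helper, and grouping every state other than `active`, `activating` and `reloading` as "down" is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: classify the word printed by `systemctl is-active`.
classify_unit_state() {
    case "$1" in
        active)               echo "ok" ;;
        activating|reloading) echo "transitioning" ;;
        *)                    echo "down" ;;   # inactive, failed, deactivating, ...
    esac
}

# Only query systemd where systemctl is actually available.
if command -v systemctl >/dev/null 2>&1; then
    state="$(systemctl is-active nfs-server.service 2>/dev/null)"
    classify_unit_state "$state"
fi
```

If the result is "down", `sudo journalctl -u nfs-server.service -n 50 --no-pager` shows the most recent log lines for the unit.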

If there are no stats

This case is harder to debug, and it is most likely related to the way we gather metrics on tools/toolsbeta.

Note that this is not directly related to the metricsinfra monitoring project; it is Toolforge's own setup.

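You can start by going to the project's Prometheus page and trying to query the stats there. Example for tools (the expression in this link queries the tekton-pipelines-controller job; replace it with the one you are checking):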

https://tools-prometheus.wmflabs.org/tools/classic/graph?g0.range_input=1h&g0.expr=sum(up%7Bjob%3D%22tekton-pipelines-controller%22%7D)&g0.tab=1
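The same check can be run from a terminal through the standard Prometheus HTTP API (`/api/v1/query`). This is a minimal sketch under assumptions: `prom_instant_query` is a hypothetical helper, the `CURL` override exists only to allow a dry run, and it is assumed (not verified here) that the API is served under the same `/tools` prefix as the graph page.

```shell
#!/bin/sh
# Hypothetical helper: run an instant query against a Prometheus base URL.
#   $1: base URL, e.g. https://tools-prometheus.wmflabs.org/tools (assumed prefix)
#   $2: PromQL expression
# -G sends the url-encoded data as a GET query string.
prom_instant_query() {
    "${CURL:-curl}" -sS -G --data-urlencode "query=$2" "$1/api/v1/query"
}

# Example call, left commented out because it needs network access:
# prom_instant_query "https://tools-prometheus.wmflabs.org/tools" \
#     'sum(up{job="tekton-pipelines-controller"})'
```

An empty `result` list in the JSON response means Prometheus has no samples for that expression, which matches the "no stats" case.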

Common issues

Add here any new common issues you find.

Old incidents

Add here any new tasks for incidents you might encounter.