Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown

This happens when one of the haproxy servers is down from that haproxy instance (one of the hosts haproxy redirects too, not to be confused with the haproxy server itself).

The procedures in this runbook require admin permissions to complete.

Error / Incident

In the alert you should see the instance (the haproxy host), the backend (the web service being proxied) and the server (the server the requests are proxied to). That might get you started on where to look.

Debugging

You'll get the haproxy that's seeing issues from the instance label of the alert, and the backend that's failing from the server label.

A generic way of debugging is going to the haproxy config and checking the backend url:

dcaro@tools-k8s-haproxy-6:~$ grep -R <backend> /etc/haproxy/conf.d
# then check the config for that backend, specifically the baackend->server section:
dcaro@tools-k8s-haproxy-6:~$ cat /etc/haproxy/conf.d/k8s-ingress.cfg
...
backend k8s-ingress
...
    server tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud:30002 check
    server tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud:30002 check
    server tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud:30002 check

# those will be the hosts/ports that are failing, you can check each of the manually from haproxy:
dcaro@tools-k8s-haproxy-6:~$ nc -zv tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud 30002

Common issues

Add new issues here when you encounter them!

Some state gets stuck (calico?), and restarting the pods flushes it. You can try restarting all the pods in the backend service affeccted (ex. ingres-nginx, or api-gateway).

Related information

See HAProxy for generic information like commands/etc.

Old incidents

Add any incident tasks here!

toolforge: ingress errors 2025-03-05 task T387959