Portal:Toolforge/Admin/Runbooks/ToolforgeKubernetesHAproxyServerDown
This happens when one of the servers behind an haproxy instance is down as seen from that instance (that is, one of the hosts haproxy redirects to, not to be confused with the haproxy server itself).
The procedures in this runbook require admin permissions to complete.
Error / Incident
In the alert you should see the instance (the haproxy host), the backend (the web service being proxied) and the server (the server the requests are proxied to). That might get you started on where to look.
Debugging
You'll get the haproxy that's seeing issues from the instance label of the alert, and the backend server that's failing from the server label.
A generic way of debugging is going to the haproxy config and checking the backend url:
dcaro@tools-k8s-haproxy-6:~$ grep -R <backend> /etc/haproxy/conf.d
# then check the config for that backend, specifically the backend->server section:
dcaro@tools-k8s-haproxy-6:~$ cat /etc/haproxy/conf.d/k8s-ingress.cfg
...
backend k8s-ingress
...
server tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud tools-k8s-ingress-7.tools.eqiad1.wikimedia.cloud:30002 check
server tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud tools-k8s-ingress-8.tools.eqiad1.wikimedia.cloud:30002 check
server tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud:30002 check
# those will be the hosts/ports that are failing, you can check each of them manually from haproxy:
dcaro@tools-k8s-haproxy-6:~$ nc -zv tools-k8s-ingress-9.tools.eqiad1.wikimedia.cloud 30002
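The manual check above can be repeated for every server of a backend in one go. The following is a minimal sketch, not an established Toolforge tool: the function names `backend_servers` and `check_backend` are made up for illustration, and it assumes the config keeps the layout shown above (one `server <name> <host>:<port> check` line per server under a `backend <name>` line).

```shell
# Hypothetical helpers (names are illustrative, not part of any Toolforge tooling).
# Assumes "server <name> <host>:<port> ..." lines under "backend <name>".

# Print "<host> <port>" for every server declared under the given backend.
backend_servers() {
    awk -v backend="$1" '
        $1 == "backend" { in_backend = ($2 == backend); next }
        $1 ~ /^(frontend|listen|defaults|global)$/ { in_backend = 0; next }
        in_backend && $1 == "server" {
            split($3, parts, ":")
            print parts[1], parts[2]
        }
    ' "$2"
}

# Probe each server of the backend; nc -z only checks that the TCP port
# accepts connections, it does not validate the HTTP response.
check_backend() {
    backend_servers "$1" "$2" | while read -r host port; do
        if nc -z -w 3 "$host" "$port" </dev/null 2>/dev/null; then
            printf 'OK %s:%s\n' "$host" "$port"
        else
            printf 'FAIL %s:%s\n' "$host" "$port"
        fi
    done
}
```

Run from the haproxy host named in the instance label, for example `check_backend k8s-ingress /etc/haproxy/conf.d/k8s-ingress.cfg`. Any line starting with `FAIL` points at a host/port that is not reachable from that haproxy.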
Common issues
Add new issues here when you encounter them!
- Some state gets stuck (calico?), and restarting the pods flushes it. You can try restarting all the pods in the affected backend service (e.g. ingress-nginx or api-gateway).
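One possible way to do the restart is sketched below. It rests on assumptions that this page does not confirm: that the affected service runs as Kubernetes Deployments, that its namespace has the same name as the service (for example `ingress-nginx`), and that you run it where `kubectl` has admin access to the cluster. The function name `restart_backend_pods` is made up for illustration.

```shell
# Hypothetical helper: restart all Deployments in one namespace, then list
# the pods so you can watch the replacements come up.
# Assumption: the backend service is deployed as Deployments in "$1".
restart_backend_pods() {
    if [ -z "$1" ]; then
        echo "usage: restart_backend_pods <namespace>" >&2
        return 2
    fi
    # "rollout restart deployment" without a name restarts every Deployment
    # in the namespace.
    kubectl --namespace "$1" rollout restart deployment || return 1
    kubectl --namespace "$1" get pods --output wide
}
```

For example, `restart_backend_pods ingress-nginx`. If the service uses DaemonSets or StatefulSets instead, the resource type in the restart command would need to change accordingly.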
Related information
- See HAProxy for generic information like commands/etc.
Old incidents
Add any incident tasks here!
- toolforge: ingress errors 2025-03-05 task T387959