Catalyst/Incidents/2025-01-29
Time to restore: 5 hours 32 minutes
Timeline
- 2025-01-29T17:36:03.719212+00:00: OOM killer starts thrashing and continues until the soft reboot at 18:38, per dmesg
- 2025-01-29T17:42:00+00:00: esanders: "https://patchdemo.wmcloud.org/ is down for me, is that known?"
- Noted that load in Grafana is 400 (?!)
- 2025-01-29T18:38:01+00:00 soft reboot
- 2025-01-29T18:43:00+00:00 successful reboot + SSH; load immediately spikes as k3s spins up (5m avg load > 20)
- 2025-01-29T20:51:17+00:00: Fix Catalyst Environments
- ...: Resize k3s VM RAM 16 GB -> 32 GB
- ...: Fix Catalyst Environments (again)
- 2025-01-29T23:08:00+00:00: All clear
Symptoms
- website doesn't load
- cannot SSH into machine
Overview
- Machine ran out of memory due to too many environments
- Nothing malicious, normal use
- OOM Killer caused full system lock-up
- Reboot got us back online
- Knock-on problems with catalyst environments added complications
Logs pulled from Horizon
[331315.051863] Memory cgroup out of memory: Killed process 5521 (mysqld) total-vm:4204604kB, anon-rss:380944kB, file-rss:6152kB, shmem-rss:0kB, UID:1001 pgtables:1204kB oom_score_adj:985
[687059.151280] Memory cgroup out of memory: Killed process 878386 (mysqld) total-vm:3942160kB, anon-rss:381940kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:1184kB oom_score_adj:985
[687059.156277] Memory cgroup out of memory: Killed process 878386 (mysqld) total-vm:3942160kB, anon-rss:382004kB, file-rss:2844kB, shmem-rss:0kB, UID:1001 pgtables:1184kB oom_score_adj:985
[687059.159804] Memory cgroup out of memory: Killed process 2225273 (runc:[2:INIT]) total-vm:1236700kB, anon-rss:3424kB, file-rss:5808kB, shmem-rss:0kB, UID:1001 pgtables:108kB oom_score_adj:985
[1051464.115192] Memory cgroup out of memory: Killed process 2225342 (mysqld) total-vm:4731292kB, anon-rss:381436kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:1252kB oom_score_adj:985
[1051464.118194] Memory cgroup out of memory: Killed process 3460173 (runc:[2:INIT]) total-vm:1236536kB, anon-rss:3176kB, file-rss:5816kB, shmem-rss:0kB, UID:1001 pgtables:108kB oom_score_adj:985
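For reference, a minimal sketch of pulling these entries from the kernel ring buffer once SSH access is back (assumes root on the k3s VM):
dmesg -T | grep -i 'out of memory'   # -T prints human-readable timestamps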
After everything came back, new problems:
- CPU running hot for certain PHP processes
- Inspected /proc/<pid>/environ to find the Helm deployment for an out-of-control PHP process (see the sketch after this list)
- Traced to the jobrunner on pod wiki-5648f3da62-146-mediawiki-b9b775bb8-l8fjv
- It was thrashing with errors (can't find the jobs table in the database)
- All Catalyst wikis were also reporting SQL errors
- Turns out the MySQL instances lost their tables
- Tried to redeploy
- The initContainer won't run again if the pod is deleted
- The initContainer runs install.sh/post-install.sh, which is how the database gets set up, so we can't re-run it with kubectl; we'll have to do a helm uninstall/reinstall
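A minimal sketch of the /proc inspection described above (the PID is hypothetical; which variables show up depends on what the chart injects into the container):
PID=12345                           # hypothetical PID of the runaway PHP process
tr '\0' '\n' < /proc/$PID/environ   # environ is NUL-separated; print one variable per line
cat /proc/$PID/cgroup               # the cgroup path also reveals the owning pod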
Noticed fs.inotify.max_user_instances did not persist: https://phabricator.wikimedia.org/T383280
root@k3s:~# sysctl fs.inotify.max_user_instances
fs.inotify.max_user_instances = 128
root@k3s:~# sysctl -w fs.inotify.max_user_instances=1024
fs.inotify.max_user_instances = 1024
root@k3s:~# sysctl fs.inotify.max_user_instances
fs.inotify.max_user_instances = 1024
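A minimal sketch of how the value could be persisted across reboots (standard sysctl.d approach; the filename is hypothetical and this was not applied during the incident):
echo 'fs.inotify.max_user_instances = 1024' > /etc/sysctl.d/99-inotify.conf   # hypothetical drop-in file
sysctl --system   # reload all sysctl config files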
20:51:17: Bringing Catalyst envs back via helm uninstall/reinstall:
- Get list of catalyst releases:
KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm list -n cat-env | awk '/wiki-/ { print $1 }' > list-envs.txt
- Get copy of helm chart: git clone ci-charts
- Copy all values from running deploys (the ones that now have no SQL database):
while read i; do helm get values -n cat-env "$i" -o yaml > values/"$i".yaml; done < list-envs.txt
- Uninstall all helm releases:
while read i; do helm uninstall -n cat-env "$i"; done < list-envs.txt
- Reinstall all helm releases:
while read i; do helm install -n cat-env "$i" ci-charts/mediawiki -f values/"$i".yaml; done < list-envs.txt
- Troubleshoot any failing releases: set debug.initContainer = true in values/<helm-release-name>.yaml
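A minimal sketch of redeploying a single failing release with that flag (assuming the chart value can also be passed via --set; <release> is a placeholder):
helm upgrade --install -n cat-env <release> ci-charts/mediawiki -f values/<release>.yaml --set debug.initContainer=true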
These commands were written as a collection of scripts here: https://gitlab.wikimedia.org/kindrobot/reinstall-bad-catalyst-envs
Upped memory on the k3s VM from 16 GB to 32 GB. The DB problem happened again(!) following the reboot after the VM resize :((
TODOS
- Figure out inotify persistence
- Figure out the database PV persistence problem
- Done: Publish the scripts we wrote during this incident for redeploying all Catalyst patchdemos
- Add an api call to reinstall a wiki
- Add a button on UI?
- Add CPU and memory limits for wikis (see the sketch after this list)
- Done: Increase the memory on the k3s box
- Sleep if the init container fails, to allow debugging
- Admin password for wikis is incorrect; fix that
- Fix: if you leave the page, the patchdemo DB doesn't update
- Done: Turn off the checkbox to use patchdemo (until some of the above is better)
- See if we can make Grafana alerts notify us
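For the CPU/memory limits item, a minimal sketch of a one-off cap on a single wiki deployment (placeholder names and values; the durable fix would set resources in the Helm chart values instead):
kubectl set resources deployment/<wiki-deployment> -n cat-env --requests=cpu=100m,memory=512Mi --limits=cpu=1,memory=2Gi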
23:08: all clear