Catalyst/Incidents/2025-01-29
Time to restore: 5 hours 32 minutes
Timeline
- 2025-01-29T17:36:03.719212+00:00: OOM killer starts thrashing and continues until the soft reboot at 18:38, per dmesg
- 2025-01-29T17:42:00+00:00: esanders: "https://patchdemo.wmcloud.org/ is down for me, is that known?"
- Noted that load in Grafana is 400 (?!)
- 2025-01-29T18:38:01+00:00 soft reboot
- 2025-01-29T18:43:00+00:00 successful reboot + SSH; load immediately spikes as k3s spins up (5m avg load > 20)
- 2025-01-29T20:51:17+00:00: Fix Catalyst Environments
- ...: Resize k3s VM RAM 16 GB -> 32 GB
- ...: Fix Catalyst Environments (again)
- 2025-01-29T23:08:00+00:00: All clear
Symptoms
- website doesn't load
- cannot SSH into machine
Overview
- Machine ran out of memory due to too many environments
- Nothing malicious, normal use
- OOM Killer caused full system lock-up
- Reboot got us back online
- Knock-on problems with catalyst environments added complications
Logs pulled from Horizon
[331315.051863] Memory cgroup out of memory: Killed process 5521 (mysqld) total-vm:4204604kB, anon-rss:380944kB, file-rss:6152kB, shmem-rss:0kB, UID:1001 pgtables:1204kB oom_score_adj:985
[687059.151280] Memory cgroup out of memory: Killed process 878386 (mysqld) total-vm:3942160kB, anon-rss:381940kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:1184kB oom_score_adj:985
[687059.156277] Memory cgroup out of memory: Killed process 878386 (mysqld) total-vm:3942160kB, anon-rss:382004kB, file-rss:2844kB, shmem-rss:0kB, UID:1001 pgtables:1184kB oom_score_adj:985
[687059.159804] Memory cgroup out of memory: Killed process 2225273 (runc:[2:INIT]) total-vm:1236700kB, anon-rss:3424kB, file-rss:5808kB, shmem-rss:0kB, UID:1001 pgtables:108kB oom_score_adj:985
[1051464.115192] Memory cgroup out of memory: Killed process 2225342 (mysqld) total-vm:4731292kB, anon-rss:381436kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:1252kB oom_score_adj:985
[1051464.118194] Memory cgroup out of memory: Killed process 3460173 (runc:[2:INIT]) total-vm:1236536kB, anon-rss:3176kB, file-rss:5816kB, shmem-rss:0kB, UID:1001 pgtables:108kB oom_score_adj:985
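For reference, a minimal sketch of pulling these entries from the kernel ring buffer once SSH access is back (assumes root on the k3s VM):
dmesg -T | grep -i 'out of memory'   # -T prints human-readable timestamps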
After everything came back, new problems:
- CPU running hot for certain PHP processes
- Inspected /proc/<pid>/environ to find the Helm deployment for an out-of-control PHP process (see the sketch after this list)
- Traced to the jobrunner on pod wiki-5648f3da62-146-mediawiki-b9b775bb8-l8fjv
- It was thrashing with errors (can't find the jobs table in the database)
- All Catalyst wikis were also reporting SQL errors
- Turns out the MySQL instances lost their tables
- Tried to redeploy
- The initContainer won't run again if the pod is deleted
- The initContainer runs install.sh/post-install.sh, which is how the database gets set up, so we can't re-run it with kubectl; we'll have to do a helm uninstall/reinstall
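A minimal sketch of the /proc inspection described above (the PID is hypothetical; which variables show up depends on what the chart injects into the container):
PID=12345                           # hypothetical PID of the runaway PHP process
tr '\0' '\n' < /proc/$PID/environ   # environ is NUL-separated; print one variable per line
cat /proc/$PID/cgroup               # the cgroup path also reveals the owning pod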
Noticed fs.inotify.max_user_instances did not persist: https://phabricator.wikimedia.org/T383280
root@k3s:~# sysctl fs.inotify.max_user_instances
fs.inotify.max_user_instances = 128
root@k3s:~# sysctl -w fs.inotify.max_user_instances=1024
fs.inotify.max_user_instances = 1024
root@k3s:~# sysctl fs.inotify.max_user_instances
fs.inotify.max_user_instances = 1024
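A minimal sketch of how the value could be persisted across reboots (standard sysctl.d approach; the filename is hypothetical and this was not applied during the incident):
echo 'fs.inotify.max_user_instances = 1024' > /etc/sysctl.d/99-inotify.conf   # hypothetical drop-in file
sysctl --system   # reload all sysctl config files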
20:51:17: Bringing Catalyst envs back via helm uninstall/reinstall:
- Get list of catalyst releases:
KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm list -n cat-env | awk '/wiki-/ { print $1 }' > list-envs.txt
- Get copy of helm chart: git clone ci-charts
- Copy all values from running deploys (the ones that now have no SQL database):
while read i; do helm get values -n cat-env "$i" -o yaml > values/"$i".yaml; done < list-envs.txt
- Uninstall all helm releases:
while read i; do helm uninstall -n cat-env "$i"; done < list-envs.txt
- Reinstall all helm releases:
while read i; do helm install -n cat-env "$i" ci-charts/mediawiki -f values/"$i".yaml; done < list-envs.txt
- Troubleshoot any failing releases: set debug.initContainer = true in values/<helm-release-name>.yaml
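A minimal sketch of redeploying a single failing release with that flag (assuming the chart value can also be passed via --set; <release> is a placeholder):
helm upgrade --install -n cat-env <release> ci-charts/mediawiki -f values/<release>.yaml --set debug.initContainer=true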
These commands were written as a collection of scripts here: https://gitlab.wikimedia.org/kindrobot/reinstall-bad-catalyst-envs
Upped memory on the k3s VM from 16 GB to 32 GB. The DB problem happened again(!) following the reboot after the VM resize :((
TODOS
- Figure out inotify persistence
- Figure out the database PV persistence problem
- Done: Publish the scripts we wrote during this incident for redeploying all Catalyst patchdemos
- Add an api call to reinstall a wiki
- Add a button on UI?
- Add CPU and memory limits for wikis (see the sketch after this list)
- Done: Increase the memory on the k3s box
- Sleep if the init container fails, to allow debugging
- Admin password for wikis is incorrect; fix that
- Fix: if you leave the page, the patchdemo DB doesn't update
- Done: Turn off the checkbox to use patchdemo (until some of the above is better)
- See if we can make Grafana alerts notify us
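For the CPU/memory limits item, a minimal sketch of a one-off cap on a single wiki deployment (placeholder names and values; the durable fix would set resources in the Helm chart values instead):
kubectl set resources deployment/<wiki-deployment> -n cat-env --requests=cpu=100m,memory=512Mi --limits=cpu=1,memory=2Gi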
23:08: all clear