Catalyst/Incidents/2025-01-29

From Wikitech

Time to restore: 5:32

Timeline

  • 2025-01-29T17:36:03.719212+00:00: OOM killer starts thrashing and, per dmesg, keeps thrashing until the soft reboot at 18:38
  • 2025-01-29T17:42:00+00:00: esanders reports: "https://patchdemo.wmcloud.org/ is down for me, is that known?"
  • Noting that load in grafana is 400(?!)
  • 2025-01-29T18:38:01+00:00: soft reboot
  • 2025-01-29T18:43:00+00:00: reboot succeeds and SSH works; load reported by uptime spikes immediately as k3s spins up (5m avg load > 20)
  • 2025-01-29T20:51:17+00:00: Fix Catalyst Environments
  • ...: Resize k3s VM RAM 16GB -> 32GB
  • ...: Fix Catalyst Environments (again)
  • 2025-01-29T23:08:00+00:00: All clear

Symptoms

  • website doesn't load
  • cannot SSH into machine

Overview

  • Machine ran out of memory due to too many environments
    • Nothing malicious, normal use
  • OOM Killer caused full system lock-up
  • Reboot got us back online
  • Knock-on problems with catalyst environments added complications

Logs off of Horizon

[331315.051863] Memory cgroup out of memory: Killed process 5521 (mysqld) total-vm:4204604kB, anon-rss:380944kB, file-rss:6152kB, shmem-rss:0kB, UID:1001 pgtables:1204kB oom_score_adj:985
[687059.151280] Memory cgroup out of memory: Killed process 878386 (mysqld) total-vm:3942160kB, anon-rss:381940kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:1184kB oom_score_adj:985
[687059.156277] Memory cgroup out of memory: Killed process 878386 (mysqld) total-vm:3942160kB, anon-rss:382004kB, file-rss:2844kB, shmem-rss:0kB, UID:1001 pgtables:1184kB oom_score_adj:985
[687059.159804] Memory cgroup out of memory: Killed process 2225273 (runc:[2:INIT]) total-vm:1236700kB, anon-rss:3424kB, file-rss:5808kB, shmem-rss:0kB, UID:1001 pgtables:108kB oom_score_adj:985
[1051464.115192] Memory cgroup out of memory: Killed process 2225342 (mysqld) total-vm:4731292kB, anon-rss:381436kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:1252kB oom_score_adj:985
[1051464.118194] Memory cgroup out of memory: Killed process 3460173 (runc:[2:INIT]) total-vm:1236536kB, anon-rss:3176kB, file-rss:5816kB, shmem-rss:0kB, UID:1001 pgtables:108kB oom_score_adj:985
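
To see at a glance which processes the cgroup OOM killer targeted, the kernel lines above can be tallied by process name. This is a minimal sketch, assuming the log text arrives on stdin (for example from dmesg); the helper name summarize_oom_kills is made up for this example. It counts log lines, so a process that was logged twice is counted twice.

```shell
# Hypothetical helper: count "Killed process" log lines per process name.
# Reads kernel log text on stdin, prints "<count> <name>", highest count first.
summarize_oom_kills() {
  sed -n 's/.*out of memory: Killed process [0-9][0-9]* (\(.*\)) total-vm:.*/\1/p' \
    | sort | uniq -c | sort -rn | sed 's/^ *//'
}

# Usage (on the affected host):
#   dmesg | summarize_oom_kills
```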

After everything came back, new problems:

  • CPU running hot for certain php processes
  • Checked out /proc/<pid>/environ to find the helm deployment for an out-of-control php process
  • Traced to jobrunner on wiki-5648f3da62-146-mediawiki-b9b775bb8-l8fjv pod
    • Thrashing with errors (can't find jobs table in database)
    • All Catalyst wikis also reporting sql errors
    • Turns out, the mysql instances lost their tables
    • Tried to redeploy
      • initcontainer won't run again if pod is deleted
      • initcontainer runs install.sh/post-install.sh, which is how the database is set up, so we can't re-run it using kubectl; we'll have to do helm uninstall/reinstall
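
/proc/<pid>/environ is a NUL-separated list, so it has to be translated before it can be searched. A minimal sketch of the lookup described above, assuming the container's HOSTNAME variable carries the pod name (the Kubernetes default unless the pod spec overrides it). Which variable was actually inspected during the incident is not recorded here. The helper name pod_for_pid is made up, and its second argument exists only so the function can be pointed at a different proc root.

```shell
# Hypothetical helper: print the HOSTNAME recorded in a process's environment.
# Assumption: inside a Kubernetes container HOSTNAME defaults to the pod name.
# $1 = pid, $2 = proc root (defaults to /proc)
pod_for_pid() {
  tr '\0' '\n' < "${2:-/proc}/$1/environ" | sed -n 's/^HOSTNAME=//p'
}

# Usage (as root on the k3s host; <pid> is the busy php process):
#   pod_for_pid <pid>
```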

Noticed inotify.max_user_instances did not persist: https://phabricator.wikimedia.org/T383280

root@k3s:~# sysctl fs.inotify.max_user_instances
fs.inotify.max_user_instances = 128
root@k3s:~# sysctl -w fs.inotify.max_user_instances=1024
fs.inotify.max_user_instances = 1024
root@k3s:~# sysctl fs.inotify.max_user_instances
fs.inotify.max_user_instances = 1024
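
sysctl -w only changes the running kernel's value, which is consistent with the setting being back at 128 after the reboot. One common way to make a value survive reboots is a drop-in file under /etc/sysctl.d/. This is a minimal sketch, not a decided fix for T383280: it assumes the host applies /etc/sysctl.d/*.conf at boot and that nothing else (such as configuration management) overwrites the value. The helper name persist_sysctl and the file name it writes are invented for this example.

```shell
# Hypothetical helper: write a sysctl drop-in so a setting is reapplied at boot.
# $1 = key, $2 = value, $3 = directory (defaults to /etc/sysctl.d)
persist_sysctl() {
  dir="${3:-/etc/sysctl.d}"
  # The "90-<key>.conf" file name is an arbitrary choice for this sketch.
  printf '%s = %s\n' "$1" "$2" > "$dir/90-$1.conf"
}

# Usage (as root):
#   persist_sysctl fs.inotify.max_user_instances 1024
#   sysctl --system   # reload all sysctl configuration files now
```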

20:51:17: bringing catalyst envs back via helm uninstall/reinstall:

  • Get list of catalyst releases: KUBECONFIG=/etc/rancher/k3s/k3s.yaml helm list -n cat-env | awk '/wiki-/ { print $1 }' > list-envs.txt
  • Get copy of helm chart: git clone ci-charts
  • Copy all values from running deploys (that have no sql database): mkdir -p values; while read i; do helm get values -n cat-env "$i" -o yaml > values/"$i".yaml; done < list-envs.txt
  • Uninstall all helm releases: while read i; do helm uninstall -n cat-env "$i"; done < list-envs.txt
  • Reinstall all helm releases: while read i; do helm install -n cat-env "$i" ci-charts/mediawiki -f values/"$i".yaml; done < list-envs.txt
  • Troubleshoot any failing releases: Set debug.initContainer = true in values/<helm-release-name>.yaml

These commands were written as a collection of scripts here: https://gitlab.wikimedia.org/kindrobot/reinstall-bad-catalyst-envs
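
Because helm uninstall discards a release's stored values, it helps to confirm that every values file was actually written before uninstalling anything. The following is a minimal sketch of such a guard and is not taken from the published scripts; the function name save_values and the HELM and NS variables are assumptions made for this example.

```shell
# Sketch only: not the published scripts. HELM and NS are overridable so the
# loop can be exercised without touching the cluster.
HELM="${HELM:-helm}"
NS="${NS:-cat-env}"

# Save the values of every release named in list file $1 into directory $2.
# Returns non-zero as soon as a release yields an error or an empty file.
save_values() {
  mkdir -p "$2"
  while read -r rel; do
    "$HELM" get values -n "$NS" "$rel" -o yaml > "$2/$rel.yaml" || return 1
    [ -s "$2/$rel.yaml" ] || { echo "no values saved for $rel" >&2; return 1; }
  done < "$1"
}

# Usage:
#   save_values list-envs.txt values && echo "safe to uninstall"
```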

Upped memory on k3s VM 16GB -> 32GB. DB problem happened again(!) following the reboot after the VM resize :((

TODOS

  • Figure out inotify persistence
  • figure out database pv persistence problem
  • Done: publish the scripts we wrote during this incident for redeploying all catalyst patchdemos
  • Add an api call to reinstall a wiki
    • Add a button on UI?
  • Add CPU and memory limits for wikis
  • Done: Increase the memory on the k3s box
  • Make the init container sleep on failure so it can be debugged
  • Admin password for wikis is incorrect, fix that
  • Fix: if you leave the page then the patchdemo db doesn't update
  • Done: Turn off the checkbox to use patchdemo (until some of the above is better)
  • See if we can make the alerts in grafana actually notify us

23:08: all clear