Incidents/2024-04-17 mw-on-k8s eqiad outage
document status: draft
Summary
Incident ID | 2024-04-17 mw-on-k8s eqiad outage | Start | 2024-04-17 09:10:00 |
---|---|---|---|
Task | T362766 | End | 2024-04-17 09:50:00 |
People paged | | Responder count | 5 |
Coordinators | 1 | Affected metrics/SLOs | |
Impact | According to Traffic's graphs, from HAProxy's point of view, non-5XX requests (text + upload) dropped from 138k rps to 120k-130k rps for about 20 minutes. | | |
mcrouter daemonset on mw-on-k8s: The mediawiki pod has 9 containers. We were working on reducing this number to 7 by introducing the mw-mcrouter service. The end goal was that each mw-on-k8s pod would use a standalone mcrouter pod (run as a DaemonSet) on the same node, instead of its own mcrouter container. From MediaWiki's point of view, mcrouter's location would be mcrouter-main.mw-mcrouter.svc.cluster.local:4442 instead of 127.0.0.1:11213. The same change had been deployed on codfw the day before, but codfw receives less traffic.
This change increased the number of DNS requests towards CoreDNS from an average of 40k req/s to 110k req/s, overwhelming the CoreDNS pods.
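Why a service FQDN is so much more expensive than 127.0.0.1 here comes down to the pod resolver configuration: Kubernetes pods get a resolv.conf with ndots:5 and a cluster search path, and mcrouter-main.mw-mcrouter.svc.cluster.local has only four dots, so every resolution walks the search domains first (each attempt typically sent as both an A and an AAAA query) before the name is tried as given. The sketch below is illustrative only: ndots:5 is the Kubernetes default, but the search domains and namespace are assumptions, not the exact wikikube configuration.

```python
# Illustrative sketch (not the production resolver): approximate glibc-style
# search-list expansion under Kubernetes' default ndots:5, to show why the
# un-terminated FQDN multiplies DNS queries. The search domains and the
# "mediawiki" namespace below are assumptions for the example.

NDOTS = 5  # Kubernetes default for pod resolv.conf
SEARCH = [
    "mediawiki.svc.cluster.local",  # assumed pod namespace
    "svc.cluster.local",
    "cluster.local",
]

def candidate_names(name: str) -> list[str]:
    """Names tried, in order, for a single lookup of `name`."""
    if name.endswith("."):
        return [name]  # absolute name: no search-list expansion
    candidates = []
    if name.count(".") < NDOTS:
        # Below ndots: every search domain is tried before the name itself.
        candidates = [f"{name}.{domain}." for domain in SEARCH]
    return candidates + [name + "."]

for name in ("mcrouter-main.mw-mcrouter.svc.cluster.local",
             "mcrouter-main.mw-mcrouter.svc.cluster.local."):
    tried = candidate_names(name)
    print(f"{name!r}: {len(tried)} candidate lookup(s)")
    for t in tried:
        print("   ", t)
```

Under these assumptions the un-terminated name produces four candidate lookups per resolution (typically eight queries once A and AAAA are counted) against one for the dot-terminated form, which is why terminating the FQDN with a trailing dot (one of the mitigations listed under Actions below) sharply reduces the per-lookup cost.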
Status at ~09:20 UTC:
- scap was blocked waiting for the deployment of mw-on-k8s to finish
- during the deployment, the mediawiki pods never became ready, and after a while scap attempted to roll back
- the CoreDNS pods (3 replicas) were overwhelmed and repeatedly OOM-killed, leaving them in a CrashLoopBackOff state
Actions:
- depooled mediawiki reads from eqiad (via discovery)
- increased memory limits and replicas for CoreDNS on the wikikube clusters
- terminated the mcrouter FQDN used by MediaWiki with a trailing dot: mcrouter-main.mw-mcrouter.svc.cluster.local.:4442
- reverted eqiad to use in-pod mcrouter container
- pooled eqiad back
Commits:
- https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1020778
- https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1020765
- https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1020774
Graphs:
Timeline
SAL entry: https://sal.toolforge.org/log/yk9Q644BGiVuUzOdxNwu
All times in UTC.
- 09:08 effie runs scap sync-world to deploy "mediawiki deployments: use mcrouter daemonset for both DCs" (T346690)
- 09:10 antoine observes a higher rate of mediawiki-related events arriving in logstash
- 09:18 OUTAGE BEGINS
- 09:18 ALERT: (PHPFPMTooBusy) firing: (3) Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 30.2% idle
- 09:26 ALERT: (MediaWikiLatencyExceeded) firing: (4) p75 latency high: eqiad mw-api-ext (k8s) 6.79s
- 09:44 claime depools eqiad in mw-web-ro, mw-api-ext-ro, mw-api-int-ro
- 09:44 OUTAGE ENDS
- 10:00 akosiaris manually bumps coredns pods to 6 (eqiad+codfw)
- 10:16 effie merges and deploys the relevant code changes and reverts: https://gerrit.wikimedia.org/r/1020768 and https://gerrit.wikimedia.org/r/1020774
- 10:53 effie pools back eqiad for mw-web-ro, mw-api-ext-ro, mw-api-int-ro
Detection
Antoine noticed an elevated number of events coming from the mediawiki channel on logstash. A few minutes later we got our first alert that we were running out of available PHP-FPM workers.
Conclusions
What went well?
- Everyone in ServiceOps was around
- Janis quickly figured out that we were missing the final dot in the FQDN
What went poorly?
- Although we had deployed the very same change on codfw the day before, we did not properly analyse its impact.
Where did we get lucky?
- No luck.
Links to relevant documentation
Actionables
TBA
Add the #Sustainability (Incident Followup) and the #SRE-OnFire Phabricator tag to these tasks.
Scorecard
 | Question | Answer (yes/no) | Notes |
---|---|---|---|
People | Were the people responding to this incident sufficiently different than the previous five incidents? | | |
 | Were the people who responded prepared enough to respond effectively? | | |
 | Were fewer than five people paged? | | |
 | Were pages routed to the correct sub-team(s)? | | |
 | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | | |
Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | | |
 | Was a public wikimediastatus.net entry created? | | |
 | Is there a phabricator task for the incident? | | |
 | Are the documented action items assigned? | | |
 | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | | |
Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | | |
 | Were the people responding able to communicate effectively during the incident with the existing tooling? | | |
 | Did existing monitoring notify the initial responders? | | |
 | Were the engineering tools that were to be used during the incident, available and in service? | | |
 | Were the steps taken to mitigate guided by an existing runbook? | | |
 | Total score (count of all “yes” answers above) | | |