Incidents/2022-12-18 World Cup
document status: draft
Summary
| Incident ID | 2022-12-18 World Cup | Start | 2022-12-18 17:58:00 |
|---|---|---|---|
| Task | T325477 | End | 2022-12-18 18:22:00 |
| People paged | 34 | Responder count | 6 (initial page; later others took over final corrections) |
| Coordinators | Jcrespo | Affected metrics/SLOs | ? |
| Impact | All API users experienced 5XX errors (11-12 thousand errors per second) or unreasonable latencies for 24 minutes. After that, performance was degraded in the form of increased, but more reasonable, latency for around 3 hours. The edit rate dropped during the 24-minute hard outage but recovered quickly afterwards (to a higher level than usual). | | |
For approximately 24 minutes, uncached calls to the API on the eqiad datacenter overloaded the application servers: all worker threads were busy, producing unreasonable latency or failing to respond to requests at all. As a result, errors were sent to some clients using the action API and Parsoid in the primary datacenter. Elevated latencies persisted for the following 3 hours, until traffic load went down organically.
This was caused by the DiscussionTools MediaWiki extension adding a ResourceLoader module on almost all page views (even non-discussion pages), which triggered an uncached API call. Combined with a significant 50% increase in overall traffic, this led to an overload and increased latencies on the API cluster. The codfw app server cluster was mostly unaffected because it was not receiving uncached POST traffic at the time (it was read-only).
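To see why this combination saturates a fixed worker pool, a back-of-the-envelope check with Little's Law (busy workers ≈ request rate × service time) is enough. The sketch below is illustrative only: apart from the ~2k req/s discussiontoolspageinfo peak and the 50% traffic increase mentioned in this report, every number (pool size, baseline rate, service time) is an assumption.

```python
# Back-of-the-envelope capacity check using Little's Law:
#   busy_workers ≈ request_rate * mean_service_time
# Only the ~2k req/s discussiontoolspageinfo peak and the 50% traffic bump
# come from this report; all other figures are illustrative assumptions.

WORKER_POOL = 4000        # assumed total php-fpm workers on the eqiad API cluster
BASELINE_API_RPS = 8000   # assumed baseline uncached API request rate
EXTRA_DT_RPS = 2000       # observed peak of action=discussiontoolspageinfo calls
SERVICE_TIME_S = 0.35     # assumed mean time a worker is busy per request


def busy_workers(rps: float, service_time: float = SERVICE_TIME_S) -> float:
    """Average number of workers kept busy by a given request rate."""
    return rps * service_time


normal_load = busy_workers(BASELINE_API_RPS + EXTRA_DT_RPS)
spike_load = busy_workers((BASELINE_API_RPS + EXTRA_DT_RPS) * 1.5)  # 50% increase

print(f"busy workers at normal load: {normal_load:.0f} / {WORKER_POOL}")
print(f"busy workers during spike:   {spike_load:.0f} / {WORKER_POOL}")
print("pool saturated" if spike_load >= WORKER_POOL else "headroom left")
```

With these assumed numbers the cluster is already close to saturation before the spike, so a 50% traffic increase pushes it past the total worker count, matching the observed state of all workers being busy.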
Timeline
All times in UTC.
8 December 2022
- 22:17 <samtar@deploy1002> Started scap: Backport for Start mobile DiscussionTools A/B test (T321961). High traffic to action=discussiontoolspageinfo is observed (retroactively, on the 19th), with a daily peak of around 2k req/s.
18 December 2022
- 17:55 Argentina scores the penalty that wins their national team the World Cup. A 50% traffic increase starts (there had been larger spikes in the past without causing load issues).
- 17:58 API app servers get overloaded with requests. OUTAGE STARTS HERE
- 17:59 4 pages are sent: PHPFPMTooBusy (api_appserver & parsoid) / FrontendUnavailable (HAProxy & varnish)
- 18:XX Thinking the problem is load related, a restart is considered not the right move at the moment
- 18:20 A restart of the app servers is attempted in a desperate attempt to fix something, but it fails:
  END (FAIL) - Cookbook sre.mediawiki.restart-appservers (exit_code=99)
- 18:22 Load goes down enough for worker threads to no longer be 100% busy. OUTAGE ENDS (a degraded state with increased latencies persists)
- 18:55 A rolling restart is finally run to try to improve the latency issues (one server at a time, with a 30-second pause between batches):
  sudo cumin -b 1 -s 30 'A:mw-api and A:eqiad' 'restart-php7.4-fpm'
- 19:31 The rolling restart finishes; latencies are still high
- 19:42 DiscussionTools is suspected to be the culprit based on utilization metrics, but no one is around to debug it
- 19:46 Latencies decrease (due to an organic traffic reduction, not to any action taken) to a degraded, but not outage-like, state. It is decided to mark the ongoing outage as "resolved" and to ask for debugging help the following day.
19 December 2022
- 03:08 tstarling@deploy1002: Synchronized wmf-config/InitialiseSettings.php: disable wgDiscussionToolsABTest task T325477 task T321961 (duration: 15m 23s)
- 09:15 ladsgroup@deploy1002: Finished scap: Backport for Emergency: discussiontoolspageinfo return empty response in non-talk ns (T325477) (duration: 09m 24s)
- 10:28 taavi@deploy1002: Finished scap: Backport for Only preload getPageData if there's thread data for the page (T325477) (duration: 07m 58s)
(any one of these three patches would have resolved the issue)
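The actual fixes live in the DiscussionTools extension (PHP/JS); purely as an illustration of the guard they introduce (do not trigger the extra uncached API call unless the page can actually have discussion threads), here is a minimal sketch. All names in it are hypothetical stand-ins, not the real code.

```python
# Illustrative sketch only: the real patches are in the DiscussionTools
# extension (PHP/JS). PageContext, api_get and the namespace IDs below are
# hypothetical stand-ins used to show the guard logic.
from dataclasses import dataclass

TALK_NAMESPACES = {1, 3, 5}  # hypothetical set of talk-namespace IDs


@dataclass
class PageContext:
    namespace: int
    has_thread_data: bool


def should_preload_page_info(page: PageContext) -> bool:
    """Decide whether the extra (uncached) page-info API call is worth making."""
    if page.namespace not in TALK_NAMESPACES:
        # Mirrors "return empty response in non-talk ns": non-talk pages
        # never need discussion-tools page info, so skip the call entirely.
        return False
    # Mirrors "only preload getPageData if there's thread data for the page".
    return page.has_thread_data


def maybe_fetch_page_info(page: PageContext, api_get):
    """Call the API only when the guard allows it; otherwise do nothing."""
    if should_preload_page_info(page):
        return api_get(action="discussiontoolspageinfo")
    return None
```

Either guard alone removes the per-pageview API call on the vast majority of page views, which is why any one of the three patches would have been sufficient.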
Detection
Monitoring and paging worked as intended, paging everybody (it was a weekend) as soon as the issue became major at 17:56:
- 17:59 Service: [FIRING:1] ProbeDown (10.2.2.22 ip4 api-https:443 probes/service http_api-https_ip4 ops page eqiad prometheus sre)
- 18:00 Service: [FIRING:1] PHPFPMTooBusy api_appserver (ops php7.4-fpm.service page eqiad prometheus sre)
- 18:00 Service: [FIRING:1] FrontendUnavailable (varnish-text page thanos sre)
- 18:01 Service: [FIRING:1] FrontendUnavailable cache_text (page thanos sre)
A task was also created by a community member at 19:03, when the team was already analyzing the issue.
However, there were reports that "php fpm busy has been flapping all weekend, just not enough to page".
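On the "flapping" remark: alerts of the PHPFPMTooBusy kind typically page only when the fraction of busy php-fpm workers stays above a threshold for a sustained window, so brief weekend spikes warn without paging. The sketch below shows that general shape only; the threshold, window and sample values are assumptions, not the actual Prometheus alerting rule.

```python
# Illustrative only: the real check is a Prometheus alerting rule; the 0.6
# threshold, 5-sample window and sample data here are assumptions.

BUSY_RATIO_THRESHOLD = 0.6   # assumed fraction of busy workers that warrants a page
SUSTAINED_SAMPLES = 5        # assumed number of consecutive samples required


def should_page(busy_ratios: list[float]) -> bool:
    """Page only if the busy-worker ratio stays above the threshold long enough.

    Brief excursions above the threshold ("flapping") do not page, which
    matches the weekend behaviour described above.
    """
    recent = busy_ratios[-SUSTAINED_SAMPLES:]
    return len(recent) == SUSTAINED_SAMPLES and all(
        r >= BUSY_RATIO_THRESHOLD for r in recent
    )


# Flapping: crosses the threshold briefly, never for 5 consecutive samples.
print(should_page([0.4, 0.7, 0.5, 0.65, 0.45]))   # False -> no page
# Sustained overload like the 17:58-18:22 window.
print(should_page([0.9, 0.95, 1.0, 1.0, 1.0]))    # True  -> page
```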
Conclusions
What went well?
- Multiple people reacted to the page even though it arrived during the World Cup final
- Good monitoring insights helped pinpoint DiscussionTools as the likely cause of the incident and World Cup traffic as the aggravating factor
What went poorly?
- The traffic spike due to the World Cup came as quite a surprise, even though the event had been scheduled months in advance
- Knowledge and documentation of the sre.mediawiki.restart-appservers cookbook were lacking (the SERVICE name was unknown); it took a while to find the cumin + "restart-php7.4-fpm" approach
- The increased load following the 8 December deployment was not noticed, or was not connected to the deployment
- The outage was caused by a known problem documented in a TODO comment (T325477#8476954)
Where did we get lucky?
- Traffic declined naturally and the incident resolved on its own
- A DoS between 18:17 and 18:36 had no big impact on performance (?); rate limiting worked (?)
Links to relevant documentation
Service restarts#Application servers (also image/video scalers and job runners and parsoid)
Actionables
- Task T325739: Fix the sre.mediawiki.restart-appservers cookbook and its documentation (Done)
- Patch 869169 (Done)
- Patch 869163 (Done)
- Task T325598: Avoid uncached action=discussiontoolspageinfo API calls on page load (Done)
- Task T321961 (Done)
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | yes | |
| | Were the people who responded prepared enough to respond effectively? | no | SRE had no insight into the existence of the ongoing DiscussionTools A/B test |
| | Were fewer than five people paged? | no | |
| | Were pages routed to the correct sub-team(s)? | no | |
| | Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. | no | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | yes | |
| | Was a public wikimediastatus.net entry created? | yes | https://www.wikimediastatus.net/incidents/2w4ygdj4vc20 |
| | Is there a phabricator task for the incident? | yes | T325477 |
| | Are the documented action items assigned? | yes | See above; also completed, so this incident doesn't repeat |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | no | |
| Tooling | To the best of your knowledge, was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. | no | Outage was caused by a known problem documented in a TODO comment (T325477#8476954) |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | yes | |
| | Did existing monitoring notify the initial responders? | yes | |
| | Were the engineering tools that were to be used during the incident available and in service? | yes | |
| | Were the steps taken to mitigate guided by an existing runbook? | no | Runbook failed |
| | Total score (count of all "yes" answers above) | 8 | |