
Incidents/2025-02-28 www.wikipedia.org redirect

From Wikitech

document status: draft

Summary

Incident metadata (see Incident Scorecard)
  • Incident ID: 2025-02-28 www.wikipedia.org redirect
  • Start: 2025-02-27 03:30:00
  • End: 2025-02-28 10:35:33
  • Task: T387549
  • People paged: 0
  • Responder count: 7
  • Coordinators: Jcrespo
  • Affected metrics/SLOs: 3xx varnish aggregated HTTP return codes of the text cluster
  • Impact: www.wikipedia.org (and only that) was 301-redirecting to itself in an infinite loop, preventing access for some users

An incorrect Apache configuration was deployed, which caused www.wikipedia.org to redirect to itself in an infinite loop, making the page inaccessible. Due to CDN caching this was not immediately apparent, as cached objects can last up to 24 hours. During the following night, as cached responses expired, the page gradually became unavailable to more users, with a slow ramp-up: a small number of users around 3:30, growing to a majority (but not all) of users by around 9 am (all times in UTC). The wiki application itself (content reading and editing) was not affected, but because the portal homepage is the entry point for searching content for some users (especially non-expert users), some people reported that Wikipedia was down, given the high visibility of the impact.
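This failure mode can be checked from the command line without a browser. A minimal sketch, assuming nothing about the production setup: the `redirect_target` stub below simulates what `curl -s -o /dev/null -w '%{redirect_url}' "$url"` would return for the incident (the portal answering a 301 pointing back at itself), and `follow_redirects` flags a loop as soon as a URL repeats.

```shell
# Sketch of a redirect-loop check. redirect_target is a stub simulating
# the incident (the portal 301s back to itself); in real use it would
# wrap: curl -s -o /dev/null -w '%{redirect_url}' "$1"
redirect_target() {
  case "$1" in
    https://www.wikipedia.org/) echo "https://www.wikipedia.org/" ;;
    *) echo "" ;;
  esac
}

# Follow Location targets, remembering URLs seen so far; stop and
# report as soon as one repeats.
follow_redirects() {
  url="$1"
  seen=" "
  next=$(redirect_target "$url")
  while [ -n "$next" ]; do
    case "$seen" in
      *" $next "*) echo "redirect loop detected at $next"; return ;;
    esac
    seen="$seen$next "
    url="$next"
    next=$(redirect_target "$url")
  done
  echo "no loop: final url $url"
}

follow_redirects "https://www.wikipedia.org/"
# prints: redirect loop detected at https://www.wikipedia.org/
```

Run against the live site (with the stub replaced by the real curl call), this would have reproduced the looping 301 whenever the request hit an expired or bypassed cache object.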

Timeline

All times in UTC.

27-feb:

28-feb:

  • ~03:30 excess 301 redirects start on eqiad. VISIBLE OUTAGE STARTS HERE (very slow ramp-up)
  • ~04:42 excess 301 redirects start on drmrs
  • 08:25 Task opened: Wikipedia central page (https://www.wikipedia.org) fails to load with a "Too Many Redirects" error (task T387549)
  • 09:29 jcrespo alerts on-call responders: we have a full-blown outage of https://www.wikipedia.org/. Initial reproduction discussion starts, as the issue cannot be replicated reliably (due to caching)
  • 09:36 Status panel update: "Investigating - We are aware that many users are having trouble accessing the portal www.wikipedia.org due to excessive redirects, and we are investigating."
  • 09:43 Incident opened. Jcrespo becomes IC.
  • 09:45-09:50 Vgutierrez, Elukey and Joe narrow the issue down to MediaWiki rather than the cache layer, and point to the previous day's deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1080357. The decision is to revert and purge caches.
  • 10:05 Revert is merged after CI runs, and deployment starts
  • 10:16 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1123601 (unrelated) is causing errors, preventing a clean revert
  • 10:20 Revert is applied on codfw and cache purging is done. The issue is fixed on codfw and dependent caches, but continues on eqiad
  • 10:35 Deploy fixing https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1123601 is done on both DCs, and the cache is purged on eqiad. OUTAGE ENDS HERE
  • 10:57 After several users confirm the issue is gone, the incident is considered resolved; status page and task updated

Detection

The slow ramp-up caused by cache expiration, happening at different rates in different locations, meant the issue could not be reproduced consistently and was not clearly visible for a long time.

No page or alert went off in the infrastructure. The issue was detected when users reported "wikipedia being down" on social media, and jcrespo pinged the people on call to start the incident response process.

A task was filed rather promptly, but it didn't reach SREs: it was (correctly) classified as a #Wikimedia-portals issue, but was not given an SRE/incident tag visible to real-time incident responders.

Conclusions

What went well?

  • When pinged, the right people with the right knowledge jumped in to help (on call, traffic, service ops, databases), even in domains they were not familiar with
  • Users promptly reported visible issues on external websites, social media and Phabricator
  • The status page reflected reality relatively promptly and was seen being actively used in the wild to get information about the status of the fix

What went poorly?

  • The fewer than 300 errors/hour seen from clients (against roughly 150,000 requests per second, a fraction well below one in a million) did not trigger any alarm in existing monitoring; the error rate was very low, but the errors were highly visible to www.wikipedia.org users
  • On change deploy, basic usability of the site was tested (if at all) through the cache layer, not via uncached mechanisms (command line, a browser with caching disabled, etc.)
  • No page or alert fired on either the MediaWiki side or the cache side for a very visible entry point to our websites
  • Only a small number of people have ever updated the status page, despite it being demonstrably very useful to end users
  • Incident documentation is hard to navigate and use, even for veteran incident responders
  • Tests in the commit were not thorough enough; such a scenario should have been covered by httpbb tests rather than the more rudimentary tests included
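For illustration, a regression test for this incident could be expressed in httpbb, whose tests are YAML files keyed by base URL with per-path assertions. This is a hedged sketch: the field names (path, assert_status) follow httpbb's documented format but should be verified against the current httpbb documentation on Wikitech before use.

```yaml
# Hypothetical httpbb test sketch for the portal; verify field names
# against the httpbb docs before relying on this.
https://www.wikipedia.org:
- path: /
  assert_status: 200
```

Run against an app server directly (bypassing the CDN) at deploy time, such a test would have failed on the looping 301 before the change reached users.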

Where did we get lucky?

  • If Jaime hadn't seen user complaints on social media, how much time would have passed before the incident response process was started?

Links to relevant documentation

Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.

Actionables

Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

Add the #Sustainability (Incident Followup) and the #SRE-OnFire Phabricator tag to these tasks.

Scorecard

Incident Engagement ScoreCard
Question Answer (yes/no) Notes
People Were the people responding to this incident sufficiently different than the previous five incidents? yes IC was not on call
Were the people who responded prepared enough to respond effectively? yes
Were fewer than five people paged? yes no one was paged
Were pages routed to the correct sub-team(s)? no
Were pages routed to online (business hours) engineers?  Answer “no” if engineers were paged after business hours. no no alert was raised
Process Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? no It was resolved days later/no summary
Was a public wikimediastatus.net entry created? yes it was done again by Jaime
Is there a phabricator task for the incident? yes It was not created by SRE
Are the documented action items assigned? no there are no concrete action items
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes This kind of Apache config issue has not happened in a while; Issue usually caught with httpbb
Tooling To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. yes
Were the people responding able to communicate effectively during the incident with the existing tooling? no Creating an official status page was difficult
Did existing monitoring notify the initial responders? no
Were the engineering tools that were to be used during the incident, available and in service? yes, but There was an unrelated incident that complicated issues
Were the steps taken to mitigate guided by an existing runbook? no Too specific to be a runbook-able thing, really
Total score (count of all “yes” answers above) 8