Incidents/2025-02-28 www.wikipedia.org redirect
document status: draft
Summary
Incident ID | 2025-02-28 www.wikipedia.org redirect | Start | 2025-02-27 03:30:00 |
---|---|---|---|
Task | T387549 | End | 2025-02-28 10:35:33 |
People paged | 0 | Responder count | 7 |
Coordinators | Jcrespo | Affected metrics/SLOs | |
Impact | www.wikipedia.org (and only that URL) was 301-redirecting to itself in an infinite loop, preventing access for some users |

An incorrect Apache configuration was deployed, which caused www.wikipedia.org to redirect to itself in an infinite loop, making the page inaccessible. Due to CDN caching, the impact was not immediately apparent, as cached objects can live for up to 24 hours. During the following night, as cache servers expired the previously cached result, the page gradually became unavailable to some users, with a slow ramp-up: starting around 3:30 for a small number of users and expanding to a majority (but not all) of users by around 9:00 (all times in UTC). The wiki application itself (content reading and editing) was not affected, but because the homepage is the entry point for searching content for some users (especially non-expert users), some people reported that Wikipedia was down, due to the high visibility of the impact.
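The failure mode can be illustrated with a short sketch (hypothetical code, not part of the deployed configuration or any production tooling): a client that follows Location headers from www.wikipedia.org ends up back at the URL it started from.

```python
def follow_redirects(start_url, get_location, max_redirs=10):
    """Follow Location headers; return (chain, looped).

    `get_location(url)` stands in for an HTTP HEAD request: it returns
    the Location header of a 301/302 response, or None for a final
    (non-redirect) response. Hypothetical helper for illustration.
    """
    chain = [start_url]
    url = start_url
    for _ in range(max_redirs):
        nxt = get_location(url)
        if nxt is None:
            return chain, False          # terminal response, no loop
        if nxt in chain:
            return chain + [nxt], True   # loop detected
        chain.append(nxt)
        url = nxt
    return chain, True                   # gave up after max_redirs

# Simulating the incident: the portal 301-redirects to itself.
responses = {"https://www.wikipedia.org/": "https://www.wikipedia.org/"}
chain, looped = follow_redirects("https://www.wikipedia.org/", responses.get)
```

Browsers do the equivalent internally, which is why users saw a "Too Many Redirects" error rather than a page.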
Timeline
All times in UTC.
27 Feb:
- 13:42 - deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1080357 (due to caching, initial impact is effectively zero)
28 Feb:
- ~03:30 - excess 301 redirects start on eqiad (very slow ramp-up). VISIBLE OUTAGE STARTS HERE
- ~04:42 - excess 301 redirects start on drmrs
- 08:25 - task T387549 opened: "Wikipedia central page (https://www.wikipedia.org) fails to load with Too Many Redirects error"
- 09:29 - jcrespo alerts on-call responders: we have a full-blown outage on https://www.wikipedia.org/. Initial reproduction discussion starts, as the issue cannot be replicated reliably (due to caching)
- 09:36 - status page updated to Investigating: "We are aware that many users are having trouble accessing the portal www.wikipedia.org for excessive redirects, and we are investigating."
- 09:43 - incident opened. Jcrespo becomes IC.
- 09:45-09:50 - Vgutierrez, Elukey and Joe narrow the issue down to MediaWiki rather than the cache layer, and point to the previous day's deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1080357. The decision is to revert and purge caches.
- 10:05 - revert is merged after CI runs, and deployment starts
- 10:16 - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1123601 (unrelated) is causing errors, preventing a clean revert
- 10:20 - revert is deployed on codfw and cache purging done. Issue is fixed on codfw and dependent caches, but continues on eqiad
- 10:35 - deploy to fix https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1123601 is done on both DCs, cache purged on eqiad. OUTAGE ENDS HERE
- 10:57 - after several users confirm the issue is gone, the incident is considered resolved; status page and task updated
Detection

No page or alert went off in the infrastructure. The issue was detected when users reported "Wikipedia being down" on social media, and jcrespo pinged the people on call to start the incident response process.
A task was filed promptly, but it didn't reach SREs: it was (correctly) classified as a #Wikimedia-portals issue, without an SRE/incident tag visible to real-time incident responders.
Conclusions
What went well?
- When pinged, the right people with the right knowledge jumped in to help (on call, traffic, service ops, databases), even in domains they were not familiar with
- Users promptly reported visible issues on external websites, social media and Phabricator
- The status page reflected reality relatively promptly and was actively used in the wild to get information about the status of the fix
What went poorly?

- On change deploy, basic usability of the site was tested (if at all) only through the cache layer, not via uncached mechanisms (command line, a browser with an empty cache, etc.)
- No page or alert went off on either the MediaWiki side or the cache side for a very visible entry point to our websites
- Only a small number of people have ever updated the status page, despite it being demonstrably very useful to end users
- Incident documentation is hard to navigate and use, even for veteran incident responders
- Tests in the commit were not thorough enough: such a scenario should have been covered by httpbb tests instead of the more rudimentary tests included
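A deploy-time smoke check for the uncached path could look like the following sketch. This is hypothetical code, not an existing test: `fetch` stands in for an HTTP client that does not follow redirects (e.g. a single HEAD request sent directly to an app server, bypassing the CDN).

```python
def check_no_self_redirect(url, fetch):
    """Fail if `url` redirects back to itself.

    `fetch(url)` must return (status, location_header_or_None) WITHOUT
    following redirects. Hypothetical helper for illustration; in
    practice this would be an uncached request to a backend server.
    """
    status, location = fetch(url)
    if status in (301, 302, 307, 308) and location in (url, url.rstrip("/")):
        raise AssertionError(
            f"{url} redirects to itself ({status} -> {location})")
    return status, location

# A healthy portal returns 200 with no Location header:
status, _ = check_no_self_redirect("https://www.wikipedia.org/",
                                   lambda u: (200, None))
```

Because the check runs against the uncached path, it would have caught the loop immediately at deploy time, instead of hours later as CDN cache entries expired.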
Where did we get lucky?
- If Jaime hadn't seen user complaints on social media, how much longer would it have taken to start the incident response process?
Links to relevant documentation
- …
Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.
Actionables
- Rewrite a more thorough patch for the intended behavior that includes checks for infinite redirect (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1123622)
Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.
Add the #Sustainability (Incident Followup) and the #SRE-OnFire Phabricator tag to these tasks.
Scorecard
 | Question | Answer (yes/no) | Notes |
---|---|---|---|
People | Were the people responding to this incident sufficiently different than the previous five incidents? | yes | IC was not on call |
Were the people who responded prepared enough to respond effectively? | yes | |
Were fewer than five people paged? | yes | no one was paged | |
Were pages routed to the correct sub-team(s)? | no | ||
Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | no | no alert was raised | |
Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | no | It was resolved days later/no summary |
Was a public wikimediastatus.net entry created? | yes | it was done again by Jaime | |
Is there a phabricator task for the incident? | yes | It was not created by SRE | |
Are the documented action items assigned? | no | there are no concrete action items | |
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | yes | This kind of Apache config issue has not happened in a while; Issue usually caught with httpbb | |
Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | yes | |
Were the people responding able to communicate effectively during the incident with the existing tooling? | no | Creating an official status page was difficult | |
Did existing monitoring notify the initial responders? | no | ||
Were the engineering tools that were to be used during the incident, available and in service? | yes, but | There was an unrelated incident that complicated issues | |
Were the steps taken to mitigate guided by an existing runbook? | no | Too specific to be a runbook-able thing, really | |
Total score (count of all “yes” answers above) | 8 |