
Incidents/2025-02-28 www.wikipedia.org redirect

From Wikitech

document status: draft

Summary

Incident metadata (see Incident Scorecard)
  • Incident ID: 2025-02-28 www.wikipedia.org redirect
  • Start: 2025-02-27 03:30:00
  • End: 2025-02-28 10:35:33
  • Task: T387549
  • People paged: 0
  • Responder count: 7
  • Coordinators: Jcrespo
  • Affected metrics/SLOs: 3xx varnish aggregated HTTP return codes of the text cluster
  • Impact: www.wikipedia.org (and only that) was 301-redirecting to itself in an infinite loop, preventing access for some users

An incorrect Apache configuration was deployed, which caused www.wikipedia.org to redirect to itself in an infinite loop, making the page inaccessible. Due to CDN caching this was not immediately apparent, as cached objects can last up to 24 hours. During the following night, as cached responses expired, the page gradually became unavailable to more users, with a slow ramp-up: a small number of users around 3:30, growing to a majority (but not all) of users by around 9 am (all times in UTC). The wiki application itself (content reading and editing) was not affected, but because the portal homepage is the entry point for searching content for some users (especially non-expert users), some people reported that Wikipedia was down, given the high visibility of the impact.
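This failure mode can be checked from the command line without a browser. A minimal sketch, assuming nothing about the production setup: the `redirect_target` stub below simulates what `curl -s -o /dev/null -w '%{redirect_url}' "$url"` would return for the incident (the portal answering a 301 pointing back at itself), and `follow_redirects` flags a loop as soon as a URL repeats.

```shell
# Sketch of a redirect-loop check. redirect_target is a stub simulating
# the incident (the portal 301s back to itself); in real use it would
# wrap: curl -s -o /dev/null -w '%{redirect_url}' "$1"
redirect_target() {
  case "$1" in
    https://www.wikipedia.org/) echo "https://www.wikipedia.org/" ;;
    *) echo "" ;;
  esac
}

# Follow Location targets, remembering URLs seen so far; stop and
# report as soon as one repeats.
follow_redirects() {
  url="$1"
  seen=" "
  next=$(redirect_target "$url")
  while [ -n "$next" ]; do
    case "$seen" in
      *" $next "*) echo "redirect loop detected at $next"; return ;;
    esac
    seen="$seen$next "
    url="$next"
    next=$(redirect_target "$url")
  done
  echo "no loop: final url $url"
}

follow_redirects "https://www.wikipedia.org/"
# prints: redirect loop detected at https://www.wikipedia.org/
```

Run against the live site (with the stub replaced by the real curl call), this would have reproduced the looping 301 whenever the request hit an expired or bypassed cache object.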

Timeline

All times in UTC.

27-feb:

28-feb:

  • ~03:30 excess 301 redirects start on eqiad. VISIBLE OUTAGE STARTS HERE (very slow ramp-up)
  • ~04:42 excess 301 redirects start on drmrs
  • 08:25 Task opened: Wikipedia central page (https://www.wikipedia.org) fails to load with a "Too Many Redirects" error (task T387549)
  • 09:29 jcrespo alerts on-call responders: we have a full-blown outage of https://www.wikipedia.org/. Initial reproduction discussion starts, as the issue cannot be replicated reliably (due to caching)
  • 09:36 Status panel update: "Investigating - We are aware that many users are having trouble accessing the portal www.wikipedia.org due to excessive redirects, and we are investigating."
  • 09:43 Incident opened. Jcrespo becomes IC.
  • 09:45-09:50 Vgutierrez, Elukey and Joe narrow the issue down to MediaWiki rather than the cache layer, and point to the previous day's deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1080357. The decision is to revert and purge caches.
  • 10:05 Revert is merged after CI runs, and deployment starts
  • 10:16 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1123601 (unrelated) is causing errors, preventing a clean revert
  • 10:20 Revert is applied on codfw and cache purging is done. The issue is fixed on codfw and dependent caches, but continues on eqiad
  • 10:35 Deploy fixing https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1123601 is done on both DCs, and the cache is purged on eqiad. OUTAGE ENDS HERE
  • 10:57 After several users confirm the issue is gone, the incident is considered resolved; status page and task updated

Detection

The slow ramp-up caused by cache expiration, happening at different rates in different locations, meant the issue could not be reproduced consistently and was not clearly visible for a long time.

No page or alert went off in the infrastructure. The issue was detected when users reported "wikipedia being down" on social media, and jcrespo pinged the people on call to start the incident response process.

A task was filed rather promptly, but it didn't reach SREs: it was (correctly) classified as a #Wikimedia-portals issue, but was not given an SRE/incident tag visible to real-time incident responders.

Conclusions

What went well?

  • When pinged, the right people with the right knowledge jumped in to help (on call, traffic, service ops, databases), even in domains they were not familiar with
  • Users promptly reported visible issues on external websites, social media and Phabricator
  • The status page reflected reality relatively promptly and was seen being actively used in the wild to get information about the status of the fix

What went poorly?

  • The fewer than 300 errors/hour seen from clients (against roughly 150,000 requests per second, a fraction well below one in a million) did not trigger any alarm in existing monitoring; the error rate was very low, but the errors were highly visible to www.wikipedia.org users
  • On change deploy, basic usability of the site was tested (if at all) through the cache layer, not via uncached mechanisms (command line, a browser with caching disabled, etc.)
  • No page or alert fired on either the MediaWiki side or the cache side for a very visible entry point to our websites
  • Only a small number of people have ever updated the status page, despite it being demonstrably very useful to end users
  • Incident documentation is hard to navigate and use, even for veteran incident responders
  • Tests in the commit were not thorough enough; such a scenario should have been covered by httpbb tests rather than the more rudimentary tests included
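For illustration, a regression test for this incident could be expressed in httpbb, whose tests are YAML files keyed by base URL with per-path assertions. This is a hedged sketch: the field names (path, assert_status) follow httpbb's documented format but should be verified against the current httpbb documentation on Wikitech before use.

```yaml
# Hypothetical httpbb test sketch for the portal; verify field names
# against the httpbb docs before relying on this.
https://www.wikipedia.org:
- path: /
  assert_status: 200
```

Run against an app server directly (bypassing the CDN) at deploy time, such a test would have failed on the looping 301 before the change reached users.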

Where did we get lucky?

  • If Jaime hadn't seen user complaints on social media, how much time would have passed before the incident response process was started?

Links to relevant documentation

Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.

Actionables

Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

Add the #Sustainability (Incident Followup) and the #SRE-OnFire Phabricator tag to these tasks.

Scorecard

Incident Engagement ScoreCard
Question Answer (yes/no) Notes
People Were the people responding to this incident sufficiently different than the previous five incidents? yes IC was not on call
Were the people who responded prepared enough to respond effectively? yes
Were fewer than five people paged? yes no one was paged
Were pages routed to the correct sub-team(s)? no
Were pages routed to online (business hours) engineers?  Answer “no” if engineers were paged after business hours. no no alert was raised
Process Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? no It was resolved days later/no summary
Was a public wikimediastatus.net entry created? yes it was done again by Jaime
Is there a phabricator task for the incident? yes It was not created by SRE
Are the documented action items assigned? no there are no concrete action items
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes This kind of Apache config issue has not happened in a while; Issue usually caught with httpbb
Tooling To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. yes
Were the people responding able to communicate effectively during the incident with the existing tooling? no Creating an official status page was difficult
Did existing monitoring notify the initial responders? no
Were the engineering tools that were to be used during the incident, available and in service? yes, but There was an unrelated incident that complicated issues
Were the steps taken to mitigate guided by an existing runbook? no Too specific to be a runbook-able thing, really
Total score (count of all “yes” answers above) 8