Incidents/2025-03-12 ExternalStorage Database Cluster Overload


document status: final

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2025-03-12 ExternalStorage Database Cluster Overload
Task: T389498
Start: 2025-03-12 10:44:00, 2025-03-13 11:15:00
End: 2025-03-12 13:44:00, 2025-03-13 14:15:00
People paged: 2 (oncallers)
Responder count: 9 (round 1: volans, jynus, joe, claime; round 2: Amir, brouberol, gehel, gmodena; round 3: Amir, marostegui, joe, effie, hugh)
Coordinators: 3 (effie, jynus, volans)
Affected metrics/SLOs: No relevant SLOs exist
Impact: Minimal user-facing impact: global traffic had a ~0.03-0.04% error rate and edits were affected for a very short period of time, while p50 and p75 latency for Web and External API were not affected.

TL;DR: Within a span of two days, we observed high latency and database errors on mw-{api-int,parsoid,jobrunner}, all of which are deployments supporting MediaWiki for non-user-facing purposes. Our External Storage cluster experienced a significant load increase and a high number of connections. Notably, user-facing traffic was only minimally impacted by this incident.

During the scheduled PHP 8.1 Scap rollout for mw-{api-int,parsoid,jobrunner} on March 12th and 13th 2025, those deployments ended up running without access to the Memcached cluster and without a valid Prometheus StatsD address.

Without Memcached access, MediaWiki had to query the databases for objects that would typically be cached, leading to increased database load. At the time, MediaWiki was pushing stats to both Graphite and prometheus-statsd-exporter, so the lack of a valid Prometheus StatsD address had no impact.

During the incidents, we observed high load and a sharp increase in connections to the External Storage cluster, while the majority of MediaWiki errors (channel: error) visible at the time were database related. The External Storage hosts are used to store the compressed text content of wiki page revisions.

However, there was an overwhelming volume of memcached error events (channel: memcached, ~2 million/min) which Logstash was unable to process quickly enough, resulting in an observability gap where:

  • The channel:memcached Logstash dashboard itself most likely could not be rendered until after the incident, once Logstash had consumed all of its backlog.
  • Delays in the pipeline caused the MediaWikiMemcachedHighErrorRate alert to be effectively missing, while the corresponding Grafana graph displayed no data during the incidents.

TODO: Assess actual user impact based on traffic metrics.

Graphs referenced during the incident:

MediaWiki / Databases / Logstash / WANCache

  • MediaWiki Memcached errors collected by Prometheus spiked at 50k/min; however, due to the pipeline being overloaded, the respective alert fired and quickly resolved.
  • External Storage experienced a sharp increase in QPS and number of connections. Circuit breakers prevented the DBs from collapsing.
  • Memcached errors (# of events): Logstash was unable to quickly process the millions of events coming in.
  • WANObjectCache and SQLBlobStore keygroup: metadata is stored in Memcached. The high rate of miss.compute keys is a result of the lack of access to Memcached, where this metadata is stored.
  • MediaWiki p50 latencies: user-facing deployments, mw-web and mw-api-ext, were minimally affected by the incident.
  • Core databases followed a similar pattern to External Storage.
  • Kafka consumer lag measures how much log processing is delayed compared to real-time ingestion.

Edge (Varnish) / Memcached & mcrouter

  • Varnish traffic.
  • Memcached traffic had a sharp drop, since said deployments stopped having access to it.
  • mcrouter requests follow the same pattern as Memcached.

Timeline

All times in UTC.

12th March 2025

  • 10:58 UTC: alerts for high backend response times started coming in
  • 13:19 gmodena and brouberol deployed a patch in cirrus-streaming-updater to reduce SUP parallelism 1126988[deployment-charts]
  • 13:33 Amir1 reduces the concurrency of categoryMembershipChange job in changeprop-jobqueue 1127000[deployment-charts]
  • 13:44 effie re-runs Scap after reverting 1126607[puppet] and 1126650[deployment-charts]

13th March 2025

  • 11:15 effie begins a scap deployment (yes, again), to deploy 1127476[puppet] and 1127478[deployment-charts] for the mw-{api-int,parsoid,jobrunner} PHP 8.1 rollout
  • 11:34 effie performs a rolling restart on changeprop-jobqueue
  • 11:43 effie: rolling restarting mw-api-int
  • 11:44 Amir1 disables the categoryMembershipChange job in changeprop-jobqueue 1127500[deployment-charts]
  • 11:46 effie reverts 1127476[puppet] and 1127478[deployment-charts]
  • 11:58 effie redeploys mw-api-int to pick up the revert
  • 12:07 hnowlan bumps the number of replicas for mw-api-int
  • 12:28 effie redeploys mw-jobrunner to pick up the revert
  • 14:06 effie redeploys mw-parsoid to pick up the revert

Detection

Our working theories were:

  • High traffic from specific network/client
  • Scheduled jobs that went awry
  • Greedy changeprop-jobqueue jobs

All of the above are the usual suspects when we experience MediaWiki latency and/or database load. Detection was not easy here, for reasons explained in "What went poorly?". However, we were able to thoroughly sort out what went on after the fact.

Contributing Factors

Key factors that contributed to causing this incident, as well as to delaying the identification of its root cause, were: Scap, the php-fpm envvars.inc include file, and the logstash-prometheus pipeline.

What does Scap do?

Scap is our deployment tool for MediaWiki. Scap takes care of three very important steps:

  • Build and push MediaWiki images
  • Update helmfile-defaults with the latest image version tag per deployment and per release.
    • The image flavour (whether it is a 7.4 or an 8.1 image) of each deployment-release combination is defined in puppet, in kubernetes.yaml
  • Run helmfile on all deployments running MediaWiki

To provide a visual example, the latest Scap run updated the helmfile-defaults for the main release of mw-parsoid (aka mw-parsoid-main) as follows:

docker:
  registry: docker-registry.discovery.wmnet
main_app:
  image: restricted/mediawiki-multiversion:2025-03-18-101751-publish-81
mw:
  httpd:
    image_tag: restricted/mediawiki-webserver:2025-03-18-101751-webserver
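
For completeness, the image flavour mentioned above comes from puppet. A rough sketch of what such a kubernetes.yaml entry might look like follows; the key names here are assumptions for illustration only, not the actual puppet structure:

mw-parsoid:
  releases:
    main:
      # Hypothetical keys: the point is that the flavour (publish-74 vs
      # publish-81) is pinned per deployment and per release, and Scap
      # resolves it to a concrete image tag in helmfile-defaults.
      mw_flavour: publish-81
      web_flavour: webserver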

What is this envvars.inc include file in php-fpm?

We export two very important environment variables to php-fpm:

  • MCROUTER_SERVER: a static IP address defined in deployment-charts, essentially the memcached/mcrouter address; defaults to 127.0.0.1:11213
  • STATSD_EXPORTER_PROMETHEUS_SERVICE_HOST: populated and injected into pods by the k8s API; unset by default


In kubernetes, we put both variables in a ConfigMap called mediawiki-main-php-envvars, which is in turn mounted in the container as /etc/php/<X.Y>/fpm/env/envvars.inc.
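
For illustration, a minimal sketch of what that ConfigMap could look like; the addresses below are placeholders and the exact contents of envvars.inc are an assumption:

apiVersion: v1
kind: ConfigMap
metadata:
  name: mediawiki-main-php-envvars
data:
  # Hypothetical contents; the real values come from deployment-charts and
  # the k8s API, and are read by PHP-FPM via its include directive below.
  envvars.inc: |
    env[MCROUTER_SERVER] = "127.0.0.1:11213"
    env[STATSD_EXPORTER_PROMETHEUS_SERVICE_HOST] = "10.64.0.1"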

PHP-FPM reads the environment variables from a hardcoded include directory, whose exact location depends on the PHP version.

In the publish-74 container image, that would be:

[www]
listen = ${FCGI_URL}
<snip>
; MediaWiki helm chart via the php.envvars value.
include = /etc/php/7.4/fpm/env/*.inc

In the publish-81 container image, that would be:

[www]
listen = ${FCGI_URL}
<snip>
; MediaWiki helm chart via the php.envvars value.
include = /etc/php/8.1/fpm/env/*.inc

So, what was broken then?

Our PHP 8.1 mw-{api-int,parsoid,jobrunner} rollout consisted of two sister patches:

  • 1126607[puppet], switching the MediaWiki image flavour of mw-{api-int,parsoid,jobrunner} from publish-74 to publish-81
  • 1126650[deployment-charts], which in practice would change the mount location of envvars.inc from /etc/php/7.4/fpm/env/envvars.inc to /etc/php/8.1/fpm/env/envvars.inc
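
In values terms, the deployment-charts side of the change boils down to bumping the PHP version that builds the envvars.inc mount path. A hypothetical sketch, with assumed key names (the actual chart values may be structured differently):

php:
  # Assumed key name for illustration only. Moving from "7.4" to "8.1"
  # changes the mount path from /etc/php/7.4/fpm/env/envvars.inc to
  # /etc/php/8.1/fpm/env/envvars.inc.
  version: "8.1"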

We performed a Scap deployment to deploy the above. Our expectation was that after the deployment we would have:

  • mw-{api-int,parsoid,jobrunner} running the mediawiki-multiversion publish-81 image, and
  • The mediawiki-main-php-envvars ConfigMap mounted as /etc/php/8.1/fpm/env/envvars.inc

Due to an unexpected Scap behavior, explained below, what was actually rolled out in production was:

  • mw-{api-int,parsoid,jobrunner} running the mediawiki-multiversion publish-74 image
  • The mediawiki-main-php-envvars ConfigMap, mounted at /etc/php/8.1/fpm/env/envvars.inc

As the PHP 7.4 image (publish-74) was in use, the includes under /etc/php/7.4/fpm/env/*.inc contained only default values, so they were not useful.

How did we break this?

During the scheduled deployment, the Scap command was executed with the flag -Dbuild_mw_container_image:False. This flag is commonly utilised by Site Reliability Engineers (SREs) as, in most cases, our changes do not necessitate rebuilding container images. Specifically, transitioning the main release of mw-{api-int,parsoid,jobrunner} to the publish-81 image would not require an image rebuild, as we already have publish-81 built and cached.

However, this transition would necessitate updates to the helmfile-defaults of the main releases for mw-{api-int,parsoid,jobrunner}, so as to replace the latest -publish-74 image tag with the latest -publish-81 one. Unfortunately, it was not immediately apparent that using the flag -Dbuild_mw_container_image:False would additionally cause Scap to skip the helmfile-defaults update.
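
Compared with the helmfile-defaults example shown earlier, the skipped update left the image reference on the previous publish-74 build, while the chart change still moved envvars.inc to the PHP 8.1 path. Schematically (the tag below is a placeholder, not the actual build):

main_app:
  # Stale value left behind by the skipped helmfile-defaults update;
  # the placeholder stands in for whatever the last 7.4 build tag was.
  image: restricted/mediawiki-multiversion:<previous-build>-publish-74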

Conclusions

Never deploy to production while listening to Bonnie Tyler, especially Holding Out for a Hero.

"What" dashboards vs "Why" dashboards

Many of the dashboards we have assigned to alerts point to a dashboard exhibiting the "What". For instance, our "Not enough idle PHP-FPM workers for Mediawiki" alert points to a dashboard where we can see that there are many busy PHP workers, and potentially that we are serving some errors. Those alone give very little information as to *what* may have caused this.

It would have made a difference if we had alerts linked to dashboards showing graphs of components that depend on or impact each other. For example, high MediaWiki latency may be related to high external traffic or increased edge cache misses. In a similar manner, SQLBlobStore keys are directly related to External Storage.

While we have Runbooks attached to several alerts, we could consider working on "Why" dashboards, which would not only provide a more holistic overview of a group of components, but also offer the relevant links that could work as breadcrumbs.

We must learn to coordinate incidents better

During the incident, we had some of our best engineers available. However, ensuring effective coordination among all responders proved to be tricky.

  • The Incident Coordinator found it difficult to coordinate responders effectively during the incident, as it wasn't clear who was working on what and whether multiple people were independently following the same avenues of debugging.
  • Furthermore, there was some ambiguity as to when people were joining the Incident Response and when they were leaving, making it difficult for the Incident Coordinator to track active participants and areas of focus.
  • The channels were somewhat noisy, with input being shared, at times, without context, while also interspersed with discussions not directly related to the IR process. This put an extra burden on the IC to effectively consolidate and interpret key information.
  • Our assessment of the external user impact was not comprehensive. While there was a perception that the situation was critical, external services were minimally affected. A better estimation of the impact would help align our response appropriately.
  • We need to be able to interface with MediaWiki developers during incidents and have their perspective at hand.


Are all of our Alerts serving us well?

Our alerts were not as useful as we would like. First and foremost, the alert that would have pointed us in the right direction, MediaWikiMemcachedHighErrorRate, resolved almost immediately. On March 12th, during the incident, we had 78 alerts firing, albeit not for 78 different things. There is a high chance that even if MediaWikiMemcachedHighErrorRate had fired, we might have seen it as a side effect and thus not acted on it.

We are not using trace.wikimedia.org during incidents

Good work has been put into https://trace.wikimedia.org/; however, we didn't utilise it during this incident, though it may have been useful.

What went well

We had a lot of people available to help, each specialising in different areas.

What went poorly?

The technical root cause of this incident is not the primary concern. What is more relevant are the reasons that prevented us from detecting, even during the March 12th incident, that three of our MediaWiki deployments were operating without access to the Memcached cluster. On March 12th, we assumed that the Scap rollout was not the problem. By March 13th, we were certain that it was the Scap rollout after all, but it was challenging to understand how it contributed to the incident.

Where did we get lucky?

  • There was little impact to external users

Actionables

Scorecard

Incident Engagement ScoreCard
Question Answer (yes/no) Notes
People Were the people responding to this incident sufficiently different than the previous five incidents?
Were the people who responded prepared enough to respond effectively
Were fewer than five people paged?
Were pages routed to the correct sub-team(s)?
Were pages routed to online (business hours) engineers?  Answer “no” if engineers were paged after business hours.
Process Was the "Incident status" section atop the Google Doc kept up-to-date during the incident?
Was a public wikimediastatus.net entry created?
Is there a phabricator task for the incident?
Are the documented action items assigned?
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence?
Tooling To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented.
Were the people responding able to communicate effectively during the incident with the existing tooling?
Did existing monitoring notify the initial responders?
Were the engineering tools that were to be used during the incident, available and in service?
Were the steps taken to mitigate guided by an existing runbook?
Total score (count of all “yes” answers above)