Incidents/2025-03-12 ExternalStorage Database Cluster Overload
document status: final
Summary
| Incident ID | 2025-03-12 ExternalStorage Database Cluster Overload | Start | 2025-03-12 10:44:00, 2025-03-13 11:15:00 |
|---|---|---|---|
| Task | T389498 | End | 2025-03-12 13:44:00, 2025-03-13 14:15:00 |
| People paged | 2 (oncallers) | Responder count | 9: (round 1) volans, jynus, joe, claime (round 2) Amir, brouberol, gehel, gmodena, (round 3) Amir, marostegui, joe, effie, hugh |
| Coordinators | 3: effie, jynus, volans | Affected metrics/SLOs | No relevant SLOs exist |
| Impact | We had minimal user-facing impact: global traffic had a ~0.03-0.04% error rate, edits were affected for a very short period of time, and p50 and p75 latency for Web and External API were not affected. | | |
TL;DR: Within a span of two days, we observed high latency and database errors on mw-{api-int,parsoid,jobrunner}, all of which are deployments supporting MediaWiki for non-user-facing purposes. Our External Storage cluster experienced a significant load increase and a high number of connections. Notably, user-facing traffic was only minimally impacted by this incident.
During the scheduled PHP 8.1 Scap rollout for mw-{api-int,parsoid,jobrunner} on March 12th & 13th 2025, we found ourselves with said deployments running without access to the Memcached cluster and without a valid prometheus StatsD address.
Without Memcached access, MediaWiki had to query the databases for objects that would typically be cached, leading to increased database load. At the time, MediaWiki was pushing stats to both Graphite and prometheus-statsd-exporter, so the lack of a valid prometheus StatsD address had no impact.
During the incidents, we observed high load and a sharp increase in connections to the External Storage cluster, while the majority of MediaWiki errors (channel: error) visible at the time were database related. The External Storage hosts are used to store the compressed text content of wiki page revisions.
However, there was an overwhelming volume of memcached error events (channel: memcached, ~2 million/min), which logstash was unable to process quickly enough, resulting in an observability gap where:
- Logstash itself was probably not able to render the channel:memcached logstash dashboard until after the incident, once it had consumed all of its backlog.
- Delays in the logging pipeline caused the MediaWikiMemcachedHighErrorRate alert to go missing, while the corresponding Grafana graph displayed no data during the incidents.
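To illustrate why a log-pipeline backlog can silence an alert: rules like this one are evaluated over metrics derived from the logging pipeline, so if logs arrive hours late the metric stays flat in real time and the alert never fires when it is needed. The rule below is only a sketch of that mechanism; it is not the actual MediaWikiMemcachedHighErrorRate definition, and the metric name and threshold are invented.

groups:
  - name: mediawiki-memcached-error-rate-sketch
    rules:
      # Hypothetical rule for illustration only: the metric name and threshold are placeholders.
      # If the series feeding this expression lags behind real time, the computed rate
      # stays near zero and the alert remains silent during the incident itself.
      - alert: MediaWikiMemcachedHighErrorRate
        expr: sum(rate(mediawiki_memcached_error_events_total[5m])) > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: High rate of MediaWiki memcached errors (illustrative sketch)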
TODO: Assess actual user impact based on traffic metrics.
Graphs (not reproduced here): Edge (Varnish); Memcached & mcrouter.
Timeline
All times in UTC.
12th March 2025
- 10:44 effie runs scap to deploy 1126607 [puppet] and 1126650 [deployment-charts] for the mw-{api-int,parsoid,jobrunner} PHP 8.1 rollout
- 10:58 alerts for high backend response times start coming in
- 13:19 gmodena and brouberol deploy a patch in cirrus-streaming-updater to reduce SUP parallelism 1126988 [deployment-charts]
- 13:33 Amir1 reduces the concurrency of the categoryMembershipChange job in changeprop-jobqueue 1127000 [deployment-charts]
- 13:44 effie re-runs scap after reverting 1126607 [puppet] and 1126650 [deployment-charts]
13th March 2025
- 11:15 effie begins a scap deployment (yes, again) to deploy 1127476 [puppet] and 1127478 [deployment-charts] for the mw-{api-int,parsoid,jobrunner} PHP 8.1 rollout
- 11:34 effie performs a rolling restart on changeprop-jobqueue
- 11:43 effie performs a rolling restart on mw-api-int
- 11:44 Amir1 disables the categoryMembershipChange job in changeprop-jobqueue 1127500 [deployment-charts]
- 11:46 effie reverts 1127476 [puppet] and 1127478 [deployment-charts]
- 11:58 effie redeploys mw-api-int to pick up the revert
- 12:07 hnowlan bumps the number of replicas for mw-api-int
- 12:28 effie redeploys mw-jobrunner to pick up the revert
- 14:06 effie redeploys mw-parsoid to pick up the revert
Detection
Our working theories were:
- High traffic from specific network/client
- Scheduled jobs that went awry
- Greedy changeprop-jobqueue jobs
All of the above are the usual suspects when we experience MediaWiki latency and/or database load. Detection was not easy here, for reasons explained in "What went poorly". However, we were able to thoroughly sort out what went on after the fact.
Contributing Factors
Key factors that contributed to causing this incident, as well as to delaying its root-cause analysis, were: Scap, the php-fpm envvars.inc include file, and the logstash-prometheus pipeline.
What does Scap do?
Scap is our deployment tool for MediaWiki. Scap takes care of three very important steps:
- Builds and pushes MediaWiki images.
- Updates helmfile-defaults with the latest image version tag per deployment and per release. The image flavour (whether it is a 7.4 or an 8.1 image) of each deployment-release combination is defined in puppet, in kubernetes.yaml.
- Runs helmfile on all deployments running MediaWiki.
To provide a visual example, the latest scap run updated the helmfile-defaults for the main release of mw-parsoid (aka mw-parsoid-main) as follows:
docker:
  registry: docker-registry.discovery.wmnet
main_app:
  image: restricted/mediawiki-multiversion:2025-03-18-101751-publish-81
mw:
  httpd:
    image_tag: restricted/mediawiki-webserver:2025-03-18-101751-webserver
What is this envvars.inc include file in php-fpm?
We export two very important environment variables to php-fpm:
- MCROUTER_SERVER: a static IP address defined in deployment-charts, essentially the memcached/mcrouter address; defaults to 127.0.0.1:11213
- STATSD_EXPORTER_PROMETHEUS_SERVICE_HOST: populated and injected into pods by the k8s API; unset by default
In kubernetes, we put both variables in a ConfigMap called mediawiki-main-php-envvars, which we in turn mount in the container as /etc/php/<X.Y>/fpm/env/envvars.inc. PHP-FPM reads the environment variables from a hardcoded include directory, whose exact location depends on the PHP version.
In the publish-74 container image, that would be:
[www]
listen = ${FCGI_URL}
<snip>
; MediaWiki helm chart via the php.envvars value.
include = /etc/php/7.4/fpm/env/*.inc
In the publish-81 container image, that would be:
[www]
listen = ${FCGI_URL}
<snip>
; MediaWiki helm chart via the php.envvars value.
include = /etc/php/8.1/fpm/env/*.inc
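For completeness, here is a minimal sketch of what the mounted envvars.inc could look like once rendered from the ConfigMap; the addresses below are made-up examples, not the production values:

; Sketch only: real contents come from the mediawiki-main-php-envvars ConfigMap.
; php-fpm env[] pool directives expose these variables to the PHP workers.
env[MCROUTER_SERVER] = 10.64.0.10:11213
env[STATSD_EXPORTER_PROMETHEUS_SERVICE_HOST] = 10.64.72.10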
So, what was broken then?
Our PHP 8.1 mw-{api-int,parsoid,jobrunner} rollout consisted of two sister patches:
- 1126607 [puppet], switching the MediaWiki image flavour of mw-{api-int,parsoid,jobrunner} from publish-74 to publish-81
- 1126650 [deployment-charts], which in practice would change the mount location of envvars.inc from /etc/php/7.4/fpm/env/envvars.inc to /etc/php/8.1/fpm/env/envvars.inc
We performed a scap deployment to deploy the above. Our expectation was that after the deployment we would have:
- mw-{api-int,parsoid,jobrunner} running the mediawiki-multiversion publish-81 image, and
- the mediawiki-main-php-envvars ConfigMap mounted as /etc/php/8.1/fpm/env/envvars.inc
Due to an unexpected Scap behavior, explained below, what was actually rolled out in production was:
- mw-{api-int,parsoid,jobrunner} running the mediawiki-multiversion publish-74 image
- the mediawiki-main-php-envvars ConfigMap mounted at /etc/php/8.1/fpm/env/envvars.inc
As the PHP 7.4 image (publish-74) was in use, the includes under /etc/php/7.4/fpm/env/*.inc contained only default values, so they were of no use.
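To make the mismatch concrete, the effective state can be sketched roughly as follows (the value names are simplified for illustration and do not reflect the exact chart schema):

main_app:
  # helmfile-defaults was never updated, so the deployments kept running the 7.4 image
  image: restricted/mediawiki-multiversion:<previous-tag>-publish-74
php:
  envvars:
    # 1126650 [deployment-charts] moved the envvars.inc mount to the PHP 8.1 path,
    # a directory the 7.4 image's php-fpm never includes
    mountPath: /etc/php/8.1/fpm/env/envvars.inc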
How did we break this?
During the scheduled deployment, the Scap command was executed with the flag -Dbuild_mw_container_image:False. This flag is commonly utilised by Site Reliability Engineers (SREs) as, in most cases, our changes do not necessitate rebuilding container images. Specifically, transitioning the main release of mw-{api-int,parsoid,jobrunner} to the publish-81 image would not require an image rebuild, as we already had publish-81 built and cached.
However, this transition would necessitate updates to the helmfile-defaults of the main releases for mw-{api-int,parsoid,jobrunner}, so as to replace the latest -publish-74 image tag with the latest -publish-81 one. Unfortunately, it was not immediately apparent that using the flag -Dbuild_mw_container_image:False would additionally cause scap to skip the helmfile-defaults update.
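For reference, the invocation was of roughly this shape; the subcommand shown is an assumption for illustration, and only the -D flag comes from this report:

# Sketch only: skipping the image build also (unexpectedly) skipped the helmfile-defaults update.
scap sync-world -Dbuild_mw_container_image:False 'Switch mw-{api-int,parsoid,jobrunner} main release to PHP 8.1'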
Conclusions
Never deploy to production while listening to Bonnie Tyler, especially Holding Out for a Hero.
"What" dashboards vs "Why" dashboards
Many of the dashboards we have assigned to alerts point to a dashboard exhibiting the "What". For instance, our "Not enough idle PHP-FPM workers for MediaWiki" alert points to a dashboard where we can see that there are many busy PHP workers, and potentially that we are serving some errors. Those alone give very little information as to *what* may have caused this.
It would have made a difference if we had alerts linked to dashboards showing graphs of components that depend on or impact each other. For example, high MediaWiki latency may be related to high external traffic or increased edge cache misses. In a similar manner, SQLBlobStore keys are directly related to External Storage.
While we have Runbooks attached to several alerts, we could consider working on "Why" dashboards, which would not only provide a more holistic overview of a group of components, but also include relevant links that could work as breadcrumbs.
We must learn to coordinate incidents better
During the incident, we had some of our best engineers available. However, ensuring effective coordination among all responders proved to be tricky.
- The Incident Coordinator found it difficult to coordinate responders effectively during the incident, as it wasn't clear who was working on what and whether multiple people were independently following the same avenues of debugging.
- Furthermore, there was some ambiguity as to when people were joining the Incident Response and when they were leaving, making it difficult for the Incident Coordinator to track active participants and areas of focus.
- The channels were somewhat noisy, with input being shared, at times, without context, while also interspersed with discussions not directly related to the IR process. This put an extra burden on the IC to effectively consolidate and interpret key information.
- Our assessment of the external user impact was not comprehensive. While there was a perception that the situation was critical, external services were minimally affected. A better estimation of the impact would help align our response appropriately.
- We need to be able to interface with MediaWiki developers during incidents and have their perspective at hand.
Are all of our Alerts serving us well?
Our alerts were not as useful as we would have liked. First and foremost, the alert that would have pointed us in the right direction, MediaWikiMemcachedHighErrorRate, either never fired or resolved immediately. On March 12th, during the incident, we had 78 alerts firing, albeit not for 78 different things. There is a high chance that even if MediaWikiMemcachedHighErrorRate had fired, we might have seen it as a side effect, and thus not acted on it.
We are not using trace.wikimedia.org during incidents
Good work has been put into https://trace.wikimedia.org/; however, we didn't utilise it during this incident, though it may have been useful.
What went well
We had a lot of people available to help, each specialising in different areas.
What went poorly?
The technical root cause of this incident is not the primary concern. What is more relevant are the reasons that prevented us from detecting, even during the March 12th incident, that three of our MediaWiki deployments were operating without access to the Memcached cluster. On March 12th, we assumed that the scap rollout was not the problem. By March 13th, we were certain that it was the scap rollout after all, but it was challenging to understand how it had contributed to the incident.
Where did we get lucky?
- There was little impact to external users
Links to relevant documentation
- …
Actionables
- Reduce the number of messages sent through channel:memcached during failures
- Add helm rollback functionality to scap
- Should scap be able to update helmfile-defaults when -Dbuild_mw_container_image:False is passed?
- Consider removing envvars.inc from MediaWiki images
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | | |
| | Were the people who responded prepared enough to respond effectively? | | |
| | Were fewer than five people paged? | | |
| | Were pages routed to the correct sub-team(s)? | | |
| | Were pages routed to online (business hours) engineers? Answer "no" if engineers were paged after business hours. | | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | | |
| | Was a public wikimediastatus.net entry created? | | |
| | Is there a phabricator task for the incident? | | |
| | Are the documented action items assigned? | | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer "no" if there are open tasks that would prevent this incident or make mitigation easier if implemented. | | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | | |
| | Did existing monitoring notify the initial responders? | | |
| | Were the engineering tools that were to be used during the incident available and in service? | | |
| | Were the steps taken to mitigate guided by an existing runbook? | | |
| | Total score (count of all "yes" answers above) | | |