
Incidents/2025-02-17 maps

From Wikitech

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2025-02-17 maps
Start: 2025-02-17 14:25:00
End: 2025-02-17 15:15:00
Task: T386648
People paged: 2 (on-calls)
Responder count: 1
Coordinators: None
Affected metrics/SLOs: No relevant SLOs exist
Impact: Approximately 316,000 requests to maps.wikimedia.org resulted in HTTP 50X errors. Users saw the impact when hitting the maps domain directly, or indirectly when viewing maps embedded in various projects.

The SRE team was moving the Kartotherian service from bare metal nodes to Kubernetes. The service is the frontend responsible for maps.wikimedia.org traffic, rendering maps in various formats with optional additional data/shapes overlaid (for example, a path from A to B on the map of a given city).

The service was deployed on both bare metal and Kubernetes, and Luca (SRE Infra Foundations) was shifting traffic away from the bare metal nodes to progress the migration. Due to a missing setting on the Kubernetes cluster, Kartotherian running there wasn't able to respond to any HTTP request, so shifting traffic away from the bare metal nodes (which were serving traffic properly) put increased pressure on the remaining capacity, ending in overload and timeouts.

Timeline

Link to the SAL: https://sal.toolforge.org/production?p=0&q=elukey&d=2025-02-17

All times in UTC.

  • 12:49 Luca pools in 4 Kubernetes workers behind the Kartotherian's load balancer (LVS svc IP).
  • 14:09 Luca deploys Kartotherian on Kubernetes to fix some settings, also slightly increasing the overall capacity.
  • 14:15 Luca depools 3 maps bare metal nodes (maps1005, maps2005 and maps2006). Note: maps1006 was already depooled due to a previous load test (it was never pooled back in, which was known).
  • 14:25 OUTAGE BEGINS (remaining bare metal hosts in EQIAD under pressure and failing health checks from the load balancer).
  • 14:56 Luca repools maps1005, maps1006, maps2005 and maps2006 to the load balancer.
  • 15:13 Luca roll-restarts Kartotherian on the maps1* hosts (all bare metal).
  • 15:15 OUTAGE ENDS
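The pool/depool steps in the timeline are performed with conftool (confctl) in WMF production. A dry-run sketch of the 14:56 repool step; the "echo" prefix prints the commands instead of running them, and the selector pattern is illustrative (real selectors usually also match datacenter/cluster/service):

```shell
# Print (dry run) the confctl invocations that would repool the four
# maps hosts. Remove the leading "echo" inside the function to act on
# the real pool state; the name selector here is a simplified example.
repool() {
  for host in "$@"; do
    echo confctl select "name=${host}.*" set/pooled=yes
  done
}

repool maps1005 maps1006 maps2005 maps2006
```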

Detection

Several alerts fired, including a page for Kartotherian:

PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kartotherian-k8s-ssl_6543: Servers maps1007.eqiad.wmnet are marked down but pooled: kartotherian-ssl_443: Servers maps1007.eqiad.wmnet, maps1010.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal

FIRING: [3x] ATSBackendErrorsHigh: ATS: elevated 5xx errors from kartotherian.discovery.wmnet #page

It took some minutes for the bare metal servers to reach the tipping point, but we got all the alarms needed in a timely manner.

Conclusions

The main issue was that the LVS loopback IP for Kartotherian wasn't deployed to the Kubernetes nodes, even though the documentation for setting up a new service indicates it clearly; IP packets looped between LVS and the Kubernetes nodes until they reached their max TTL, ending in connection errors. Due to how ATS manages TCP connection pools on the CDN (preferring long-lived connections and discarding failed ones), most of the traffic was handled by the bare metal nodes and the impact was limited in time (if ATS didn't work this way, we'd have seen impact from 12:49 onward).
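With LVS in direct-routing mode, each realserver must have the service VIP bound to its loopback interface, or it cannot accept packets addressed to that IP. A minimal sketch of the kind of pre-pooling check that would have caught this, fed simulated `ip -o addr show dev lo` output (the VIP below is hypothetical, not Kartotherian's real one):

```shell
# check_vip: read `ip -o addr show dev lo` output on stdin and report
# whether the given service VIP is bound to loopback. A node missing
# the VIP (this incident's failure mode) cannot answer LVS traffic.
check_vip() {
  grep -q "inet $1/" && echo "VIP present" || echo "VIP missing"
}

# Correctly configured node (simulated output):
echo "1: lo inet 10.2.2.13/32 scope global lo" | check_vip 10.2.2.13  # → VIP present
# Node missing the VIP, as the Kubernetes workers were:
echo "1: lo inet 127.0.0.1/8 scope host lo" | check_vip 10.2.2.13     # → VIP missing
```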

The missing setting on the Kubernetes nodes was difficult to detect, since this use case is special: it was not a regular new-service setup, as we mixed bare metal and Kubernetes nodes to smooth the transition. When setting up a new service it is more straightforward to see whether any traffic is mishandled, since the backend capacity is all of the same kind, not a mix. More in-depth testing and review could surely have prevented this, but these corner cases will be gone in the near future, so hopefully we'll have fewer issues like this one.

What went well?

  • Luca was able to spot the user impact very soon after it started, and traffic was restored to its normal flow relatively quickly.

What went poorly?

The complete list would probably require a separate page, but the major points are:

  • Luca was probably over-confident in Kartotherian on Kubernetes due to the tests done in https://phabricator.wikimedia.org/T384530, where the Kubernetes pods were tested separately without showing any connectivity issue. The problem was hiding elsewhere, in a component still in the critical request path from user to backend service, which should have been tested before pooling in any traffic.
  • Kartotherian, and the maps cluster in general, is not a well-known cluster, and SREs often struggle to understand its failure modes. Right after seeing errors, the correct procedure would have been to roll back to the previous state immediately; instead, some time was spent debugging, since it seemed that the bare metal nodes were at fault rather than the Kubernetes capacity.
  • This is probably the most important one. Part of the effort of adapting Kartotherian for Kubernetes was creating a Prometheus statsd-exporter configuration to build a more precise and complete Grafana dashboard for the service. The dashboard clearly showed no traffic to the pods after the Kubernetes workers were pooled into the load balancer, so nothing else should have been done (like depooling bare metal capacity) until this was resolved. Luca thought it was a mistake in the statsd config, or that only a tiny amount of traffic was hitting the pods, and chose to proceed anyway to see the difference with more traffic. The big lesson learned is to always check the expected/current traffic for a service before touching anything about it, rather than assuming its volume without proof.
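The last lesson can be encoded as a guard step: before depooling known-good capacity, read the request rate the new backends actually receive and refuse to proceed if it is zero. A sketch with a hard-coded value simulating what the Grafana panel showed during the incident; in practice the rate would come from the Prometheus metrics behind the dashboard:

```shell
# pod_rps simulates the per-second request rate the dashboard reported
# for the Kubernetes pods after they were pooled in (it was zero).
pod_rps=0

if [ "$pod_rps" -eq 0 ]; then
  echo "pods receive no traffic: investigate before depooling bare metal"  # → printed in this scenario
else
  echo "pods serving traffic: safe to continue shifting"
fi
```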

Where did we get lucky?

As written above, the CDN's configuration avoided further impact to external users. ATS (which connects the CDN to backend services like Kartotherian) keeps a pool of TCP connections to backend nodes, preferring long-lived connections, so it naturally tends to discard connection failures. When the Kubernetes workers were added to handle live traffic, they never responded successfully to ATS's TCP connection attempts, so they did not end up in the active pool and hence did not impact live traffic.

In this case I don't think extra documentation is needed; everything is already on Wikitech and in the alerts. It was simply an operator error.

Actionables

I am inclined not to add any specific actionable, since this is a special use case and it shouldn't happen again (services on bare metal nodes are no longer allowed; we are focusing on Kubernetes only). The best action item is to spread knowledge about what happened, to warn more SREs about things to check/review when making a change to a live service.

Scorecard

Incident Engagement ScoreCard
Question Answer (yes/no) Notes
People Were the people responding to this incident sufficiently different than the previous five incidents? no idea
Were the people who responded prepared enough to respond effectively yes
Were fewer than five people paged? yes
Were pages routed to the correct sub-team(s)? yes
Were pages routed to online (business hours) engineers?  Answer “no” if engineers were paged after business hours. yes
Process Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? no
Was a public wikimediastatus.net entry created? no
Is there a phabricator task for the incident? yes
Are the documented action items assigned? no
Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? yes
Tooling To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. yes
Were the people responding able to communicate effectively during the incident with the existing tooling? yes
Did existing monitoring notify the initial responders? yes
Were the engineering tools that were to be used during the incident, available and in service? yes
Were the steps taken to mitigate guided by an existing runbook? no
Total score (count of all “yes” answers above) 10