Incidents/2020-05-01 vc-link-failure

document status: final

Summary

The virtual chassis link between asw2-d1-eqiad and asw2-d8-eqiad failed in two steps.

First on Friday the 1st where it was causing packet loss for hosts on D1 without any other signs of failures.

This packet loss caused connectivity issues between MediaWiki appservers (and API servers at a lower scale) and memcache servers. Resulting in a significant increase of MediaWiki exceptions being served to the users.

This got worked around for the weekend by depooling D1 servers. At this point the cause of the packet loss was unknown.

The day after, on Saturday, hosts in D8 started seeing the same issues as in D1. This time the switches were logging errors about the D1-D8 link. Disabling the link solved the issues.

Impact: This had little to no effect on traffic (Varnish_HTTP_Total), error rates (ATS availability) and latencies (Navtiming requests) for anonymous users. A increase in error rates (~1% of requests had errors, with a short spike to ~7.5%, Appserver errors) and an increase in tail latency (around plus 100%-150%, Appserver p95) has been observed for logged in users, though.

Timeline

All times in UTC. Friday 1st:

05:21 MW exceptions starts being reported on #wikimedia-operations <+icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops OUTAGE BEGINS
05:32 <+icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
05:33 < marostegui> wow those fatals really increased
05:33 <+icinga-wm> PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5009 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
Joe sends the following patches: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/593728/ https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/593727/
7:40 Arzhel checks switches (especially D/D1) nothing out of ordinary
Large but steady increase of TCP retransmits on mc1021/1029 - https://grafana.wikimedia.org/d/000000365/network-performances?panelId=15&fullscreen&orgId=1&from=now-7d&to=now&var-server=mc1029&var-datasource=eqiad%20prometheus%2Fops
8:29 <elukey> _joe_ there are some things that are off, namely a lot of traffic patterns showing spikes every 10m, I am wondering if in some twisted way the 10m TTL of the gutter is somehow exacerbating this problem
8:37 <elukey> so mw1331 for example doesn't show tkos, I think it is only the servers in D1
8:43 <_joe_> I will bring back mw1409 and mw1407 in the pool, and we can depool those servers in D1
8:54 <_joe_> !log depooled all servers in the app pool in rack D1 OUTAGE WORKED AROUND
8:55 +icinga-wm> IRC echo bot RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
19:57 rzl depools api servers in D1 (mw1356-1362) at _joe_’s suggestion, in response to flapping alerts like “PROBLEM - PHP7 rendering on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds”

Saturday 2nd:

(Overnight) Wall of flapping PROBLEM - PHP7 rendering on mwXXXX is CRITICAL: CRITICAL - Socket timeout after 10 seconds OUTAGE RESURFACE
06:42 Giuseppe and Luca start investigating
06:52 Arzhel starts investigating
07:08 <XioNoX> asw2-d-eqiad> request virtual-chassis vc-port delete pic-slot 1 port 0 member 1 OUTAGE ENDS
07:49 <oblivian@cumin1001> conftool action : set/pooled=yes; selector: name=mw13(49|5[0-9]|6[0-2])\.eqiad\.wmnet

Detection

<+icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
<+icinga-wm> PROBLEM - PHP7 rendering on mwXXXX is CRITICAL: CRITICAL - Socket timeout after 10 seconds

Did the appropriate alert(s) fire? Yes
Was the alert volume manageable? Yes
Did they point to the problem with as much accuracy as possible? No

The root cause didn't generate any logs at first, and when it did, those logs didn't trigger alerts.

Conclusions

Packet loss through Virtual Chassis Fabric are difficult to pinpoint
Higher layers monitoring worked as expected
From history, this failure scenario has a low probability of happening, and is now documented

What went well?

We had enough capacity to depool impacted mediawiki hosts
Once the failure generated logs, the root cause and fix were quick to identify and apply
SREs quickly identified D1 then D8 as common factor

What went poorly?

The first VC link failure didn't generate any switch side errors
The issue started happening on a Friday and re-appeared on a Saturday
The issue would not have been noticed if SREs didn't look at alerts on a weekend

Where did we get lucky?

SREs looked at alerts on a weekend

How many people were involved in the remediation?

4 SREs

Links to relevant documentation

Actionables

Either re-cable or cleanup disabled cable - https://phabricator.wikimedia.org/T251663
Add log alerting for VC link failure - https://phabricator.wikimedia.org/T251663