Incidents/2024-09-16 logstash unavailability

document status: draft

Summary

Incident metadata (see Incident Scorecard)
Incident ID: 2024-09-16 logstash unavailability
Start: 2024-09-16 15:34:16
End: 2024-09-16 16:13:31
Task:
People paged: 3
Responder count: 4
Coordinators: 1
Affected metrics/SLOs:
Impact: The Logstash UI was unavailable (e.g., users would see 502 responses).

The upgrade-phatality.sh script, which scap runs when deploying the phatality plugin to logstash hosts, checked whether the plugin was already installed on the host using a command wrapped in an unnecessary sudo (prior to this change). Because that command was not allowlisted in sudoers, the check failed, and the script went on to install the plugin without first removing the old version.

This resulted in a duplicate registration of the plugin, which the opensearch-dashboards service treated as a fatal error when scap subsequently restarted it. As a result, no backends were available to serve requests to https://logstash.wikimedia.org.
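For illustration, the shape of the broken check is sketched below. This is a hypothetical reconstruction rather than the actual contents of upgrade-phatality.sh: the plugin-binary path is taken from the remediation commands in the timeline, while the list subcommand, variable names, and install invocation are assumptions.

  #!/bin/bash
  # Hypothetical sketch of the failure mode in upgrade-phatality.sh (not the real script).
  PLUGIN_BIN=/usr/share/opensearch-dashboards/bin/opensearch-dashboards-plugin
  PLUGIN_ZIP="$1"   # path to the newly built phatality plugin zip (assumed argument)

  # The "is it already installed?" check was wrapped in sudo. The command was not
  # allowlisted in sudoers, so sudo refused it, the pipeline returned non-zero, and
  # the script behaved as if the plugin were not installed.
  if sudo "$PLUGIN_BIN" list | grep -q '^phatality'; then
      sudo "$PLUGIN_BIN" remove phatality   # never reached during the incident
  fi

  # The install then ran on top of the existing copy; on the scap-triggered restart,
  # opensearch-dashboards treated the duplicate registration as a fatal error.
  sudo "$PLUGIN_BIN" install "file://${PLUGIN_ZIP}"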

Timeline

All times in UTC.

  • 15:34:00 phatality plugin deployment starts
  • 15:34:16 phatality plugin deployment ends OUTAGE BEGINS
  • 15:36:28 first alert for logstash unavailability: PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana7_443: Servers logstash2032.codfw.wmnet, logstash2030.codfw.wmnet, logstash2024.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
  • 15:36:57 first paging alert for logstash unavailability: FIRING: [2x] ProbeDown: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
  • 15:42:55 herron discovers that the opensearch-dashboards service on logstash1023 is failing due to duplicate registration of the phatality plugin: Sep 16 15:42:14 logstash1023 opensearch-dashboards[2609202]: {"type":"log","@timestamp":"2024-09-16T15:42:14Z","tags":["fatal","root"],"pid":2609202,"message":"Error: Plugin with id \"phatality\" is already registered!\n    at MergeMapSubscriber.project
  • 15:49:00 herron removes the phatality plugin on logstash1023 and restarts opensearch-dashboards: !log logstash1023:/usr/share/opensearch-dashboards/bin# /usr/share/opensearch-dashboards/bin/opensearch-dashboards-plugin remove phatality
  • logstash1023 recovers and logstash1032 is depooled for debugging. Note: At this point, users begin to see recovery, as a single backend is now healthy in the primary site of the underlying service (logstash.discovery.wmnet).
  • 16:00:37 rzl applies the same fix to all remaining impacted hosts: !log rzl@cumin1002:~$ sudo cumin 'O:logging::opensearch::collector and not logstash1032.eqiad.wmnet' '/usr/share/opensearch-dashboards/bin/opensearch-dashboards-plugin --allow-root remove phatality'
  • 16:12:27 rzl restarts the opensearch-dashboards service on all remaining impacted hosts: !log rzl@cumin1002:~$ sudo cumin logstash[2023,2025,2030-2032].codfw.wmnet,logstash[1025,1030,1032].eqiad.wmnet 'systemctl restart opensearch-dashboards'
  • 16:12:39 first recovery notification: RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
  • 16:13:31 last recovery notification: RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal OUTAGE ENDS
  • 18:34:00 dancy deploys a new version of phatality successfully, verifying the fix
Probe availability for logstash in eqiad and codfw during the incident. Note that eqiad (active site) shows recovery earlier, once a single healthy backend host was available.
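Taken together, the per-host remediation amounts to the following, reconstructed from the timeline commands above. The plugin and systemd unit names come from the timeline; the status-check URL and port are assumptions and should be verified against the host before relying on them.

  # Per-host remediation, reconstructed from the timeline (run as root).
  /usr/share/opensearch-dashboards/bin/opensearch-dashboards-plugin --allow-root remove phatality
  systemctl restart opensearch-dashboards
  # Optional sanity check before repooling; port 5601 and the /api/status path are
  # assumptions about the local opensearch-dashboards configuration.
  curl -sf -o /dev/null http://localhost:5601/api/status && echo "opensearch-dashboards is up"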

Detection

The issue was first detected by PyBal backends health check alerts starting at 15:36:28, followed shortly after by the first and only paging alert for service-catalog probe failures at 15:36:57:

FIRING: [2x] ProbeDown: Service kibana7:443 has failed probes (http_kibana7_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#kibana7:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown

TODO: Were there dots that somehow connected this to the earlier phatality deployment other than temporal correlation?

Conclusions

What went well?

  • Monitoring surfaced the issue within minutes of the trigger deployment (161 seconds elapsed from end-of-deploy to page).
  • The alert was quickly correlated with the recent phatality plugin deployment.

What went poorly?

  • TODO: Something seems off here in terms of change validation: it is surprising that the broken deploy spread to all of the logstash hosts, even though it should have been clear that the service fails as soon as a single host is updated. Is this a tooling issue related to T374880?
    • Stated differently, the problem is not only that this rarely-used deployment workflow was subtly broken (the sudoers part); the rollout should ideally have stopped at the first broken host (see the sketch after this list).
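One way to get that behaviour is a per-host health check that runs immediately after each host is updated and restarted, so a failing host aborts the rollout before it spreads. The snippet below is only an illustrative sketch, not existing scap configuration: the port, endpoint, and retry parameters are assumptions, and wiring it into the deployment would need to follow scap's own check mechanism and documentation.

  #!/bin/bash
  # Illustrative post-deploy health check for a single logstash host.
  # Assumptions: opensearch-dashboards listens on localhost:5601 and serves /api/status.
  set -euo pipefail

  for attempt in $(seq 1 12); do
      if curl -sf -o /dev/null http://localhost:5601/api/status; then
          echo "opensearch-dashboards healthy after ${attempt} attempt(s)"
          exit 0
      fi
      sleep 5
  done

  echo "opensearch-dashboards failed its post-deploy health check; aborting" >&2
  exit 1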

Where did we get lucky?

  • The incident occurred while responders were on hand who were sufficiently familiar with the opensearch-dashboards service and how to remove a problematic plugin.

Links to relevant documentation

Add links to information that someone responding to this alert should have (runbook, plus supporting docs). If that documentation does not exist, add an action item to create it.

Actionables

Create a list of action items that will help prevent this from happening again as much as possible. Link to or create a Phabricator task for every step.

Add the #Sustainability (Incident Followup) and the #SRE-OnFire Phabricator tag to these tasks.

Scorecard

Incident Engagement ScoreCard

Answer each question yes or no, and add notes where useful.

People
  • Were the people responding to this incident sufficiently different than the previous five incidents?
  • Were the people who responded prepared enough to respond effectively?
  • Were fewer than five people paged?
  • Were pages routed to the correct sub-team(s)?
  • Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours.

Process
  • Was the "Incident status" section atop the Google Doc kept up-to-date during the incident?
  • Was a public wikimediastatus.net entry created?
  • Is there a phabricator task for the incident?
  • Are the documented action items assigned?
  • Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence?

Tooling
  • To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented.
  • Were the people responding able to communicate effectively during the incident with the existing tooling?
  • Did existing monitoring notify the initial responders?
  • Were the engineering tools that were to be used during the incident available and in service?
  • Were the steps taken to mitigate guided by an existing runbook?

Total score (count of all “yes” answers above):