Incidents/2021-07-14 eventgate-analytics latency spike caused MW app server overload
document status: in-review
Summary
While updating EventGate to support Prometheus, Andrew Otto deployed the changes to eventgate-analytics in codfw (the then-active DC). This change removed the prometheus-statsd-exporter container in favor of direct Prometheus support, as added in recent versions of service-runner and service-template-node.
The deploy went fine in the idle "staging" and "eqiad" clusters, but when deploying to codfw, request latency from MediaWiki to eventgate-analytics spiked, which caused PHP worker slots to fill up, which in turn caused some MediaWiki API requests to fail.
The helm tool noticed that the eventgate-analytics deploy to codfw itself was not doing well, and auto-rolled back the deployment:
$ kube_env eventgate-analytics codfw; helm history production
REVISION  UPDATED                   STATUS      CHART             APP VERSION  DESCRIPTION
[...]
4         Wed Jul 14 16:07:12 2021  SUPERSEDED  eventgate-0.3.1                Upgrade "production" failed: timed out waiting for the co...
5         Wed Jul 14 16:17:18 2021  DEPLOYED    eventgate-0.2.14               Rollback to 3
Impact: For ~10 minutes, MediaWiki API clients experienced request failures.
Documentation:
- Grafana: Envoy telemetry
- Grafana: Application Servers dashboard
- Grafana: Envoy telemetry / Upstream latency
Actionables
- Figure out why this happened and fix it. Based on this log message, it seems likely that a bug in the service-runner Prometheus integration caused the Node.js worker process to die. [DONE]
- Further investigation uncovered that require('prom-client') within a worker causes the observed issue. Both service-runner and node-rdkafka-prometheus require prom-client. It was proposed to patch node-rdkafka-prometheus so that the prom-client instance can be passed in.
  - node-rdkafka-prometheus is an unmaintained project, so we have forked it to @wikimedia/node-rdkafka-prometheus and fixed the issue there. Additionally, if this issue in prom-client is fixed, we probably won't need the patch we made to node-rdkafka-prometheus for this fix.
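The idea of passing in the prom-client instance, rather than having the library require it a second time, can be illustrated with a small sketch. This is not the actual patch to node-rdkafka-prometheus: createKafkaMetrics, makeStubClient, and the metric name are hypothetical names chosen for illustration, and the stub client only mimics the general shape of a Gauge so that the example runs without any dependencies.

```javascript
'use strict';

// Hypothetical stand-in for a metrics client. In a real service this would be
// the single prom-client instance that the service framework already loaded.
function makeStubClient() {
  const samples = new Map();
  class Gauge {
    constructor({ name, help }) {
      this.name = name;
      this.help = help;
    }
    // Store the latest value for this gauge, keyed by metric name.
    set(value) {
      samples.set(this.name, value);
    }
  }
  return { Gauge, samples };
}

// Hypothetical library entry point: the caller passes the client in, so the
// library never calls require('prom-client') on its own inside the worker.
function createKafkaMetrics(client) {
  if (!client || typeof client.Gauge !== 'function') {
    throw new TypeError('a metrics client exposing Gauge must be passed in');
  }
  const queueDepth = new client.Gauge({
    name: 'kafka_producer_queue_depth', // illustrative metric name
    help: 'Messages waiting in the producer queue',
  });
  return {
    // Record one value from an (assumed) stats object.
    observe(stats) {
      queueDepth.set(stats.msg_cnt);
    },
  };
}

// Usage: one shared client is created once and handed to the library.
const sharedClient = makeStubClient();
const metrics = createKafkaMetrics(sharedClient);
metrics.observe({ msg_cnt: 7 });
console.log(sharedClient.samples.get('kafka_producer_queue_depth')); // prints 7
```

With this shape, only the caller decides which client module is loaded in the worker, which is the property the proposed patch was after.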