Incidents/2025-03-01 mw-content-history-reconcile-enrich
document status: in-review
Summary
| Incident ID | 2025-03-01 mw-content-history-reconcile-enrich | Start | 2025-03-01 14:36:00 |
|---|---|---|---|
| Task | T387906 | End | 2025-03-02 08:08:45 |
| People paged | 0 | Responder count | 1 |
| Coordinators | @tchin | Affected metrics/SLOs | https://wikitech.wikimedia.org/wiki/SLO/MediaWiki_Content_History |
| Impact | Downstream consumers of the data lake table wmf_content.mediawiki_content_history_v1 may have seen a reduced reconciliation rate. | | |
…
A Flink taskmanager ran out of memory and hit the limit on the number of restarts, failing the pipeline.
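Whether repeated taskmanager failures kill the whole job is governed by Flink's restart strategy: once the configured number of attempts is exhausted, the job transitions to FAILED, as happened here. A minimal sketch of the relevant flink-conf.yaml settings (values are illustrative, not the ones deployed for this job):

```yaml
# Flink restart strategy (flink-conf.yaml); values are illustrative.
restart-strategy: fixed-delay
# Give up and fail the job after this many restart attempts.
restart-strategy.fixed-delay.attempts: 10
# Wait between attempts so a transient failure has time to clear.
restart-strategy.fixed-delay.delay: 30 s
```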
Timeline
All times in UTC.
- 2025-03-01 04:01 Alert fires repeatedly: MediawikiContentHistoryReconcileEnrichHighKafkaConsumerLag
- 2025-03-01 14:36 OUTAGE BEGINS
- 2025-03-01 14:36 Alert fires: MediawikiContentHistoryReconcileEnrichTaskManagerNotRunning
- 2025-03-01 17:25 @tchin starts investigating and restarts the job
- 2025-03-02 00:14 Job fails; @tchin bumps taskmanager replicas to 12 and restarts again
- 2025-03-02 05:45 Job fails; @tchin bumps taskmanager replicas to 12 and doubles taskmanager memory
- 2025-03-02 08:08 Kafka consumer lag falls to 0, backpressure cleared
- 2025-03-02 08:08 (Voila) OUTAGE ENDS
Detection
Labels
alertname = MediawikiContentHistoryReconcileEnrichTaskManagerNotRunning
app = flink-app-production
component = jobmanager
host = 10_67_24_119
instance = 10.67.24.119:9999
job = k8s-pods
kubernetes_namespace = mw-content-history-reconcile-enrich
kubernetes_pod_name = flink-app-production-7f7df4c48c-fjj9d
pod_template_hash = 7f7df4c48c
prometheus = k8s-dse
release = production
routed_via = production
severity = critical
site = eqiad
source = prometheus
team = data-engineering
type = flink-native-kubernetes
Annotations
dashboard = https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s-dse&var-namespace=mw-content-history-reconcile-enrich&var-helm_release=production&var-operator_name=All&var-flink_job_name=All
description = The mw-content-history-reconcile-enrich Flink cluster in eqiad has no registered TaskManagers
runbook = TODO
summary = The mw-content-history-reconcile-enrich Flink cluster in eqiad has no registered TaskManagers
Source
Conclusions
Root cause was found and discussed on this Slack thread. In summary, the high throughput was generated by the monthly reconcile DAG, which checks against all of wiki history:
```
presto:wmf_content> SELECT count(1) AS count, computation_dt, computation_class
                    FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
                    WHERE computation_dt = TIMESTAMP '2025-03-01'
                    GROUP BY computation_dt, computation_class
                    ORDER BY count DESC;

  count  |     computation_dt      | computation_class
---------+-------------------------+-------------------
 6313306 | 2025-03-01 00:00:00.000 | all-of-wiki-time
   11531 | 2025-03-01 00:00:00.000 | last-24h
(2 rows)
```
Note above that we typically process around 11K events per day, yet on 2025-03-01 we also had 6.3M additional events. We expect this number to go down over time, but for now the known fix is to double the memory.
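In plain Flink configuration terms, the memory doubling amounts to a change along the following lines; the sizes here are hypothetical, and the actual change lives in the deployment-charts patch linked under Actionables:

```yaml
# Taskmanager memory in flink-conf.yaml form; sizes are hypothetical.
# Before: taskmanager.memory.process.size: 2048m
taskmanager.memory.process.size: 4096m  # doubled to absorb the monthly all-of-wiki-time reconcile burst
```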
Actionables
- For long-term visibility on issues like this, keep the data in wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 for 180 days: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1098.
- Keep the doubled memory until further notice: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1124504.
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | | |
| | Were the people who responded prepared enough to respond effectively? | | |
| | Were fewer than five people paged? | | |
| | Were pages routed to the correct sub-team(s)? | | |
| | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | | |
| | Was a public wikimediastatus.net entry created? | | |
| | Is there a phabricator task for the incident? | | |
| | Are the documented action items assigned? | | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | | |
| | Did existing monitoring notify the initial responders? | | |
| | Were the engineering tools that were to be used during the incident available and in service? | | |
| | Were the steps taken to mitigate guided by an existing runbook? | | |
| | Total score (count of all “yes” answers above) | | |