Incidents/2025-03-01 mw-content-history-reconcile-enrich
document status: in-review
Summary
| Incident ID | 2025-03-01 mw-content-history-reconcile-enrich | Start | 2025-03-01 14:36:00 |
|---|---|---|---|
| Task | T387906 | End | 2025-03-02 08:08:45 |
| People paged | 0 | Responder count | 1 |
| Coordinators | @tchin | Affected metrics/SLOs | https://wikitech.wikimedia.org/wiki/SLO/MediaWiki_Content_History |
| Impact | Downstream consumers of the data lake table wmf_content.mediawiki_content_history_v1 may have seen a reduced reconciliation rate. | | |
…
A Flink taskmanager ran out of memory and hit the limit on the number of restarts, failing the pipeline.
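Whether repeated taskmanager failures kill the whole job is governed by Flink's restart strategy: once the configured number of attempts is exhausted, the job transitions to FAILED, as happened here. A minimal sketch of the relevant flink-conf.yaml settings (values are illustrative, not the ones deployed for this job):

```yaml
# Flink restart strategy (flink-conf.yaml); values are illustrative.
restart-strategy: fixed-delay
# Give up and fail the job after this many restart attempts.
restart-strategy.fixed-delay.attempts: 10
# Wait between attempts so a transient failure has time to clear.
restart-strategy.fixed-delay.delay: 30 s
```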
Timeline
All times in UTC.
- 2025-03-01 04:01 Alert fires repeatedly: MediawikiContentHistoryReconcileEnrichHighKafkaConsumerLag
- 2025-03-01 14:36 OUTAGE BEGINS
- 2025-03-01 14:36 Alert fires: MediawikiContentHistoryReconcileEnrichTaskManagerNotRunning
- 2025-03-01 17:25 @tchin starts investigating and restarts the job
- 2025-03-02 00:14 Job fails; @tchin bumps taskmanager replicas to 12 and restarts again
- 2025-03-02 05:45 Job fails; @tchin bumps taskmanager replicas to 12 and doubles taskmanager memory
- 2025-03-02 08:08 Kafka consumer lag falls to 0, backpressure cleared
- 2025-03-02 08:08 (Voila) OUTAGE ENDS
Detection
Labels
alertname = MediawikiContentHistoryReconcileEnrichTaskManagerNotRunning
app = flink-app-production
component = jobmanager
host = 10_67_24_119
instance = 10.67.24.119:9999
job = k8s-pods
kubernetes_namespace = mw-content-history-reconcile-enrich
kubernetes_pod_name = flink-app-production-7f7df4c48c-fjj9d
pod_template_hash = 7f7df4c48c
prometheus = k8s-dse
release = production
routed_via = production
severity = critical
site = eqiad
source = prometheus
team = data-engineering
type = flink-native-kubernetes
Annotations
dashboard = https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?orgId=1&var-datasource=eqiad%20prometheus/k8s-dse&var-namespace=mw-content-history-reconcile-enrich&var-helm_release=production&var-operator_name=All&var-flink_job_name=All
description = The mw-content-history-reconcile-enrich Flink cluster in eqiad has no registered TaskManagers
runbook = TODO
summary = The mw-content-history-reconcile-enrich Flink cluster in eqiad has no registered TaskManagers
Source
Conclusions
Root cause was found and discussed on this Slack thread. In summary, the high throughput was generated by the monthly reconcile DAG, which checks against all of wiki history:
```
presto:wmf_content> SELECT count(1) AS count, computation_dt, computation_class
                    FROM wmf_content.inconsistent_rows_of_mediawiki_content_history_v1
                    WHERE computation_dt = TIMESTAMP '2025-03-01'
                    GROUP BY computation_dt, computation_class
                    ORDER BY count DESC;

  count  |     computation_dt      | computation_class
---------+-------------------------+-------------------
 6313306 | 2025-03-01 00:00:00.000 | all-of-wiki-time
   11531 | 2025-03-01 00:00:00.000 | last-24h
(2 rows)
```
Note above that we typically process around 11K events per day, yet on 2025-03-01 we also had 6.3M additional events. We expect this number to go down over time, but for now the known fix is to double the memory.
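In plain Flink configuration terms, the memory doubling amounts to a change along the following lines; the sizes here are hypothetical, and the actual change lives in the deployment-charts patch linked under Actionables:

```yaml
# Taskmanager memory in flink-conf.yaml form; sizes are hypothetical.
# Before: taskmanager.memory.process.size: 2048m
taskmanager.memory.process.size: 4096m  # doubled to absorb the monthly all-of-wiki-time reconcile burst
```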
Actionables
- For long-term visibility on issues like this, keep the data in wmf_content.inconsistent_rows_of_mediawiki_content_history_v1 for 180 days: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1098.
- Keep the doubled memory until further notice: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1124504.
Scorecard
| | Question | Answer (yes/no) | Notes |
|---|---|---|---|
| People | Were the people responding to this incident sufficiently different than the previous five incidents? | | |
| | Were the people who responded prepared enough to respond effectively? | | |
| | Were fewer than five people paged? | | |
| | Were pages routed to the correct sub-team(s)? | | |
| | Were pages routed to online (business hours) engineers? Answer “no” if engineers were paged after business hours. | | |
| Process | Was the "Incident status" section atop the Google Doc kept up-to-date during the incident? | | |
| | Was a public wikimediastatus.net entry created? | | |
| | Is there a phabricator task for the incident? | | |
| | Are the documented action items assigned? | | |
| | Is this incident sufficiently different from earlier incidents so as not to be a repeat occurrence? | | |
| Tooling | To the best of your knowledge was the open task queue free of any tasks that would have prevented this incident? Answer “no” if there are open tasks that would prevent this incident or make mitigation easier if implemented. | | |
| | Were the people responding able to communicate effectively during the incident with the existing tooling? | | |
| | Did existing monitoring notify the initial responders? | | |
| | Were the engineering tools that were to be used during the incident available and in service? | | |
| | Were the steps taken to mitigate guided by an existing runbook? | | |
| | Total score (count of all “yes” answers above) | | |