WMDE/Wikidata/Runbooks/Change dispatching/Alert
The Wikidata Change Dispatching mechanism has several alerts associated with it. This runbook describes what to do when one of those alerts fires.
Overview
See WMDE/Wikidata/Dispatching for an explanation of how that process works and which moving parts are involved.
The alerts are defined on the Wikidata Alerts Grafana dashboard: https://grafana.wikimedia.org/d/TUJ0V-0Zk/wikidata-alerts
General advice for alerts
If the alert just fired, have a quick look at the latest messages in #wikimedia-operations (public log) – if there are more widespread issues, no Wikidata-specific action from us may be necessary.
If the alert is ongoing and you're not sure whether it's a Wikidata-specific issue, you can additionally check Wikimedia Status for any messages. If you're in Wikidata-related Telegram groups, you can also check whether anyone has complained there (but those groups aren't publicly logged).
Number of rows in wb_changes table
The wb_changes table on Wikidata serves as the "buffer" from which the DispatchChanges job collects changes to dispatch to the client wikis. If that table keeps growing, that implies there might be a problem with the job, with changes not getting dispatched.
The alert is currently set at 30,000 (30K) rows. TODO: that value needs to be adjusted as it does not reflect the typical behavior of that table. This is tracked in T349196.
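To check the table directly, you can count its rows on the wikidatawiki production database. This is a minimal sketch; the MIN(change_time) part assumes the standard wb_changes schema and shows how old the oldest not-yet-dispatched change is:
MariaDB [wikidatawiki]> SELECT COUNT(*), MIN(change_time) FROM wb_changes;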
Possible causes for spikes in this table in the past
- deployments of changes to the helmfile controlling these things (example) result in a (short) interruption to the queueing and running of jobs, and thus in a spike in the number of rows in the wb_changes table. Usually, that spike is gone quickly.
Delay injecting Recent Changes, aggregated across client wikis
This is the actual time between a change being made on Wikidata and it being inserted into the recent changes of the client wiki. The alert is on the 99th percentile of the metric, aggregated across all client wikis. The SLO for this is 10 minutes; the alerting threshold is 60 minutes.
No Data Alert
This might have a couple of causes:
- no edits are happening on Wikidata (there should be more alerts about that); the query below can quickly rule this out
- the dispatching chain of jobs is broken somehow somewhere
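To rule out the first cause, you can check whether any edits recently reached the recent changes on Wikidata itself. This is a minimal sketch; adjust the rc_timestamp threshold to a few minutes before the alert fired:
MariaDB [wikidatawiki]> SELECT COUNT(*) FROM recentchanges WHERE rc_timestamp >= '20220510120000';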
genuine "Time exceeds threshold" Alerts
This means that the pipeline is still working, but might be overloaded somewhere. To figure out the bottleneck, have a look at the Edit dispatching via jobs Grafana dashboard, and especially at the three job dashboards linked at the top: DispatchChanges, EntityChangeNotification, and wikibase-InjectRCRecords. In those three dashboards, look in particular at the job insertion rate, job processing rate, and normal job backlog time.
(Note that you may have to switch to the current DC at the top-left.)
the immediate root cause has already ceased
In the past, there were occasions when we received this alert and saw an ongoing elevated wikibase-InjectRCRecords "job insertion rate" and a spike in the EntityChangeNotification "job insertion rate". But the latter had already returned to normal by the time we received the alert and started to investigate. All there is to do in such a case is to wait for both the EntityChangeNotification and wikibase-InjectRCRecords job queues to churn through their respective backlogs and return to normal.
the immediate root cause seems to still be active
In the past, this overload of the system was caused by edits to many Items that each had a lot of subscribed wikis.
That can be determined by running the following query against the wikidatawiki production database:
MariaDB [wikidatawiki]> SELECT SUBSTR(rc_timestamp, 1, 10), COUNT(*) FROM recentchanges JOIN wb_changes_subscription ON rc_title = cs_entity_id WHERE rc_namespace = 0 AND rc_timestamp >= '20220510000000' GROUP BY SUBSTR(rc_timestamp, 1, 10);
Note that you need to adjust the value for rc_timestamp! (The query groups by the hour, so use a threshold a couple of hours before the alert started, to have a baseline to compare against.)
This will produce an hourly count of edits to Items that have subscribers (each edit is counted once per subscribing wiki, so the number also reflects dispatch load). If the value is significantly elevated, then that could be the cause.
However, it might also be the case that while the absolute number of edited Items with subscribers is not higher than usual, those that are edited might have many more subscribers than average.
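One way to check that directly is to list the recently edited Items with the most subscribing wikis. This is a sketch, assuming the wb_changes_subscription schema (one row per entity and subscribing wiki); keep the time window small, since the join can be slow:
MariaDB [wikidatawiki]> SELECT rc_title, COUNT(DISTINCT cs_subscriber_id) AS subscribers FROM recentchanges JOIN wb_changes_subscription ON rc_title = cs_entity_id WHERE rc_namespace = 0 AND rc_timestamp >= '20220510000000' GROUP BY rc_title ORDER BY subscribers DESC LIMIT 10;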
Another next step can be to check on a smallish client wiki whether many more changes than usual are coming from Wikidata:
MariaDB [euwiki]> SELECT SUBSTR(rc_timestamp, 1, 10), COUNT(*) FROM recentchanges WHERE rc_timestamp >= '20220509000000' AND rc_source = 'wb' GROUP BY SUBSTR(rc_timestamp, 1, 10);
Note that you again need to adjust the value for rc_timestamp!
If that is indeed the case as well, then you can try to figure out whether all of those changes happen to come from one particularly prolific editor:
MariaDB [euwiki]> SELECT actor_name, COUNT(*) FROM recentchanges JOIN actor ON rc_actor = actor_id WHERE rc_source = 'wb' AND rc_timestamp >= '20220510100000' GROUP BY actor_name ORDER BY COUNT(*) DESC LIMIT 5;
Note that you again need to adjust the value for rc_timestamp!
If that is the case, you could have a look at https://editgroups.toolforge.org/?user=<username> to see what they're doing, and maybe kindly ask them to slow down a bit.
DispatchChanges normal job backlog time (p50, 15min)
This job distributes changes to entities to the client wikis subscribed to them. If the backlog keeps growing, that could mean that not enough capacity is available to run this job as often as needed. See 725936: changeprop-jobqueue: Increase concurrancy of DispatchChanges to 15 for how to increase it, if that is the problem.
The alert is currently set at 10 minutes (600,000 milliseconds) as 10 minutes is the SLO for the duration of the entire process from a change happening at Wikidata to it appearing in the Recent Changes in the client wikis. The typical backlog is between 0.5 seconds and 1 second.
NoData
If there is an alert that includes grafana_state_reason = NoData, then go to the DispatchChanges Job dashboard, make sure the correct data center (DC) is selected in the top-left, and then check the "normal job backlog time (p50, 15min)" panel a bit further down.
Currently, there are a lot of false-positive alerts with "NoData" as the reason. Task T349178 has been created to look into that.