MediaWiki Event Enrichment/SLO/Mediawiki Page Content Change Enrichment
Service
A real-time data processing application that consumes the mediawiki.page_change.v1 topic, performs a lookup join (HTTP) with the Action API to retrieve raw page content, and produces an enriched event into the mediawiki.page_content_change.v1 topic.
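The lookup join described above can be pictured with a minimal Python sketch. This is not the service's implementation: the function names, the injected `fetch_content` callable (standing in for the HTTP call to the Action API), and the `content_body` field are hypothetical and chosen only for illustration.

```python
from typing import Callable, Optional


def enrich(
    page_change: dict,
    fetch_content: Callable[[dict], Optional[str]],
) -> Optional[dict]:
    """Attach raw page content to a page_change event.

    fetch_content is a stand-in for the HTTP lookup against the Action API.
    Returns None when no content could be retrieved.
    """
    content = fetch_content(page_change)
    if content is None:
        return None
    # Copy the input so the consumed event is left untouched.
    enriched = dict(page_change)
    enriched["content_body"] = content  # hypothetical field name
    return enriched
```

Passing the lookup in as a callable keeps the sketch independent of any HTTP client, so the join logic can be exercised without network access.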
Teams
The Event Platform value stream is responsible for this service.
Architectural
Environmental dependencies
Mediawiki Page Content Change Enrichment runs on k8s (wikikube) in the codfw and eqiad data centers.
Mediawiki Page Content Change Enrichment requires the Flink Kubernetes Operator to be deployed on the host k8s cluster.
Service dependencies
The Mediawiki Page Content Change Enrichment application consumes from and produces to the Kafka main clusters, and also produces to the Kafka jumbo-eqiad cluster. The application issues HTTP requests to the Mediawiki Action API.
Client-facing
Clients
The service clients are consumers of the mediawiki.page_content_change.v1 stream. Clients will only interact with the service via that stream.
Service Level Indicators (SLIs)
- Enriched events percentage (availability): the percentage of mediawiki.page_change.v1 events consumed that resulted in an enriched event being produced into mediawiki.page_content_change.v1. This is the share of events that have been successfully enriched.
- Excessive topic lag: the percentage of time that the Kafka topic lag is above a threshold (TBD). This SLI informs about eventual service latency, which would cause page_content_change messages to lag behind page_change.
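As an illustration of the two SLIs above, here is a minimal Python sketch of how they could be computed from raw counts and lag samples. The function names and inputs are assumptions made for this example, the lag samples are assumed to be evenly spaced in time, and the threshold is a placeholder since the real value is still TBD.

```python
from typing import Sequence


def enriched_events_percentage(consumed: int, produced: int) -> float:
    """Percentage of consumed page_change events that produced an enriched event."""
    if consumed == 0:
        # Nothing was consumed, so nothing failed to be enriched.
        return 100.0
    return 100.0 * produced / consumed


def excessive_lag_percentage(lag_samples: Sequence[float], threshold: float) -> float:
    """Percentage of evenly spaced lag samples that are above the threshold."""
    if not lag_samples:
        return 0.0
    over = sum(1 for lag in lag_samples if lag > threshold)
    return 100.0 * over / len(lag_samples)
```

For example, 198 enriched events out of 200 consumed gives an availability of 99%, and one sample out of four above the threshold gives an excessive lag of 25%.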
Operational
Monitoring
Mediawiki Page Content Change Enrichment emits timeseries metrics (counters, gauges) for all SLIs. They are available in Grafana.
Troubleshooting
Mediawiki Page Content Change Enrichment depends on Kafka and the Action API. Operational errors are expected to be correlated to the performance of either system.
Mediawiki Page Content Change Enrichment emits errors (exceptions, invalid records, HTTP timeouts after retries) into a Kafka error topic: <DC>.mediawki_page_content_change_enrichment_error.
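The retry-then-report behaviour described above could look roughly like the following Python sketch. It is not the service's code: `process`, `fetch_content`, `max_attempts`, the `content_body` field and the shape of the error record are all hypothetical, and the returned label only indicates which topic (enriched or error) a record would be sent to.

```python
from typing import Callable, Optional, Tuple


def process(
    event: dict,
    fetch_content: Callable[[dict], str],
    max_attempts: int = 3,
) -> Tuple[str, dict]:
    """Try the content lookup up to max_attempts times (assumed >= 1).

    Returns ("enriched", record) on success, or ("error", record) once
    every attempt has failed.
    """
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            content = fetch_content(event)
            return "enriched", {**event, "content_body": content}
        except Exception as exc:  # e.g. an HTTP timeout
            last_error = exc
    # All attempts failed: build a record destined for the error topic.
    error_record = {
        "event": event,
        "error_type": type(last_error).__name__,
        "message": str(last_error),
    }
    return "error", error_record
```

A real deployment would likely add backoff between attempts and distinguish retryable from non-retryable failures; both are left out here to keep the sketch short.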
As of 2023-05, no support SLA is provided. File a bug at https://phabricator.wikimedia.org/project/view/1474/ and the Event Platform team will follow up within 24 hours (on work days). In case of an outage, deleting and re-applying the deployment is considered within SLO targets.
This may change once we 'release' the mediawiki.page_content_change.v1 stream, hopefully in early FY 2023-2024.
Deployment
The service is deployed with deployment-charts. See MediaWiki_Event_Enrichment#mw-page-content-change-enrich
Service Level Objectives
Realistic targets
A realistic target for availability would be that 80% of processed messages are enriched, with no particular upper bound on latency.
A realistic target for Kafka topic lag would be that, 80% of the time, the max lag is within the desired threshold (TBD).
Ideal targets
An ideal target for availability would be that 99% of processed messages are enriched, with no particular upper bound on latency.
An ideal target for Kafka topic lag would be that, 99% of the time, the max lag is within the desired threshold (TBD).
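To make the targets concrete, the following Python sketch classifies a measured percentage against the 80% (realistic) and 99% (ideal) targets. The names are hypothetical. Because the lag SLI is expressed as time spent above the threshold while the targets are expressed as time spent within it, the sketch converts one into the other first.

```python
REALISTIC_TARGET = 80.0  # percent, from the realistic targets
IDEAL_TARGET = 99.0      # percent, from the ideal targets


def classify(percentage: float) -> str:
    """Return which target, if any, a measured percentage satisfies."""
    if percentage >= IDEAL_TARGET:
        return "ideal"
    if percentage >= REALISTIC_TARGET:
        return "realistic"
    return "missed"


def time_within_lag_threshold(excessive_lag_percentage: float) -> float:
    """Convert 'percent of time above threshold' to 'percent of time within it'."""
    return 100.0 - excessive_lag_percentage
```

For instance, an availability of 85% meets only the realistic target, while a topic that spends 0.5% of the time above the lag threshold is within it 99.5% of the time and meets the ideal target.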
Reconciliation
Erroneous responses from the Mediawiki Action API, or changes on databases that can't be captured by EventBus hooks, will impact the availability of enriched events. MediaWiki Event Enrichment#mediawiki.page content change semantics describes failure scenarios for the enrichment application. While we expect retry-on-error logic to address the majority of API-related issues, some of them might require clients to reconcile the stream.
There are known, sporadic cases in which database mutations will not result in an event being published to Kafka (e.g., a maintenance SQL script that UPDATEs a database). The enrichment application will not be able to handle those cases.
Exploratory analysis of the (backfilled) data we have collected so far (June 2023) suggests that significantly less than 1% of events are impacted by failures that will require reconciliation.