SRE/Observability/OKR
FY2024/2025 Q1 Hypothesis
If, in Q1, we disable unused Graphite metrics, prioritize migrating metrics that use the db-prefixed data factory, and increase our outreach to other teams and the community, then we will be on track to make Graphite read-only by Q3 FY2024/2025, as evidenced by a 30% increase in migration progress.
Context
We have been running Prometheus in production for several years because it offers several benefits over Graphite, including more robust data labeling, storage, and query capabilities. Migrating MediaWiki (MW) off Graphite keeps us on a supported, scalable metrics platform for more effective, multidimensional metrics analysis and storage. This initiative is fundamental to unifying our metrics, enhancing monitoring, improving MW observability, and reducing tool fragmentation.
Last year, the team set out to test whether a new interface was viable and determined that long-term sustainability required migrating MediaWiki metrics to Prometheus using StatsLib, a new, internally developed, Prometheus-capable metrics interface. By the end of Q2, the team had successfully tested the component in production, and by the end of Q4, the migration was about 41% complete. See the Graphite metrics volume migration dashboard.
As the WMF improves its culture around MW ecosystem sustainability, our goal is to complete the migration of active, production, and in-use (by dashboards or alerts) metrics to Prometheus so that the Graphite cluster can be made read-only by the end of Q3 FY2024/2025.
For this exercise, we define as “in-use” any metric emitted to Graphite that is mapped to a dashboard panel or alert that is active in Grafana. See the Graphite Utilization Dashboard.
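The "in-use" definition above can be sketched as a simple set classification. This is a minimal illustration, not the logic behind the Graphite Utilization Dashboard: the metric names, the target patterns, and the `in_use_metrics` helper are all hypothetical, and shell-style matching from Python's `fnmatch` stands in for Graphite's own wildcard rules (in Graphite a `*` stays within one dot-separated node, whereas here it can span several).

```python
from fnmatch import fnmatchcase


def in_use_metrics(emitted, referenced_patterns):
    """Return the emitted metric names matched by at least one pattern.

    `referenced_patterns` stands for the targets found in active Grafana
    dashboard panels and alerts (an assumption made for this sketch).
    """
    return {
        name
        for name in emitted
        if any(fnmatchcase(name, pattern) for pattern in referenced_patterns)
    }


# Hypothetical metric names and panel/alert targets, for illustration only.
emitted = {
    "MediaWiki.example.requests.count",
    "MediaWiki.example.latency.p99",
    "MediaWiki.legacy.unused.count",
}
referenced_patterns = [
    "MediaWiki.example.requests.*",   # target of a dashboard panel
    "MediaWiki.example.latency.p99",  # target of an alert
]

in_use = in_use_metrics(emitted, referenced_patterns)
unused = emitted - in_use  # candidates for disabling rather than migrating
```

Under this reading, metrics in `in_use` are the ones that must be migrated before Graphite can become read-only, while those in `unused` are candidates for being disabled.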
Community-driven participation is crucial in this effort. This project also facilitates the improvement of our production metrics infrastructure and helps us deprecate legacy systems.
RFC on Prometheus as a better interface for MW metrics: T249164.