Performance/Alerts
Alerts in Grafana
We use Grafana for alerting on performance regressions. We collect performance metrics from our real users and store them in Graphite and Prometheus, and we also collect performance metrics from synthetic performance tools in Graphite. We alert on both types of metrics.
History
When we started out we only used RUM to find regressions. Back then (and now) we use https://github.com/wikimedia/mediawiki-extensions-NavigationTiming to collect the data. We collect metrics from a small portion of the users and pass them on to our servers, from where they end up in Graphite (https://graphiteapp.org/) and Prometheus. We collect Navigation Timing, a couple of User Timings, first paint, First Contentful Paint, Largest Contentful Paint and CPU long tasks for browsers that support them.
Finding regressions
The way we found regressions was to closely look at the graphs in Graphite/Grafana. Yep, watching them real close. The best way for us is to compare current metrics with the metrics we had one week back in time; the traffic and usage pattern for Wikipedia is almost the same when comparing 7 days apart. Comparing 24 hours back in time can also work, depending on when you look (weekend traffic is different).
Did we find any regressions? Yes we did. This is what one looked like for us:

Looks good, right? We could actually see that we had a regression on first paint. What is kind of cool is that the human eye is pretty good at spotting differences between two lines.
But we have since moved on to using alerts in Grafana to automate how we find regressions.
Current setup
At the moment we alert on WebPageReplay tests, direct synthetic tests, Navigation Timing metrics, increases in CSS and JavaScript size, Save Timings, and on synthetic tools being down/not working.
Alerts
We have alerts for RUM and synthetic testing. We have different ways of alerting depending on the tool and how we collect the data.
WebPageReplay
We run tests with alerts on enwiki, group 1 (it) and group 0. The idea is to catch performance regressions early.
There are three dashboards today for these alerts. The tests cover Wikipedia on desktop and m.wikipedia using emulated mobile mode (the plan is to move the mobile tests to real Android devices instead).
These alerts work like this: every Sunday we take a baseline. Throughout the coming week, we test our metrics against that baseline using the Mann-Whitney U test and Cliff's delta. Mann-Whitney U is a statistical test that can find differences in metrics even when the data does not follow a normal distribution (performance tests often have outliers, etc.). If a performance regression is pushed and Mann-Whitney U signals a statistically significant change, the alert fires.
Cliff's delta measures the size of the difference between two groups of metrics (the baseline and our test) and categorises the change as small, medium or large. That helps us make sure we do not fire on small changes; today our alerts fire on medium changes. The next Sunday, we take a new baseline. You can also trigger new baselines yourself, either by pushing a configuration change to git, letting all tests create a new baseline, and then pushing another change to git, or by manually configuring new baselines on the server that runs the tests.
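To make the mechanism concrete, here is a minimal sketch of such a baseline check in Python, assuming scipy is available; the sample values, the alpha level and the 0.33 cutoff for a medium effect are illustrative assumptions, not our actual configuration.

  # Minimal sketch of a baseline comparison with Mann-Whitney U and Cliff's delta.
  # The sample data, alpha and effect-size cutoff below are illustrative only.
  from scipy.stats import mannwhitneyu

  def cliffs_delta(baseline, current):
      # Fraction of pairs where current > baseline, minus the reverse.
      gt = sum(1 for b in baseline for c in current if c > b)
      lt = sum(1 for b in baseline for c in current if c < b)
      return (gt - lt) / (len(baseline) * len(current))

  def should_alert(baseline, current, alpha=0.05, medium_effect=0.33):
      # Mann-Whitney U: is the difference statistically significant?
      _, p_value = mannwhitneyu(baseline, current, alternative="two-sided")
      # Cliff's delta: is the change at least a medium-sized effect?
      return p_value < alpha and abs(cliffs_delta(baseline, current)) >= medium_effect

  baseline_first_paint = [980, 1010, 995, 1005, 990, 1000, 1015]    # ms, Sunday baseline
  current_first_paint = [1150, 1180, 1120, 1160, 1140, 1170, 1190]  # ms, latest runs
  print(should_alert(baseline_first_paint, current_first_paint))    # True: clear regression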
For enwiki on desktop we test eight URLs, for en.m three URLs and for group 1 five URLs. Depending on the content of the page it can be easier or harder to see regressions, so running tests on many pages is good and makes it more likely that we catch regressions that affect many users.
Our WebPageReplay alert setup is explained more in Performance/Guides/WebPageReplay alert.
Direct synthetic tests
We also run tests directly against Wikipedia. These tests cover three URLs and look at the median. The alerts are set up to fire if the median increases by X ms. The threshold is a hard limit: if the alert fires, it will continue to fire until the metric drops below the threshold or you change the alert.
We alert on changes on enwiki and en.m.wiki using direct tests. For an alert to fire, all three URLs need to be above their respective thresholds.
Looking at medians used to be our default approach, but a median increase does not have to be statistically significant, so we can fire false alerts (we actually do/did). Metrics for a URL can also change over time as the content of the page changes, while the limit was set hard when we created the alerts. That means that for every alert that fires, you need to look at the metric and check whether it affected all URLs and is a valid alert. You also need to manually set a new hard limit if the pages get a new baseline. The alerts for direct tests are found at https://grafana.wikimedia.org/d/000000318/performance-synthetic-direct-test-alerts
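As a rough sketch of the hard-limit logic (the URL names, sample values and millisecond limits below are made up, not the real configuration):

  # Sketch: the direct test alert fires only when the median for every tested
  # URL is above that URL's hard limit. URLs, samples and limits are placeholders.
  from statistics import median

  limits_ms = {"page_one": 1200, "page_two": 1500, "page_three": 1100}
  samples_ms = {
      "page_one": [1250, 1230, 1260],
      "page_two": [1550, 1540, 1560],
      "page_three": [1150, 1140, 1160],
  }

  fire = all(median(samples_ms[url]) > limit for url, limit in limits_ms.items())
  print(fire)  # True: all three medians are above their limits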
Navigation Timing
Real user alerts have a couple of different setups. Either we compare against the metric value 7 days back and alert if the metric has increased by X ms, or, for the metrics that exist in Prometheus, we alert if the percentage of users in a specific metric bucket has increased (for example, 1% more users with a first paint slower than a given threshold compared to last week).
We alert on our RUM metrics: first paint, TTFB and loadEventEnd. We set the alerts on the p75 and p95 of the metrics we collect and alert on a 5-30% change depending on the metric; some metrics are really unstable and some are better behaved. You can see our RUM alerts at https://grafana.wikimedia.org/d/000000326/navigation-timing-alerts
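A hedged sketch of the percentage-change comparison (the 7-day offset comes from the text above; the threshold and the values are examples only):

  # Sketch: flag a RUM percentile if it increased more than the allowed
  # percentage compared to the same metric 7 days ago. Values are examples.
  def rum_regression(current_ms, week_ago_ms, allowed_increase_pct=10):
      increase_pct = (current_ms - week_ago_ms) / week_ago_ms * 100
      return increase_pct > allowed_increase_pct

  print(rum_regression(current_ms=1320, week_ago_ms=1180))  # True: roughly 12% slower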
Then there are alerts that use data stored in Prometheus. These alerts haven't really been tested by any team, since they were created when the performance team was closed down. They look at buckets and the percentage of users in those buckets. For example, we alert if we have an increase in users with a first paint slower than 1 second compared to last week.
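And a sketch of the bucket-based variant, assuming we can read the bucket and total counts for this week and last week from Prometheus (the counts are invented):

  # Sketch: alert if the share of users with a first paint slower than 1 s grew
  # by more than one percentage point week over week. Counts are invented.
  def bucket_share_alert(slow_now, total_now, slow_last_week, total_last_week,
                         max_increase_points=1.0):
      share_now = slow_now / total_now * 100
      share_last_week = slow_last_week / total_last_week * 100
      return share_now - share_last_week > max_increase_points

  print(bucket_share_alert(slow_now=5200, total_now=40000,
                           slow_last_week=4400, total_last_week=40000))  # True: +2 points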
Before Graphite is closed down (https://phabricator.wikimedia.org/T228380) we need to review the current Prometheus alerts, add more alerts, and make sure we alert on the things we think are necessary, so that we don't miss anything with the sunsetting of Graphite.
Synthetic tools status alerts
We alert if our tools are down or not working as expected. If an alert fires, please go to https://grafana.wikimedia.org/d/frWAt6PMz/performance-synthetic-tool-alerts and see what's failing.
Save Timings
// TODO
JavaScript and CSS increase
We alert on increased JavaScript and CSS sizes. We test three URLs and have hard limits for when to alert. These limits are set individually per URL, and the alert fires only if all three URLs are above their limits. The alerts exist to find generic increases that affect all URLs. That means we need to tune the limits if the size for one of the pages increases.
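A minimal sketch of that per-URL limit check (the kilobyte limits and measured sizes are placeholders):

  # Sketch: the size alert fires only when every tested URL is above its own
  # hard limit, so growth on a single page does not fire it. Values are placeholders.
  size_limits_kb = {"page_one": 350, "page_two": 340, "page_three": 360}
  measured_kb = {"page_one": 365, "page_two": 352, "page_three": 371}

  fire = all(measured_kb[url] > limit for url, limit in size_limits_kb.items())
  print(fire)  # True: all three URLs exceed their limits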
Increased JavaScript size will also increase the number of CPU long tasks. Increased CSS can potentially increase first paint/first visual change and related metrics.
We only run these tests on en.wiki and this is the dashboard.
Console errors
We alert on console errors (JavaScript errors) on the user login journey, mediawiki.org and it.wikipedia.org (both mobile and desktop). As soon as one of the tested pages has a JavaScript error, an alert is fired.
The alerts exist on the direct test dashboard.
Alert emails
The alerts in Grafana will trigger an email. At the moment most synthetic test alerts (except the ones from the web team's dashboard) are sent to the qte team (as set up in Grafana). The alerts fall into four different groups:
- Alerts that fire if the tools aren't working correctly (they fire on missing data). If we get that kind of alert, you need to investigate which tool/server is down and fix it.
- Alerts that fire from WebPageReplay tests. This is a regression; have a look to see if you can understand why, and talk to the web team.
- Alerts that fire from the direct tests. The direct tests are a little more unstable, so check the graphs for the tests: is there a recent increase for all the tested URLs? If so, it's a regression; talk to the web team and check if they know about it.
- Alerts that fire for increased JavaScript or CSS. The thresholds are sometimes too small, so check the graphs. If there is a high increase (a couple of KB or more), check with the web team whether they are aware of it.
Known problems
There are a couple of problems we have seen so far.
Self healing alerts
For some tests we compare X days back (usually 7 days). That means that after 7 days the alert is self-healing, because we will then be comparing against the metric values that set off the alert.
Known non-working queries
We have had problems with nested queries that worked at first but then stopped working (using Graphite's built-in percentage queries). To avoid that, we now build alert queries like this:
Create one query that goes back X days and hide it. Then create another query that divides by the first one, and set the offset to -1. It looks like this:
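The screenshot is not reproduced here, but roughly, the pair of Graphite targets can look like the sketch below; the metric path is a placeholder and you should check the real dashboards for the exact expressions.

  # Rough sketch of the two Grafana targets described above, written out as
  # Graphite expressions in strings. "some.metric.path.p75" is a placeholder.
  query_a = "timeShift(some.metric.path.p75, '7d')"   # query A: same metric 7 days back, hidden
  query_b = "divideSeries(some.metric.path.p75, #A)"  # query B: current value divided by query A
  # The alert condition is then set on query B; a ratio close to 1.0 means no change.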

Creating an alert
In 2021 we moved to using AlertManager in Grafana. To set up an alert for AlertManager you need to make sure you add AlertManager in the "send to" field of the alert.

You also need to add tags to your alerts. To make sure the alerts reach the performance team you need to add two tags: team and severity. The team tag needs to have the value perf and the severity tag should have the value critical or warning (depending on the severity of the alert).
We also use tags to add extra info about the alerts. The following tags are used at the moment but feel free to add more:
- metric - the value shows which metric fired the alert. By tagging the metric, it's easier to find other alerts that fired at the same time.
- tool - which tool fired the alert. Values can be rum (real user measurements), webpagetest, webpagereplay or sitespeed.io.
- dashboard - the link to the current alert dashboard. At the moment Grafana and AlertManager aren't best friends, so you need to provide the link to the dashboard so that you can find the actual dashboard directly from the alert. This is important, please add the link.
- drilldown - a link to another dashboard that holds more info about the metric that fired, for example a dashboard that shows all metrics collected for the URL that failed, or RUM metrics that give more insight into what could be wrong.
In the future we also want to include a link to the current runbook for each alert.
The tags will look something like this when you are done:
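The screenshot is not shown here, but as a rough sketch the finished tag set could look something like this (everything except team and severity is an example value):

  # Example tag set for an alert. Only team/severity values are prescribed above;
  # the rest are illustrative examples.
  tags = {
      "team": "perf",
      "severity": "warning",    # or "critical"
      "metric": "firstPaint",
      "tool": "webpagereplay",  # rum, webpagetest, webpagereplay or sitespeed.io
      "dashboard": "<link to the alert dashboard>",
      "drilldown": "<link to a dashboard with more detail about the metric>",
  }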

When you create your alert you need to make sure that it's evaluated often, so that AlertManager understands that it is the same alert firing. At the moment we evaluate every 3 minutes. If you don't evaluate often enough, alerts will fire and you will get multiple emails for the same alert.