
SLO/Template instructions/Service level objectives

From Wikitech

The reporting period is the time interval over which you assess your service's performance against its SLO, and determine whether or not you met your objective. Although you'll continuously monitor your SLO, the binary success/failure result for the reporting period can be an input to your decisions about work prioritization: if you're already at risk of missing your SLO, you might delay risky deployments and focus on reliability improvements for the remainder of the period.

Every service at the Wikimedia Foundation uses the same reporting period: three calendar months, phased one month earlier than the fiscal quarter. Thus the four SLO reporting quarters are:

  • December 1 - February 28 (or 29)
  • March 1 - May 31
  • June 1 - August 31
  • September 1 - November 30
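As a minimal sketch of the calendar above, the following maps any date to the reporting quarter that contains it. The function name `reporting_quarter` is hypothetical, chosen only for illustration; it is not part of any Foundation tooling.

```python
from datetime import date, timedelta


def reporting_quarter(day: date) -> tuple[date, date]:
    """Return (first_day, last_day) of the SLO reporting quarter containing `day`."""
    # Quarters start on the first of December, March, June, and September.
    # Mapping December to 0 lets integer division pick the quarter.
    shifted = day.month % 12            # Dec -> 0, Jan -> 1, ..., Nov -> 11
    start_month = (shifted // 3) * 3    # 0, 3, 6, or 9
    if start_month == 0:
        start_month = 12
    # January and February belong to the quarter that began the previous December.
    start_year = day.year - 1 if (start_month == 12 and day.month != 12) else day.year
    start = date(start_year, start_month, 1)

    # The quarter ends the day before the next quarter starts.
    next_month = start_month % 12 + 3   # 12 -> 3, 3 -> 6, 6 -> 9, 9 -> 12
    next_year = start_year + 1 if start_month == 12 else start_year
    end = date(next_year, next_month, 1) - timedelta(days=1)
    return start, end


print(reporting_quarter(date(2024, 1, 15)))
# (datetime.date(2023, 12, 1), datetime.date(2024, 2, 29))
```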

By reporting on SLOs every quarter, we can align with the existing cycle of planning and executing work, which we do at the Foundation with quarterly OKRs. Thus a service experiencing reliability problems in one quarter can prioritize efforts to correct them in the next quarter. The one-month offset gives us time to make that determination: if the SLO reporting quarter ended the day before the fiscal quarter started, we wouldn't have time to review SLO performance and take it into account when setting OKRs.

Calculate the realistic targets

Using monitoring data for each of your draft SLIs, review your service's past performance. If everything stays about the way it is now, what's the best performance you can achieve?

Review the list of dependencies you made earlier. Suppose that each of your dependencies exactly meets its SLO. (For dependencies without an SLO yet, or dependencies that habitually miss their SLO, assume that they maintain their historical performance, or worsen slightly but not dramatically.) For each of your draft SLIs, what's the best performance you could achieve?
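One rough way to turn that thought experiment into a number for an availability SLI is to multiply your own availability by each dependency's. This is a sketch under loud assumptions that may not hold for your service: every dependency is a hard dependency on every request, and failures are independent. The function name and the example figures are hypothetical.

```python
from math import prod


def availability_ceiling(own_availability: float, dependency_availabilities: list[float]) -> float:
    """Estimate the best availability you could offer.

    Assumes (hypothetically) that every request needs every dependency and
    that failures are independent, so success probabilities multiply.
    """
    return own_availability * prod(dependency_availabilities)


# Illustrative numbers only: your own code succeeds 99.95% of the time, and
# you have two hard dependencies that exactly meet SLOs of 99.9% and 99.95%.
ceiling = availability_ceiling(0.9995, [0.999, 0.9995])
print(f"{ceiling:.4%}")
```

Soft dependencies, caching, and retries all loosen this bound, so treat the result as a starting point for discussion rather than a guarantee.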

Suppose that during the reporting period, your team accidentally merges one major code or configuration bug. Assume that automated monitoring detects the impact immediately when it's deployed to production, that the first responding engineer immediately decides to roll back the change, and that the rollback process works normally. Review your earlier estimates of how long it would take to resolve the incident. For each of your draft SLIs, what's the best performance you could achieve?
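The arithmetic for that scenario can be sketched as follows, assuming traffic is spread evenly across the reporting period (real traffic rarely is) and that the period is 91 days long (the quarters listed above run 90 to 92 days). The function and parameter names are hypothetical.

```python
def availability_after_incident(
    baseline: float,
    incident_minutes: float,
    failure_fraction: float,
    period_days: int = 91,
) -> float:
    """Estimate availability for the period after one incident.

    baseline:          fraction of requests that succeed outside the incident
    incident_minutes:  time from bad deploy to completed rollback
    failure_fraction:  fraction of requests that fail during the incident
    Assumes uniform traffic, so time share equals request share.
    """
    period_minutes = period_days * 24 * 60
    incident_share = incident_minutes / period_minutes
    # Requests during the incident succeed only if they would have succeeded
    # anyway AND they dodge the bug.
    return baseline * (1.0 - incident_share * failure_fraction)


# Illustrative numbers only: a 30-minute rollback during which every request fails.
print(f"{availability_after_incident(0.9995, 30, 1.0):.4%}")
```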

📝 Fill in a realistic target for each SLI and, if applicable, each request class. If you base your results on any assumptions not already discussed, state them. This is not your final SLO, just one side of a bounding range; it's okay to approximate.

Calculate the ideal targets

Review the list of clients you made earlier. For clients with an SLO of their own, what level of service would they need you to provide in order to meet their SLO? For clients without an SLO (including end users who call your service directly) what level of service would they consider basically satisfactory?

For example, if your 75th-percentile latency went up by 10%, would the effect on your clients be such that you would deprioritize other work to restore it? By 20%?

📝 Fill in an ideal target for each SLI and, if applicable, each request class. If you base your results on any assumptions not already discussed, state them. This is not your final SLO, just the other side of the bounding range; it's okay to approximate.

Sidebar: Why isn't the ideal target 100%?

Errors are bad, right? So why shouldn't your error budget be zero?

It's always good to strive for perfection, but it's unrealistic to plan on it. As a rule of thumb, each additional nine of actual measured availability requires about the same amount of engineering effort: it takes roughly as much work to get from 99% to 99.9% as to improve further to 99.99%, and so on. But for any given service, there's a point of diminishing returns, where the extra sliver of availability is of limited practical benefit, and all that engineering effort would be better spent on other goals, like building new features or resolving technical debt.
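To make the diminishing returns concrete, here is a small sketch that converts an availability target into the downtime it permits over one reporting period. It assumes a purely time-based view of availability and a 91-day period; both are simplifications, and the function name is hypothetical.

```python
def error_budget_minutes(target: float, period_days: int = 91) -> float:
    """Minutes of total unavailability a target allows over the period."""
    return (1.0 - target) * period_days * 24 * 60


for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} allows about {error_budget_minutes(target):.0f} minutes")
```

Each extra nine shrinks the budget tenfold, from roughly 1310 minutes to 131 to 13, while by the rule of thumb above the engineering cost of each step stays about the same.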

Some projects do require 100% availability. In some engineering systems, even a single error can have life-threatening consequences, or disastrous financial cost, or can cause irreparable harm such as leaking users' secrets. Systems like this are possible, but require a different class of effort: their architecture, design, implementation, deployment, and operation are all handled differently and with orders of magnitude more work. For example, NASA coding standards require static upper bounds for every loop and prohibit all recursion, tightly limiting even benign code in order to minimize the potential for certain classes of bugs. The extra effort is justified by the high cost of exceeding a 0% error rate.

At the Wikimedia Foundation, we have no such projects. It's always better not to serve errors, but none of our services have a catastrophic failure mode. By forgoing a 100% availability target, we accept the chance of some volatility in order to free up engineering resources for better use, and by writing down a specific sub-100% target, we ensure that all parties are in agreement on what level of unreliability would require that we prioritize engineering work to correct it. We enable services to form realistic, specific expectations of their dependencies, and to design and operate with those expectations in mind.

Reconcile the realistic vs. ideal targets

Now that you've worked out what SLO targets you'd like to offer, and what targets you can actually support, compare them. If you're lucky, the realistic values are the same as or better than the ideal ones: that's great news. Publish the ideal values as your SLO, or choose a value in between. (Resist the urge to set a stricter SLO just because you can; it will constrain your options later.)

If you're less lucky, there's some distance between the SLO you'd like to offer and the one you can support. This is an uncomfortable situation, but it's also a natural one for a network of dependent services establishing their SLOs for the first time. Here, you'll need to make some decisions to close the gap. (Resist, even more strongly, the urge to set a stricter SLO just because you wish you could.)
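The comparison itself is simple enough to sketch. The example below assumes each target is expressed as a fraction of good requests, so that higher is better; for a target expressed as a raw latency threshold, the comparison would flip. The `DraftTarget` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DraftTarget:
    sli: str
    realistic: float  # best target the service can currently support
    ideal: float      # target clients would consider satisfactory


def needs_reconciliation(t: DraftTarget) -> bool:
    """True when clients want more than the service can currently support."""
    return t.realistic < t.ideal


# Illustrative numbers only.
drafts = [
    DraftTarget("availability", realistic=0.9995, ideal=0.999),
    DraftTarget("latency under 500 ms", realistic=0.95, ideal=0.99),
]
for t in drafts:
    status = "gap to close" if needs_reconciliation(t) else "no gap"
    print(f"{t.sli}: {status}")
```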

One approach is to make the same decisions you would make if you already had an SLO and you were violating it. (In some sense, that's effectively the case: your service isn't reliable enough to meet its clients' expectations, you just didn't know it yet.) That means it's time to refocus engineering work onto the kind of projects that will bolster the affected SLIs. Publish an SLO that reflects the promises you can keep right now, but continue to tighten it over time as you complete reliability work.

The other approach is to do engineering work to relax clients' expectations. If they're relying on you for a level of service that you can't provide, there may be a way to make that level of service unnecessary. If your tail latency is high but you have spare capacity, they can use request hedging to avoid the tail. If they can't tolerate your rate of outages in a hard dependency, maybe they can rely on you as a soft dependency by adding a degraded mode.
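For readers unfamiliar with request hedging, here is a minimal client-side sketch using Python's `asyncio`. The idea is to send one request, and if it hasn't finished after a short delay, send a second and take whichever finishes first. The names `hedged` and `make_request` are hypothetical, and a production version would also need error handling and a cap on the extra load it generates.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def hedged(make_request: Callable[[], Awaitable[Any]], hedge_delay: float) -> Any:
    """Return the first result from a primary request or a delayed backup.

    make_request: zero-argument coroutine function that performs one request
    hedge_delay:  seconds to wait before sending the backup request
    """
    primary = asyncio.create_task(make_request())
    # Give the primary a head start; most requests finish within this window.
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay)
    if done:
        return primary.result()

    # The primary is slow, so send a backup and race the two.
    backup = asyncio.create_task(make_request())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop waiting on the loser
    # If the winner raised an exception, result() re-raises it here.
    return done.pop().result()
```

A common choice is to set `hedge_delay` near a high percentile of normal latency, so that only a small share of requests is ever duplicated. This is why the technique depends on the spare capacity mentioned above.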

Despite the use of "you" and "they" in the last couple of paragraphs, this is collaborative work toward a shared goal. The decision of which approach to take doesn't need to be adversarial or defensive.

You should also expect this work to comprise the majority of the effort involved in the SLO process. Where the earlier steps were characterized by documentation and gathering, here your work is directed at improving the practical reality of your software in production.

Regardless of the approach you take to reconciliation, you should publish a currently-realistic SLO, and begin measuring your performance against it, sooner rather than later. You can publish your aspirational targets too (as long as they're clearly marked as targets you don't currently guarantee to meet) so that other teams can consider them in their longer-term planning. In the meantime, you'll be able to prioritize work to keep from backsliding on the progress you've already made.

📝 Clearly document any decisions you made during reconciliation. Finally, clearly list the agreed SLOs -- that is, SLIs and associated targets. There should be as many SLOs as the number of SLIs multiplied by the number of request classes -- or, if some request classes are ineligible for any guarantee, say which.
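As a quick completeness check, you can enumerate every (SLI, request class) pair that needs a target. The SLI and request class names below are placeholders, and `slo_slots` is a hypothetical helper.

```python
from itertools import product


def slo_slots(slis, request_classes, ineligible=()):
    """List every (SLI, request class) pair that needs a target.

    Request classes named in `ineligible` receive no guarantee and are skipped.
    """
    return [
        (sli, rc)
        for sli, rc in product(slis, request_classes)
        if rc not in ineligible
    ]


# Placeholder names for illustration only.
slis = ["availability", "latency"]
classes = ["read", "write", "batch"]

print(len(slo_slots(slis, classes)))                        # 6
print(len(slo_slots(slis, classes, ineligible=["batch"])))  # 4
```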

Next, move on to Finalizing.