SLO/Template instructions/Operational

Every service experiences an outage sometimes, so the SLO should reflect its expected time to recovery. If the expected duration of a single outage exceeds the error budget, then the SLO reduces to "we promise not to make any mistakes." Relying on such an SLO is untenable.
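For intuition, here is a minimal sketch of that arithmetic. The availability target and outage duration below are hypothetical placeholders, not drawn from any particular service:

```python
# Hypothetical numbers: compare one outage's expected duration against the
# error budget implied by a candidate availability target.

def error_budget_minutes(target: float, window_days: float) -> float:
    """Allowed downtime, in minutes, over the window for a given target."""
    return (1 - target) * window_days * 24 * 60

budget = error_budget_minutes(target=0.999, window_days=90)  # ~130 minutes per quarter
typical_outage = 180  # assumed minutes to detect, diagnose, and mitigate one outage

print(f"quarterly budget: {budget:.0f} min; one typical outage: {typical_outage} min")
if typical_outage >= budget:
    print("A single ordinary outage spends the whole budget: the SLO amounts to 'never fail'.")
```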

Answer these questions for the service as it is, not as it ought to be, in order to arrive at a realistically supportable SLO. Alternatively, you may be able to make incremental improvements to the service as you progress through the worksheet. Resist the temptation to immediately publish a more ambitious SLO than you can actually support, even if it feels like you should be able to support it.

How is the service monitored?

Assuming that an outage ends when engineers mitigate or work around the underlying issue, you should expect the outage to last at least as long as it takes someone to notice and respond to it. If all SLIs are monitored with paging alerts 24x7, this is the expected interval between the start of the outage and the moment a responding engineer is hands-on-keyboard investigating it. (Remember to include any delay associated with the alert itself, such as a sampling interval or rolling-average window.)
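As a rough sketch, the detection phase can be estimated by summing the delays between the start of the outage and a responder being hands-on-keyboard. All durations below are assumptions for illustration, not measurements of any real alerting pipeline:

```python
# Hypothetical delays, in minutes, between an outage starting and an engineer
# actively investigating it. Adjust each term to match the real alerting setup.
detection_delays = {
    "metric sampling interval": 1,
    "rolling-average window before the SLI crosses its threshold": 5,
    "alert evaluation and paging delay": 2,
    "on-call engineer acknowledges and gets hands-on-keyboard": 15,
}

time_to_response = sum(detection_delays.values())
print(f"expected time from outage start to active investigation: ~{time_to_response} min")
```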

If some SLIs don't have paging alerts, this period is likely much longer. For example, if some element of the service is only monitored during working hours, a breakage on Friday evening might go unnoticed all weekend. A single such outage would cause the service to miss a 99% quarterly uptime SLO: 1% of a quarter is only about 22 hours of downtime, while Friday evening to Monday morning is roughly 60.

How complex is the service to troubleshoot?

After an engineer begins working on the problem, how long will it take to identify the necessary mitigative action? This is the least scientific question on this worksheet; it will likely be informed in part by experience.

Questions to consider: Does the team receiving pages for the service also fully understand its internals, or will they have to escalate to the developers and wait for help? If the engineer responding to the page is relatively inexperienced, can they still find all the information they need -- how to interpret monitoring, diagnose problems, and take mitigative action -- in documentation that's complete, up-to-date, and discoverable?

How is the service deployed?

Production incidents are often resolved by rolling out a code or configuration change, so a slower deployment process means a slower resolution. If the normal rollout is intentionally slowed by canary checks, it's reasonable to assume here that they're skipped for a rollback to a known-safe version, as long as such a process exists.

📝 Answer all the operational questions realistically, explaining how long you expect each phase of an outage response to last in the ideal case and why you think so. Consider linking to past incident reports for comparison.
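One way to sanity-check the answers is to add up the expected length of each phase and compare the total against the error budget of a candidate SLO. The phase durations and the 99.9% quarterly target below are placeholders, not recommendations:

```python
# Hypothetical phase estimates, in minutes, for a single well-handled outage.
phases = {
    "detect and page (including alert delays)": 20,
    "diagnose and decide on a mitigation": 30,
    "roll back to a known-safe version": 15,
}

expected_recovery = sum(phases.values())
quarterly_budget = (1 - 0.999) * 90 * 24 * 60  # ~130 minutes at 99.9% over 90 days

outages_per_quarter = quarterly_budget / expected_recovery
print(f"expected recovery per outage: {expected_recovery} min")
print(f"the budget tolerates roughly {outages_per_quarter:.1f} such outages per quarter")
```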

Next, move on to writing the service level objectives.