SLO/Template instructions/Service level indicators

Service level indicators (SLIs) are the metrics you'll use to evaluate your service's performance. An example of an SLI is "Percentage of requests that receive a non-5xx response." The full set of SLIs, combined with the numeric targets we'll select later on, makes up the SLO.
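
As a purely illustrative Python sketch (the function name and input format are invented, not part of the template), the example SLI above could be computed from a window of observed HTTP status codes like this:

```python
# Illustrative only: compute "percentage of requests that receive a
# non-5xx response" over a window of observed HTTP status codes.
def availability_sli(status_codes: list[int]) -> float:
    """Return the percentage of requests with a non-5xx response."""
    if not status_codes:
        return 100.0  # no traffic in the window: nothing failed
    good = sum(1 for code in status_codes if not 500 <= code <= 599)
    return 100.0 * good / len(status_codes)

print(availability_sli([200, 200, 503, 200]))  # 75.0
```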

SLIs should be:

  • directly client-visible. Measure symptoms, not causes: each SLI should reflect the client service's (or user's, if user-facing) perception of service performance. It should be impossible for an SLI to significantly worsen without any clients observing a degradation in service.
  • comprehensive. It should be impossible for a client to observe a degradation in service without any SLIs significantly worsening, unless some element of the service is intentionally not covered by the SLO. (For example, a logs-processing system could measure the percentage of items processed eventually but make no guarantees about how quickly, in which case processing latency might not be one of its SLIs.)
  • under your control. The purpose of your SLO is to help you know how to prioritize reliability work. That means that if the performance measured by your SLIs declines, you should be able to identify engineering work to improve reliability; there shouldn't be SLIs that you're powerless to affect. This won't be absolute, because your dependencies' reliability will always be able to affect your own. As a thought exercise, suppose that all your dependencies meet their SLOs, but you still measure a decline in your SLI (considering each SLI in turn). If that scenario is either mathematically impossible or would not be actionable, the SLI may not be useful.
  • aligned with overall service health. Consider the developer motivations that will emerge from your SLIs. Avoid perverse incentives. For example, if the only latency SLI is median request latency, then 50% of requests have no coverage in the SLO. That would incentivize developers to disregard tail latency, even though it may be key to user perception of service quality.
  • fully defined and empirically determined. Starting from a common set of data, parties should always agree about how to calculate the SLI. Try to avoid ambiguity: does "99% of traffic" mean 99% by request count, or 99% of bytes? Does a day mean any 24-hour period, or a UTC calendar day? Does request latency mean the time to first byte or last byte? Does it include network time? (The sketch after this list shows how two readings of the same phrase can diverge.)
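
To make the last point concrete, here is an illustrative Python sketch (the data and field layout are invented) showing that "fraction of traffic that succeeded" gives different answers on the same inputs depending on whether traffic is counted by requests or by bytes:

```python
# Illustrative only: "fraction of traffic that succeeded" differs
# depending on whether traffic is counted by requests or by bytes.
requests = [
    # (response_bytes, succeeded)
    (100, True),
    (100, True),
    (100, True),
    (10_000, False),  # one large failed response
]

by_count = 100.0 * sum(ok for _, ok in requests) / len(requests)
total_bytes = sum(size for size, _ in requests)
by_bytes = 100.0 * sum(size for size, ok in requests if ok) / total_bytes

print(f"success by request count: {by_count:.1f}%")  # 75.0%
print(f"success by bytes:         {by_bytes:.1f}%")  # ~2.9%
```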

Sometimes, multiple service characteristics can be combined into one SLI. For example, a service could define a Satisfactory response as being a non-error response served within a particular latency deadline. Then a single SLI, defined as "percentage of eligible requests which receive a Satisfactory response," captures both errors and latency. A spike in either latency or error rate would impact the service's availability as measured by this SLI. This approach is well suited to services whose clients have particularly sharp latency requirements, such that if the server takes longer than a certain period to respond, it might as well not have responded at all.
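
As an illustration of this combined approach, the following Python sketch (with an invented record format and an example 250 ms deadline) counts a request as Satisfactory only when it is both a non-error and within the deadline, so a slow success and a fast error both count against the SLI:

```python
# Illustrative only: a single "Satisfactory response" SLI that folds
# latency and errors into one percentage.
DEADLINE_MS = 250  # example value; a real SLO would define this


def satisfactory_sli(requests: list[tuple[int, int]]) -> float:
    """requests: (http_status, latency_ms) pairs.

    Return the percentage of requests receiving a Satisfactory
    response: non-5xx status AND within the latency deadline.
    """
    if not requests:
        return 100.0
    ok = sum(
        1
        for status, latency_ms in requests
        if status < 500 and latency_ms <= DEADLINE_MS
    )
    return 100.0 * ok / len(requests)


# A slow success (200 in 900 ms) and an error (503) both fail:
print(satisfactory_sli([(200, 40), (200, 900), (503, 12)]))  # ~33.3
```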

Some standard options for SLIs are listed below. Where possible, copy them into your SLO rather than writing your own from scratch.

Choose the ones that make sense for your service and for the available monitoring data; don't take all of them. For example, the different latency SLIs are alternative formulations of the same idea, so keeping more than one would be redundant.

  • Latency SLI, percentile: The [fill in]th percentile request latency, as measured at the server side. (A sketch of both latency formulations follows this list.)
  • Latency SLI, acceptable fraction: The percentage of all requests that complete within [fill in] milliseconds, measured at the server side.
  • Availability SLI: The percentage of all requests receiving a non-error response, defined as [fill in, e.g. "HTTP status code 200", or "'status': 'ok' in the JSON response body", etc].
  • Combined latency-availability SLI: The percentage of all requests that complete within [fill in] milliseconds and receive a non-error response, defined as [fill in as above].
  • Proxy latency SLI, percentile: The [fill in]th percentile of request latency contributed by the proxy, excluding backend wait time.
  • Proxy latency SLI, acceptable fraction: The percentage of all requests where the latency contributed by the proxy, excluding backend wait time, was within [fill in] milliseconds.
  • Proxy availability SLI: The percentage of all requests receiving a non-error response, or where the proxy accurately delivered an error response originating at the backend. Errors originating at the proxy are measured as [fill in, e.g. "HTTP status code 503"].
  • Proxy combined latency-availability SLI: The percentage of all requests where the latency contributed by the proxy, excluding backend wait time, is within [fill in] milliseconds and the request receives a non-error response or an error response originating at the backend. Errors originating at the proxy are measured as [fill in as above].
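
For reference, here is an illustrative Python sketch of the two latency formulations above, using a simple nearest-rank percentile and invented sample values; in practice you would compute these from your monitoring data rather than in ad-hoc code:

```python
# Illustrative only: the two latency SLI formulations, computed over
# a window of server-side latency samples.
import math


def percentile_latency(latencies_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=99 for the 99th percentile."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(pct / 100.0 * len(ordered))
    return ordered[max(rank - 1, 0)]


def within_deadline(latencies_ms: list[float], deadline_ms: float) -> float:
    """Percentage of requests completing within deadline_ms."""
    ok = sum(1 for ms in latencies_ms if ms <= deadline_ms)
    return 100.0 * ok / len(latencies_ms)


samples = [12.0, 30.0, 45.0, 51.0, 300.0]
print(percentile_latency(samples, 99))  # 300.0
print(within_deadline(samples, 100.0))  # 80.0
```
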
📝 Copy a selection of appropriate SLIs into your document, and fill in the blanks. If necessary, add (and fully define) any other SLIs appropriate to your service -- and consider adding them here if they may be useful to others. In all cases, if they can be measured in Grafana, link to a graph for each. Don't set numeric targets yet; we'll think about that next.

Next, move on to Operational.