SLO/Template instructions/Architectural

From Wikitech

What are the service's dependencies?

Definition: Your service has a dependency on another service if your service can't work correctly when that service isn't working.

Hard and soft dependencies

Every dependency is either a hard dependency or soft dependency. If any of your hard dependencies is completely broken, then your service is completely broken. If a soft dependency is completely broken, then your service operates in a degraded mode, offering either reduced features or reduced performance.

Examples: MediaWiki has a hard dependency on MariaDB; if the core database is unavailable, MediaWiki can only serve errors. But MediaWiki has a soft dependency on Thumbor: for the duration of a Thumbor outage, thumbnails won't appear for images newly added to wiki articles, but everything else will work normally.

Because your service can't work when a hard dependency is broken, your availability can't be higher than that dependency's. If you wait for its response before serving a response of your own, your latency can't be lower than its latency. Thus, your dependencies' SLOs bound what yours can be.
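As a rough illustration of that bound, the sketch below computes a ceiling on availability and a floor on latency from a set of hard dependencies. The dependency names and numbers are hypothetical, and the product form additionally assumes that dependency failures are independent, which is an assumption of this sketch and not a claim made above.

```python
from math import prod

# Hypothetical SLO targets for two hard dependencies (illustrative numbers only).
hard_dependencies = {
    "database": {"availability": 0.999, "latency_ms": 20.0},
    "auth": {"availability": 0.9995, "latency_ms": 5.0},
}


def availability_ceiling(deps):
    """Always-valid bound: no better than the least available hard dependency."""
    return min(d["availability"] for d in deps.values())


def availability_ceiling_if_independent(deps):
    """Tighter bound, ASSUMING the dependencies fail independently."""
    return prod(d["availability"] for d in deps.values())


def latency_floor_ms(deps):
    """If every request waits on each dependency, latency is at least the slowest one."""
    return max(d["latency_ms"] for d in deps.values())


ceiling = availability_ceiling(hard_dependencies)
independent = availability_ceiling_if_independent(hard_dependencies)
floor_ms = latency_floor_ms(hard_dependencies)

print(f"availability ceiling: {ceiling:.4%}")
print(f"ceiling if failures are independent: {independent:.4%}")
print(f"latency floor: {floor_ms} ms")
```

If the dependencies are called one after another instead of in parallel, the latency floor would be closer to the sum of their latencies than to the maximum.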

Direct dependencies and proxies

The most common type of dependency is when your service sends a request to another service, and uses the response to do its own work. If it serves you an error, or exceeds its latency deadline, you have no choice but to serve an error yourself (hard dependency) or do your best without it (soft dependency). We'll call this type of relationship a direct dependency.
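A minimal sketch of the two behaviors, loosely modeled on the MediaWiki example above. Every name here (`DependencyError`, `render_page`, the stub callables) is hypothetical and chosen for illustration; none of it is taken from MediaWiki's code.

```python
class DependencyError(Exception):
    """Raised when a dependency returns an error or misses its deadline."""


def fetch_article(title, database):
    # Hard dependency: without the article text there is nothing to serve,
    # so a failure here propagates to the caller as an error.
    return database(title)


def fetch_thumbnail(title, thumbnailer):
    # Soft dependency: on failure, degrade by omitting the thumbnail.
    try:
        return thumbnailer(title)
    except DependencyError:
        return None


def render_page(title, database, thumbnailer):
    text = fetch_article(title, database)            # may raise DependencyError
    thumbnail = fetch_thumbnail(title, thumbnailer)  # returns None on failure
    return {"title": title, "text": text, "thumbnail": thumbnail}


# Stub dependencies standing in for real services.
def healthy_database(title):
    return f"Text of {title}"


def broken_database(title):
    raise DependencyError("database unavailable")


def healthy_thumbnailer(title):
    return f"{title}.thumb.png"


def broken_thumbnailer(title):
    raise DependencyError("thumbnailer unavailable")


# Soft dependency down: the page still renders, minus the thumbnail.
degraded = render_page("Example", healthy_database, broken_thumbnailer)
print(degraded)
```

Calling `render_page("Example", broken_database, healthy_thumbnailer)` raises `DependencyError` instead, which is the hard-dependency case: the service has nothing useful to return.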

However, not every client-server relationship creates a dependency. If your service is a proxy, then it's doing its job correctly when it faithfully proxies an error message. (There's still an end user having an unsatisfactory experience, so error budgets should be consumed both upstream and downstream of your proxy, but the proxy itself is healthy.)

Distinguishing between a direct dependency and a proxy relationship can be nontrivial, since some proxies cache, mutate, or otherwise act on the response. To tell the difference between the two, ask whether your client cares where you send your traffic. MediaWiki could replace Thumbor with some other thumbnail-generating system, and MediaWiki's clients wouldn't mind as long as thumbnails continued to work. But clients of Envoy, Varnish, or ATS all have a specific destination in mind for their traffic.

This distinction affects your choice of SLIs. In a direct dependency, every error you serve to your users counts against your error budget, even if the error was your dependency's "fault," and likewise time spent waiting for your dependency counts as latency to your users. If this makes it impossible to meet your user-driven SLOs, you may need to reconsider your architecture: the dependency may be insufficiently reliable to depend on.

But for a proxy, a more reasonable SLI for error rate might be "percentage of requests which yielded an error response not proxied from the backend," and the latency SLI might exclude the time spent waiting for the backend. These comparatively forgiving definitions are offset by much tighter targets: since your backend's errors don't count against your error budget, you shouldn't need as large a budget.
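The difference between the two SLI definitions can be sketched over a handful of request records. The `RequestRecord` fields and the rule that any status of 500 or above counts as an error are assumptions made for this illustration, not the schema of any particular proxy's logs.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    status: int               # status code returned to the client
    error_from_backend: bool  # True if an error status was relayed from the backend
    total_ms: float           # total time the client waited
    backend_ms: float         # portion of total_ms spent waiting for the backend


def direct_error_rate(records):
    """Direct-dependency view: every error served to the client counts."""
    errors = sum(1 for r in records if r.status >= 500)
    return errors / len(records)


def proxy_error_rate(records):
    """Proxy view: only errors that were not proxied from the backend count."""
    own_errors = sum(
        1 for r in records if r.status >= 500 and not r.error_from_backend
    )
    return own_errors / len(records)


def proxy_overhead_ms(record):
    """Proxy latency SLI input: time spent outside the backend wait."""
    return record.total_ms - record.backend_ms


# Hypothetical sample: one relayed backend error, one error the proxy originated.
records = [
    RequestRecord(200, False, 120.0, 100.0),
    RequestRecord(503, True, 80.0, 75.0),
    RequestRecord(500, False, 10.0, 0.0),
    RequestRecord(200, False, 60.0, 50.0),
]

print(direct_error_rate(records))
print(proxy_error_rate(records))
print(proxy_overhead_ms(records[0]))
```

On the same traffic the proxy-style error rate is lower than the direct-style one, which is why the proxy's target can be set much tighter.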

Indirect dependencies

Remember, your service has a dependency on another service whenever it can't work correctly while that service isn't working. This relationship can exist even if your service doesn't send requests to the other one; we call this an indirect dependency. One form of indirect dependency is a capacity cache, where a cache enables you to operate your service with less hardware by deduplicating work.

Example: Consider the ATS backend cache, which acts as a capacity cache in front of the application servers. Over 90% of incoming web requests are handled in the CDN without being proxied to MediaWiki at all, and as a result the app server fleet is only provisioned to serve a small fraction of the total load. If an incident in ats-be led to all traffic being forwarded -- for example, imagine a buggy ATS configuration that treats too many kinds of requests as "pass," i.e. uncacheable -- the resulting avalanche of traffic would overwhelm the app servers, causing a complete outage.

Thus the application layer's ability to serve correctly depends on the CDN layer doing the right thing. Surprising conclusion: even though ATS doesn't have a direct dependency on MediaWiki (due to being a proxy), MediaWiki has an indirect dependency on ATS! But that's okay: a failure of that form would be catastrophic, but it is sufficiently low-probability -- or, in other words, ATS's objective for "percentage of cacheable requests served from cache," if it had such an SLI, is sufficiently high -- that this isn't a significant concern relative to MediaWiki's other failure modes.
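The capacity-cache example above can be put in rough numbers. In the sketch below, only the 90% hit ratio echoes the text (which says "over 90%"); the request rate and the 2x provisioning headroom are made-up values for illustration, not measurements of the production fleet.

```python
def backend_load(incoming_rps, cache_hit_ratio):
    """Requests per second that reach the backend after the cache absorbs its share."""
    return incoming_rps * (1.0 - cache_hit_ratio)


incoming_rps = 100_000.0  # hypothetical total request rate
normal_load = backend_load(incoming_rps, 0.90)  # cache working as intended
failed_load = backend_load(incoming_rps, 0.0)   # every request treated as "pass"

# Hypothetical provisioning: twice the normal backend load.
provisioned_rps = 2.0 * normal_load
overload_factor = failed_load / provisioned_rps

print(f"normal backend load: {normal_load:.0f} rps")
print(f"backend load with cache bypassed: {failed_load:.0f} rps")
print(f"overload factor vs. provisioned capacity: {overload_factor:.1f}x")
```

Even with generous headroom, the backend in this sketch would face several times its provisioned capacity once the cache stops absorbing traffic.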

📝 List all dependencies, including links to their respective SLOs if applicable. You don't need to list your dependencies' dependencies, unless you also depend on them directly. For soft dependencies, also describe the expected degradation of service during their unavailability.

Next, move on to Client-facing.