notes / availability /slos-slis-error-budgets

· Brittany Ellich · notes  · 5 min read

SLOs, SLIs, and Error Budgets

It would be nice if every application were available 100% of the time, but balanced software engineers understand that perfection is not a realistic goal. As with everything in software, there are trade-offs, and folks need to understand that the trade-offs to get to 100% availability probably aren’t worth it.

“When organizations stop aiming for perfection and accept that all systems will occasionally fail, they stop letting their technology rot for fear of change and invest in responding faster to failure.”

-Marianne Belotti (Kill It with Fire: Manage Aging Computer Systems (and Future Proof Modern Ones))

SLO (Service Level Objective)

An SLO is a target or a goal that defines the desired level of availability. It can also be used to measure latency, throughput, or error rates of a service. The SLO is typically expressed as a percentage (ex. 99.9% uptime) and represents the level of service the team aims to provide for users.

  • Purpose: SLOs are used to define what acceptable service availability looks like, guiding both the development and operations of the system.
  • Example: An SLO might be that an application should be 99.95% available

SLI (Service Level Indicator)

An SLI is a specific metric used to measure the performance of a service against the defined SLO. It is an objective measurement of how well a system is performing in relation to the desired characteristics. If a SLO is the goal, the SLI is how you measure it.

  • Purpose: SLIs provide the data needed to evaluate whether the service is meeting its SLOs. They are often collected continuously and are used as a way to track service health over time.
  • Example: The SLI for the availability of an API could be the percentage of requests that are served vs. the number of overall requests.

Error Budgets

“This is the amount of temporary service degradation that’s deemed acceptable for users. So long as this error budget isn’t exceeded, then riskier deployments – which are those more likely to break the service – might be fine to proceed with. However, once the error budget is used up, pause all deployments that are considered risky.”

-Gergely Orosz (The Software Engineer’s Guidebook)

An Error Budget is the amount of failure allowed for a SLO. It is the difference between perfect availability (100%) and the service’s SLO target. The error budget defines the maximum amount of downtime or failures a service can have while still meeting the SLO.

  • Purpose: The error budget allows for some level of failure, acknowledging that it’s not always realistic to achieve 100% availability. Teams can use the error budget to make decisions about whether to focus on improving availability or shipping new features.
  • Example: If an SLO is 99.5% uptime, then the error budget is the remaining 0.5% of downtime. This might translate into a certain number of minutes of downtime per month or year. If the system has used up too much of the error budget, the team might prioritize availability work over new feature development.

Two panel comic, the left panel shows an error budget that has 40 minutes used out of 3 hours and 37 minutes available, corresponding with 99.5% availability, which says that there's nearly 3 hours of error budget left this month, so we can do the risky change! The right panel shows 40 minutes out of 43 minutes used for 99.9% availability, which says we've nearly used the error budget this month so we can't take any risks.

How They Work Together

  • SLIs are the metrics that reflect system performance.
  • SLOs are the goals or targets for these metrics.
  • Error Budgets quantify how much deviation from the SLO is acceptable before action is needed.

In practice, teams monitor SLIs to determine whether they are meeting their SLOs. If they aren’t, they can track the consumption of the error budget and decide whether to prioritize availability improvements or new features based on how much of the budget has been consumed. If too much of the error budget is consumed, the team might pause feature development to focus on improving the service’s availability.

Example

Let’s say you run a web service with the following definitions:

  • SLO: The service must be available 99.9% of the time.
  • SLI: Percentage of time that the service was available within a month.
  • Error Budget: If the SLO is 99.9%, the error budget is 0.1% of downtime (meaning the service can tolerate up to 0.1% of the time as downtime, which translates to around 43 minutes per month).

If the service exceeds this error budget (such as an unexpected outage), the team might focus on resolving the performance issues instead of adding new features.

This framework helps teams manage availability in a balanced way, ensuring they can meet user expectations while still evolving the system.

How do you define SLOs?

That sounds great, but how do you determine what your SLO should be? It’s generally not a great idea to use historical availability, because that can end up being more strict than necessary if you’ve been really lucky with few outages in the past.

There are a few things to consider for your application:

  1. If you have multiple applications to consider, start with one instead of trying to define SLOs for every application you own. You can add more later but getting the first one is the hardest part!
  2. Think about who your user is and what they will expect from the application in terms of speed and availability.
  3. Consider the critical paths and common ways that users interact with your application. That can help you determine what is the most important to keep available to users.

Resources

Back to Notes