integrating SRE from day 1

There are a few big questions that always come up in software engineering: ‘What should I build?’, ‘How should I build it?’ and ‘How do I know what I built is working?’

There are also some tried-and-true methods that have emerged out of the tech landscape. For ‘What should I build?’ we know we can use the techniques outlined in Lean Startup to figure out the right thing to build. For ‘How should I build it?’ we can use the methodology of Extreme Programming, or another framework that we like.

The interesting question here is ‘How do I know what I built is working?’ We may think the answer is tests, but the trouble is, tests describe how our app should work. They verify – in a highly controlled test environment – that our app does in fact work that way. But tests don’t cover everything. Here are some problems that a typical test suite won’t catch:

  • Our app is not configured to reconnect to Redis when it gets disconnected
  • We have the wrong value for one of our production env vars and the app won’t be able to run
  • Our app can’t handle the amount of traffic it’s about to receive

The more robust answer to ‘How do I know what I built is working?’ is site reliability engineering. SRE is the discipline through which we create scalable and highly reliable software systems, and it also includes things like real-time observability of those systems.

The next question that comes up is ‘When do we bring in SRE?’ When a project is just starting up with nothing built out, SRE can seem a bit superfluous. We need data and metrics to build out basic SRE, and yet we also don’t want to build our system blind – we want to understand how our system behaves as we’re building it. It’s a bit of a catch-22.

Most people start worrying about SRE when the ‘day 2’ problems are about to roll in, usually right before an internal or external launch. These problems include:

  • Getting paged at 2 am
  • Seeing the system strain as the company grows
  • Generally living with the consequences of the software choices we’ve made

The trouble is, by the time there’s an impending launch or those day 2 problems have started showing up, it’s already too late.

If we wait that long, the inevitable scramble begins and the backlog ends up full of logging and observability tickets. But much in the same way that it’s better to write tests as we write our code, it’s also better to build observability and the principles of SRE into our code and our project from the get-go. Adding them as an afterthought leads to patchy coverage and weeks of ‘system hardening’.

Now that we see the need for incorporating SRE early on, the question becomes, ‘How do we bring in SRE from the get-go?’ So much of SRE requires data around uptime, metrics, etc. – things that we don’t have when a project is just starting up. There are no users and no traffic, no way to get a baseline understanding. Or so it seems.

What follows is a guide on incorporating SRE from day 1 so that we don’t have to scramble once those day 2 problems start showing up.

treat metrics and logging as first-class citizens.

The first building block of any SRE strategy is understanding what our system is doing. This requires us to treat metrics and logging as first-class citizens by including them as requirements in each story (no matter how early we are in the build process). We have to favor things like exposure logging early on, and set up our observability tools from day 1, much in the same way we would our CI/CD pipeline. That way, as the app grows, our observability naturally grows with it.
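To make that concrete, here’s a minimal sketch of what ‘metrics as a story requirement’ might look like on the JVM using Micrometer; the service, the meter names, and the registry wiring are illustrative assumptions rather than a prescription.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

// Illustrative only: a service that ships with its own metrics from the very first story.
public class CheckoutService {

    private final Counter checkoutAttempts;
    private final Counter checkoutFailures;
    private final Timer checkoutLatency;

    public CheckoutService(MeterRegistry registry) {
        // Meter names are hypothetical; pick whatever fits your domain and naming conventions.
        this.checkoutAttempts = Counter.builder("checkout.attempts").register(registry);
        this.checkoutFailures = Counter.builder("checkout.failures").register(registry);
        this.checkoutLatency = Timer.builder("checkout.latency").register(registry);
    }

    public void checkout(String orderId) {
        checkoutAttempts.increment();
        checkoutLatency.record(() -> {
            try {
                placeOrder(orderId); // hypothetical business logic
            } catch (RuntimeException e) {
                checkoutFailures.increment();
                throw e;
            }
        });
    }

    private void placeOrder(String orderId) {
        // real checkout logic would go here
    }
}
```

The specific meters aren’t the point – the point is that the instrumentation lands in the same story as the feature, so observability never trails behind the code.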

use actuator endpoints to create a baseline for uptime.

In order to make use of things like error budgets, SLAs, etc., we need a baseline understanding of our app’s uptime. We can establish this by making use of an actuator endpoint that runs through a critical code path in the app. This endpoint can query the database, exercise some service logic, or even perform a mutative action (like a purchase) in the system, as long as it undoes that work as part of the same flow.
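In a Spring Boot app, for example, one way to sketch this is a custom HealthIndicator surfaced by the actuator at /actuator/health; the OrderRepository here is a hypothetical stand-in for whatever critical code path we want to exercise.

```java
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// Sketch of a health check that walks a critical code path; Spring Boot's actuator
// exposes it at /actuator/health alongside the built-in checks.
@Component
public class CriticalPathHealthIndicator implements HealthIndicator {

    private final OrderRepository orders;

    public CriticalPathHealthIndicator(OrderRepository orders) {
        this.orders = orders;
    }

    @Override
    public Health health() {
        try {
            long count = orders.count(); // a cheap query that proves DB connectivity and service wiring
            return Health.up().withDetail("orderCount", count).build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}

// Hypothetical stand-in for whatever repository or service the critical path touches.
interface OrderRepository {
    long count();
}
```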

Then, we can set up API monitoring for this endpoint by hitting it every X minutes and keeping track of what response we get back. We’ll consider a 200 response as the app being ‘up’, and a non-200 response as the app being ‘down’. When we graph this data over time, we now have basic observability and a baseline for uptime.
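A monitoring vendor will usually do this polling for us, but the idea is simple enough to sketch directly; the URL and the five-minute interval below are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal uptime probe: hit the health endpoint on a schedule and record up/down.
// In practice each sample would go to our observability tool instead of stdout.
public class UptimeProbe {

    private static final URI HEALTH_URL = URI.create("https://example.com/actuator/health"); // placeholder

    public static void main(String[] args) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(5))
                .build();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        scheduler.scheduleAtFixedRate(() -> {
            boolean up;
            try {
                HttpRequest request = HttpRequest.newBuilder(HEALTH_URL)
                        .timeout(Duration.ofSeconds(10))
                        .GET()
                        .build();
                HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
                up = response.statusCode() == 200; // 200 means 'up', anything else means 'down'
            } catch (Exception e) {
                up = false; // timeouts and connection errors also count as 'down'
            }
            System.out.println(Instant.now() + " " + (up ? "UP" : "DOWN"));
        }, 0, 5, TimeUnit.MINUTES);
    }
}
```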

In this case, it doesn’t matter that we have no users yet – we’ve instead created our own traffic in the app, which gives us the data we need to start implementing further SRE practices.

create an error budget.

Error budgets are valuable because they clearly delineate the point at which a team will pause feature work to focus on service reliability. Now that we have our uptime baseline, we can start to introduce error budgets into the team process.

It’s also worth calling out that this is an ideal time to get our product manager on board with this way of working. It’s much harder to convince a PM to abide by an error budget when the app is live and there are external pressures at play. By introducing the concept and getting the team on board early, that conversation happens at a time when there’s less project pressure and a greater likelihood of PM buy-in. Then, when the rubber hits the road and the pressure is on, the team is already used to thinking in terms of an error budget and working with it.

The key to doing this well is holding ourselves to those error budgets even when the only user is our PM. We can incorporate data from the error budget into IPMs (iteration planning meetings), and when we haven’t consumed that budget over a given period, we consume it intentionally by spending that time on SRE-related work and exploratory testing. This allows us to find the weak points in our app ahead of time so that we can fix the issues before they impact users.
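To make the budget itself concrete, the arithmetic is straightforward. Here’s a rough sketch assuming a 99.5% monthly uptime objective – a number chosen purely for illustration, not a recommendation.

```java
import java.time.Duration;

// Back-of-the-envelope error budget math. The 99.5% target and the observed
// downtime are examples; plug in whatever the team actually agrees on and measures.
public class ErrorBudget {

    public static void main(String[] args) {
        double sloTarget = 0.995;               // 99.5% uptime objective
        Duration window = Duration.ofDays(30);  // budget window

        // Total allowed downtime for the window.
        Duration budget = Duration.ofSeconds(
                (long) (window.toSeconds() * (1.0 - sloTarget)));

        // Suppose we've already observed some downtime this window (hypothetical figure).
        Duration observedDowntime = Duration.ofMinutes(45);

        Duration remaining = budget.minus(observedDowntime);
        double consumedPct = 100.0 * observedDowntime.toSeconds() / budget.toSeconds();

        System.out.println("Budget for the window: " + budget.toMinutes() + " minutes"); // ~216 minutes
        System.out.println("Consumed so far:       " + Math.round(consumedPct) + "%");
        System.out.println("Remaining:             " + remaining.toMinutes() + " minutes");
    }
}
```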

define SLOs and SLAs.

Let’s define these terms quickly.

  • SLO: an internal objective on metrics like uptime, responsiveness, etc. (what we’re aiming for)
  • SLA: an agreement between provider and client on metrics like uptime, responsiveness, etc. (what we’re promising)

How do we incorporate these early on? First, we have to set some goals with the team. Even if we don’t have users, we can still keep track of how often we hit our goals. This primes the team to think about things like uptime not simply in the abstract but as something that is actively measured and that will eventually have a direct and tangible impact on users.
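One lightweight way to keep track of how often we hit those goals is to fold the probe samples from earlier into a simple attainment check; the target and the sample data below are invented for illustration.

```java
import java.util.List;

// Sketch: compute SLO attainment from the up/down samples collected by the uptime
// probe. The sample data here is fabricated purely to show the calculation.
public class SloReport {

    public static void main(String[] args) {
        double sloTarget = 0.995; // the internal objective we're aiming for

        // One boolean per probe run: true = got a 200, false = anything else.
        List<Boolean> samples = List.of(true, true, true, false, true, true, true, true, true, true);

        long upCount = samples.stream().filter(s -> s).count();
        double attainment = (double) upCount / samples.size();

        System.out.printf("Measured uptime: %.1f%%%n", attainment * 100); // 90.0% for this fake data
        System.out.printf("SLO target:      %.1f%%%n", sloTarget * 100);
        System.out.println(attainment >= sloTarget ? "SLO met" : "SLO missed");
    }
}
```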

write postmortems.

Postmortems are vital once we’re in production, but they’re also a good practice to get into before going live. With that said, it can be hard to know what kind of bug or problem warrants a postmortem when an app is in its early stages.

Ultimately, the answer to that question is up to the team. The clearest way is to define specific cases where we’ll write a postmortem (see the sketch after this list), e.g.

  • When an issue drops a single day’s uptime by Y%
  • When an issue consumes X% of the weekly error budget
  • When something slowly drains the error budget over time
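Here’s a rough sketch of how those triggers could be encoded as a daily check; the thresholds, field names, and data shape are all assumptions, not recommendations.

```java
// Illustrative postmortem-trigger check. The thresholds mirror the bullet list above;
// the concrete numbers and the DailyStats shape are made up for the sketch.
public class PostmortemTriggers {

    record DailyStats(double uptimePct, double weeklyBudgetConsumedPct, double sevenDayBudgetTrendPct) {}

    static boolean needsPostmortem(DailyStats stats) {
        boolean uptimeDrop = stats.uptimePct() < 99.0;                // a single day's uptime dropped by too much
        boolean budgetSpike = stats.weeklyBudgetConsumedPct() > 25.0; // one issue consumed a big slice of the weekly budget
        boolean slowDrain = stats.sevenDayBudgetTrendPct() > 50.0;    // something is slowly draining the budget over time
        return uptimeDrop || budgetSpike || slowDrain;
    }

    public static void main(String[] args) {
        DailyStats today = new DailyStats(98.2, 10.0, 20.0); // fabricated sample
        System.out.println(needsPostmortem(today) ? "Write a postmortem" : "No postmortem needed");
    }
}
```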

explore chaos engineering.

If the team loves this, we can keep going!

Once the team is primed to think about the application through the lens of SRE, a whole world of opportunities opens up. We have the space to introduce chaos engineering, game days, and a host of other exercises that let us proactively stress our system and uncover issues.
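As a taste of what that can look like, here’s a toy fault injector we might wrap around a downstream call during a game day; real chaos tooling goes much further, and the failure rate and added latency here are made up.

```java
import java.util.Random;
import java.util.function.Supplier;

// Toy fault injector for a game-day exercise: wrap a call and occasionally make it
// fail or slow down, so the team can watch how the system (and the dashboards) react.
public class FaultInjector {

    private final Random random = new Random();
    private final double failureRate;   // e.g. 0.05 = fail roughly 5% of calls
    private final long addedLatencyMs;  // extra latency injected on every call

    public FaultInjector(double failureRate, long addedLatencyMs) {
        this.failureRate = failureRate;
        this.addedLatencyMs = addedLatencyMs;
    }

    public <T> T call(Supplier<T> downstream) {
        try {
            Thread.sleep(addedLatencyMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (random.nextDouble() < failureRate) {
            throw new RuntimeException("injected failure (game day)");
        }
        return downstream.get();
    }
}
```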

By the end of all this, we can feel confident that the team is well equipped to deal with those day 2 problems, without even having experienced them yet.