O’Reilly’s seminal book ‘Site Reliability Engineering’ (SRE) describes Google’s approach to building and running massive internet-scale sites. However, the term SRE applies equally to production services of every kind, from .com e-commerce websites and older three-layer, database-centric back-end services to clusters of thousands of micro-services.
SRE is a broad topic, and most people would consider that it covers the full hierarchy from monitoring and alerting up to and including live production experiments.
In this article we will focus on a decision that potentially impacts many of those layers: how best to define the metrics of site reliability, and why that’s a conversation you need to have with your product owners.
A case study
For a recent customer we built an authorization micro-service to intercept every customer request heading into the digital world of a global retail bank. The idea is to check that each customer is permitted to make the given request. A classic example is moving money between accounts: the customer needs the correct access level to the source account. In terms of traffic volume, the full authorization service handles a few million requests per day.
Metrics are often seen as the engineer’s domain. Who cares what you measure, so long as the customer sees reliability and up-time? Product owners will likely raise business features ahead of non-functional aspects. It’s our job to raise the visibility of metrics to their level: to educate and inform them on the value these metrics will provide, and to have an active discussion on what makes a ‘good’ level of service.
Start with a ball-park
It made sense to begin this discussion with some ball-park parameters. How do we choose a useful metric for site-reliability of this micro-service? The obvious choice for our service is traffic per unit time, which is typically termed ‘throughput’.
A service-level indicator (SLI) is a metric such as this, and a service-level objective (SLO) is a target value of it. In our case, in the early days our traffic was around a thousand per minute, and we anticipated growth, so our SLI of ‘authorization service throughput’ now had an objective (SLO) of 50K requests per hour.
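To make the distinction concrete, here is a minimal sketch of an SLI as a measured value and an SLO as the target it is compared against. The class and function names are hypothetical, and treating "at or above the target" as meeting the objective is an assumption that suits a load test, where demand is not the limiting factor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThroughputSLO:
    """The objective: a target value for the throughput indicator."""
    name: str
    target_per_hour: int


def requests_per_hour(request_count: int, window_seconds: float) -> float:
    """The indicator: observed throughput, normalised to requests per hour."""
    return request_count * 3600.0 / window_seconds


def meets_slo(sli_value: float, slo: ThroughputSLO) -> bool:
    """Assumption: the objective is met when throughput reaches the target."""
    return sli_value >= slo.target_per_hour


slo = ThroughputSLO("authorization service throughput", 50_000)
# A thousand requests observed in a one-minute window.
observed = requests_per_hour(request_count=1_000, window_seconds=60)
print(f"{slo.name}: {observed:.0f}/hour, met={meets_slo(observed, slo)}")
```

Normalising to a common unit before comparing avoids mixing per-minute observations with a per-hour objective.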
Refine and qualify
50K/hour. Let’s examine what that means. We’re stating the service can respond to 50K requests per hour. Yet we haven’t said anything about the quality of the service. What’s the request timeout of the client? What limits are in place at the firewall or other parts of the ingress stack? What’s an acceptable time-frame for an individual request? So naturally we define response latency as an integral part of this SLO.
Let’s then consider how our service might fail to meet the SLO. When the service is just warming up, is the same threshold still reasonable? It might be, in which case we may need extra care during rolling upgrades. For our service we defined a temporary, less strict SLO during upgrades, which relaxes the response time to allow for service warm-up.
Should we include non-successful requests, such as bad requests or server-side failures? Typically, such requests respond more quickly than successful ones; however, if we later use statistics from such metrics to drive behaviour, we may skew the results. It’s a trade-off. On balance, given that we expected 99% of requests to succeed, we chose to capture only successful 2xx HTTP status responses in the SLO.
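One way to picture a relaxed objective during upgrades is a threshold that depends on whether a rollout is in progress. This is only a sketch: the names are hypothetical, the 100 ms steady-state value echoes the allowance mentioned later in this article, and the 250 ms upgrade value is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    steady_state_ms: float  # normal response-time threshold
    upgrade_ms: float       # relaxed threshold while instances warm up

    def threshold_ms(self, upgrade_in_progress: bool) -> float:
        """Pick the threshold that applies right now."""
        return self.upgrade_ms if upgrade_in_progress else self.steady_state_ms


def breaches(latency_ms: float, slo: LatencySLO, upgrade_in_progress: bool) -> bool:
    """True when a response is slower than the currently applicable threshold."""
    return latency_ms > slo.threshold_ms(upgrade_in_progress)


# Assumed values, for illustration only.
slo = LatencySLO(steady_state_ms=100.0, upgrade_ms=250.0)
print(breaches(180.0, slo, upgrade_in_progress=False))  # slower than steady state
print(breaches(180.0, slo, upgrade_in_progress=True))   # tolerated during warm-up
```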
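The filtering decision can be sketched in a few lines. The `Response` record and its field names are assumptions made for this example, not the shape of our real log data.

```python
from typing import Iterable, List, NamedTuple


class Response(NamedTuple):
    status: int        # HTTP status code
    latency_ms: float  # time taken to respond


def successful_latencies(responses: Iterable[Response]) -> List[float]:
    """Keep latencies for 2xx responses only, dropping 4xx and 5xx."""
    return [r.latency_ms for r in responses if 200 <= r.status <= 299]


sample = [
    Response(200, 35.0),
    Response(400, 2.0),   # bad request: fast, but excluded
    Response(204, 41.0),
    Response(503, 1.0),   # server-side failure: excluded
]
print(successful_latencies(sample))
```

Note how the two excluded responses are the fastest in the sample, which is exactly why including them would flatter the latency figures.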
Another consideration is the long-tail of the response time graph. No matter how well you plan your service dependencies, there will inevitably be slow responses. Database queries can miss an index. Garbage collection can fire and delay a socket response. Page swaps to disk. These are blips on the graph, and you don’t want to be alerted unless they are the start of some significant failure.
For our situation, we assumed a normal distribution of response times (it’s not) and empirically settled on two standard deviations, which covers roughly 95% of responses, as a reasonable confidence level.
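To see why the normal assumption matters, the sketch below compares a threshold of two standard deviations above the mean with a 95th percentile read directly from the data. The percentile is shown only for comparison and is not what we used; the sample latencies are invented to give a long tail.

```python
import statistics


def two_sigma_threshold(latencies_ms):
    """Mean plus two sample standard deviations (assumes a roughly normal shape)."""
    return statistics.mean(latencies_ms) + 2 * statistics.stdev(latencies_ms)


def empirical_p95(latencies_ms):
    """95th percentile taken from the data, with no distribution assumed."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(latencies_ms, n=20)[-1]


# Invented long-tailed sample: mostly fast responses plus a few slow ones.
latencies = [20.0] * 95 + [400.0] * 5
print(f"two-sigma threshold: {two_sigma_threshold(latencies):.1f} ms")
print(f"empirical p95:       {empirical_p95(latencies):.1f} ms")
```

On skewed data the two numbers can differ considerably, so it is worth checking both against real traffic before fixing an alert threshold.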
Include business context
During early development of this particular service, as we evolved the software to add new features, we found several distinct ‘flavours’ of authorization which, for the benefit of this article, we can classify as simple and complex.
Simple calls could be handled in-service, whereas complex calls often required delegation to other services. Complex requests would therefore quickly eat into the 100 ms allowance with network hops to those dependencies. So it made sense for us to measure these metrics separately, but still maintain the overall customer SLO.
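Measuring each flavour separately while keeping an overall figure can be sketched as follows. The per-flavour thresholds and the sample latencies are assumptions for illustration; only the 100 ms overall allowance comes from our service.

```python
from collections import defaultdict


def breach_rate_by_flavour(requests, thresholds_ms):
    """Return the fraction of requests over threshold, per flavour and overall.

    `requests` is an iterable of (flavour, latency_ms) pairs.
    """
    totals = defaultdict(int)
    breaches = defaultdict(int)
    for flavour, latency_ms in requests:
        # Every request counts towards its own flavour and the overall SLO.
        for key in (flavour, "overall"):
            totals[key] += 1
            if latency_ms > thresholds_ms[key]:
                breaches[key] += 1
    return {key: breaches[key] / totals[key] for key in totals}


# Hypothetical thresholds: simple and complex values are invented.
thresholds = {"simple": 25.0, "complex": 120.0, "overall": 100.0}
requests = [("simple", 12.0), ("simple", 30.0), ("complex", 90.0), ("complex", 140.0)]
print(breach_rate_by_flavour(requests, thresholds))
```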
You might also consider defining higher parameters at peak traffic times, though we had enough head-room in the performance tests to make this less relevant and, in any typical scenario, those peaks simply become the new higher overall targets.
That’s roughly how the discussion went with the product owner. Having the conversation meant the product owner could actively engage and shape the product’s non-functional behaviour.
The system captured metrics from log statements and reported a summary each day.
We found it important that reports are self-describing. Each report should show the environment, the specific SLO/SLI and whether the threshold is breached. That way, if you start to see a trend you can accurately target any remedial effort. The above report further decomposes by business context per row. We emailed reports to various stakeholders, and the contextual information was invaluable when they asked why an SLO had breached. At least we knew which environment!
Occasionally we ran some historical statistics to predict traffic growth and adjust our SLOs. We were seeing significant growth in traffic: over 6 months we passed 2 million requests per day, with most of the growth in simple requests. We adjusted the 10/90% split, raised the throughput SLO to 100K and re-ran performance load tests at larger scale.
We ran performance tests for services before going live and on significant releases. We used the SLO parameters to define performance expectations, so we knew exactly what success looked like.
More importantly, simulated load stress revealed behavioural patterns to look for in our production system. One example was to raise an alert on a combined CPU and memory level on the host virtual machines that hinted the system was running hot.
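A self-describing report row might be produced along these lines. The layout, field names and sample values are hypothetical and do not reproduce our actual report format.

```python
def report_line(environment, context, sli_name, observed, target, higher_is_better):
    """Format one row that names its environment, context, SLI, SLO and status."""
    if higher_is_better:
        breached = observed < target   # e.g. throughput below objective
    else:
        breached = observed > target   # e.g. latency above objective
    status = "BREACHED" if breached else "OK"
    return (f"[{environment}] {context} {sli_name}: "
            f"observed={observed:g} target={target:g} {status}")


# Invented sample rows, one per business context.
print(report_line("production", "simple", "latency ms", 42.0, 100.0, False))
print(report_line("production", "complex", "latency ms", 131.0, 100.0, False))
print(report_line("production", "all", "throughput per hour", 61000, 50000, True))
```

Because every row carries its own environment and target, a reader of a forwarded email does not need the surrounding report to interpret it.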
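A simple way to turn two historical data points into a forecast is a compound monthly growth rate. The sketch below assumes growth compounds evenly, and the starting figure of 1 million requests per day is invented for illustration; only the 2 million figure and the 6-month period come from our experience.

```python
def monthly_growth_rate(start_per_day, end_per_day, months):
    """Compound monthly growth rate between two daily-traffic observations."""
    return (end_per_day / start_per_day) ** (1.0 / months) - 1.0


def project(per_day, rate, months_ahead):
    """Project daily traffic forward, assuming the same rate continues."""
    return per_day * (1.0 + rate) ** months_ahead


# Hypothetical starting point of 1 million per day, six months earlier.
rate = monthly_growth_rate(1_000_000, 2_000_000, months=6)
forecast = project(2_000_000, rate, months_ahead=6)
print(f"monthly growth: {rate:.1%}, six-month forecast: {forecast:,.0f} per day")
```

A forecast like this is only a prompt for the SLO conversation; real traffic rarely grows at a constant rate.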
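A combined alert of this kind can be sketched as a check that both readings stay high for several consecutive samples. The limits and the number of samples are assumptions for illustration, not the values we alerted on.

```python
def running_hot(samples, cpu_limit=0.80, mem_limit=0.75, sustained=3):
    """Return True if CPU and memory are both over their limits
    for `sustained` consecutive samples.

    `samples` is an iterable of (cpu_fraction, memory_fraction) pairs.
    """
    streak = 0
    for cpu, mem in samples:
        if cpu >= cpu_limit and mem >= mem_limit:
            streak += 1
        else:
            streak = 0  # either reading dropped, so start counting again
        if streak >= sustained:
            return True
    return False


# A single CPU spike with normal memory does not alert.
print(running_hot([(0.95, 0.40), (0.50, 0.45), (0.55, 0.50)]))
# Both readings high for three samples in a row does.
print(running_hot([(0.85, 0.80), (0.90, 0.82), (0.88, 0.79)]))
```

Requiring both conditions together, and for a sustained period, is one way to avoid being alerted by the short blips described earlier.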
So, in summary, we found it essential to discuss and agree SLOs and SLIs openly with product owners, so that SRE terms are in their language and are an integral part of the conversation alongside business features. Arm yourself with some basic parameters to begin with, and expect to adjust them in the future.