Uptime. Are you monitoring it correctly?

By Aashish Bajpai
August 31, 2024

Uptime is a common term in the SRE world, and an SLA is usually a guarantee of some level of availability. But what exactly is uptime, and how do you measure it?

The answer may vary depending on your goals. Software is complex. Higher levels of abstraction add complexity; in other words, simplicity introduces another kind of complexity. For example:

If your service returns 200/OK does that mean it’s up?
If a customer tried to reach your service when it was down, was it really down?
Your service works for one customer and it doesn’t work for another customer. Is it really up?
Your service takes more than 10 seconds to respond. Is it really up?
You accepted a request and acknowledged it but dropped it on the way. Are you really up? What if the client retries and succeeds the second time? What if it succeeded after 5 retries and the user never noticed because it happened in the background?
What if the user received a timeout and believed it failed but some work happened in the backend and writes to the database were successful?
What if your application returned a response within SLA limits, but the response turned out to be incorrect or stale? That is a side effect of excessive caching in pursuit of availability and performance.
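The retry question above can be made concrete with a small sketch. It assumes each attempt carries a hypothetical `operation_id` shared by all retries of one user action, and it treats any status below 500 as a success; both are illustrative choices, not a standard. The same traffic then yields two different availability numbers depending on whether you count attempts or user-visible operations.

```python
from collections import defaultdict
from typing import NamedTuple


class Attempt(NamedTuple):
    operation_id: str  # hypothetical ID shared by all retries of one user action
    status: int        # HTTP status code of this attempt


def attempt_availability(attempts):
    """Fraction of individual attempts that succeeded (assumes a non-empty list)."""
    ok = sum(1 for a in attempts if a.status < 500)
    return ok / len(attempts)


def operation_availability(attempts):
    """Fraction of operations where at least one attempt succeeded."""
    succeeded = defaultdict(bool)
    for a in attempts:
        succeeded[a.operation_id] = succeeded[a.operation_id] or a.status < 500
    return sum(succeeded.values()) / len(succeeded)


attempts = [
    Attempt("op-1", 200),  # succeeded first time
    Attempt("op-2", 503),  # failed once...
    Attempt("op-2", 200),  # ...then the retry succeeded
    Attempt("op-3", 503),  # failed
    Attempt("op-3", 503),  # and the retry failed too
]

print(attempt_availability(attempts))    # 2 of 5 attempts succeeded
print(operation_availability(attempts))  # 2 of 3 operations eventually succeeded
```

Neither number is the right one by default. Which of them matches what your users experienced is the decision you have to make.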

We may have different perspectives and answers to these questions, but at the end of the day the only thing that matters is the user experience. If your users are happy and think the service is up, it is up. And if, despite all your engineering effort, a stable application still does not deliver a good user experience, it is down.

You might be tracking uptime from traces at the load-balancer level, calculating it as the ratio of successful requests to total requests. But wait: what counts as a successful request? Is it 200 OK? What if you returned a couple of 5xx responses but your application was still able to serve users efficiently? This is exactly why your SLA should clearly define, for your users, how each of these situations is treated.
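As a minimal sketch of why the definition matters, the snippet below computes the successful-to-total ratio under an explicit success rule. The `Request` record, the "status below 500" rule, and the 10-second latency limit (borrowed from the question earlier in this post) are assumptions for illustration; your SLA would supply the real ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int       # HTTP status code seen at the load balancer
    latency_s: float  # time to respond, in seconds


def is_success(req, max_latency_s=10.0):
    """Assumed rule: not a server error AND answered within the latency limit."""
    return req.status < 500 and req.latency_s <= max_latency_s


def availability(requests, max_latency_s=10.0):
    """Ratio of successful to total requests; None if there was no traffic."""
    if not requests:
        return None
    good = sum(is_success(r, max_latency_s) for r in requests)
    return good / len(requests)


requests = [
    Request(200, 0.2),   # fast and OK
    Request(200, 12.0),  # OK status, but slower than the limit
    Request(404, 0.1),   # client error, counted as served here
    Request(503, 0.3),   # server error
]

print(availability(requests))                              # latency counts against you
print(availability(requests, max_latency_s=float("inf")))  # status code only
```

With the latency limit, two of the four requests count as successful; on status code alone, three do. The traffic is identical and only the definition changed.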

Think about different variables

Are you sharding your data on a variable such as userId or accountId? Then it becomes even more complex. You may have different availability numbers for different user paths. Your availability tests might be calculating uptime against one 100% healthy shard while many other unhealthy shards go unmonitored.
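A rough sketch of the sharding problem, assuming each request record can be tagged with the shard that served it (the shard names and the traffic split below are made up): an aggregate number can look healthy while one shard is completely down.

```python
from collections import defaultdict


def availability_by_shard(requests):
    """requests: iterable of (shard_id, succeeded) pairs."""
    totals = defaultdict(int)
    good = defaultdict(int)
    for shard_id, succeeded in requests:
        totals[shard_id] += 1
        good[shard_id] += int(succeeded)
    return {shard: good[shard] / totals[shard] for shard in totals}


# Hypothetical traffic: a busy healthy shard and a small, fully broken one.
requests = [("shard-a", True)] * 98 + [("shard-b", False)] * 2

overall = sum(ok for _, ok in requests) / len(requests)
per_shard = availability_by_shard(requests)

print(overall)    # aggregate across all shards
print(per_shard)  # the same data, broken down by shard
```

Here the aggregate is 98%, yet every user who lands on `shard-b` sees a service that is fully down. The same breakdown applies to any other dimension you partition by, such as feature or region.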

Features

Development moves fast today, and there may be multiple builds in a single day. If any customer uses a feature you provide, it must be monitored and included in the uptime calculation. I have seen many organizations overlook other features in pursuit of a better user experience for critical ones. Yes, the criticality of a feature matters for the uptime calculation, but criticality also changes over time.

Who is the culprit? Is it a cloud provider?

Your customer really doesn’t care whether you host your application on AWS, Azure, or an on-prem server. Resilience is your responsibility. It doesn’t matter whether the failure was at the DNS level or the infrastructure level. Should you feel responsible? Well, you should, because you chose the vendor.

A user in one geographic location might be able to use your site while a user in another location cannot, for various reasons. Cloud-specific issue? Well, go for a multi-cloud architecture. Network issues? Go for dedicated network paths. Each of these adds its own cost. At the end of the day, it is a trade-off that you need to define based on your goals.

