As engineers, we often need to analyse multiple statistics to get an idea of system behavior under different traffic conditions. Sometimes we do it for staging, sometimes for production.
Now when we talk about data analysis, different functions come into the picture: mean (average), median, standard deviation, and percentiles. Out of all of these, percentiles and averages are the two used most often. Many times when we develop a monitoring dashboard or create a report, the customer asks for the 90th percentile. This is common across most of the monitoring systems we have, like Graphite and Prometheus. This post is all about how percentiles can trick you in some conditions.
Why are averages confusing?
Over the last few years, people have accepted averages without much deeper inspection. Averages can be misleading when it comes to monitoring. If you are looking at averages, then most of the time you are ignoring outliers, which might matter a lot. Some common issues with averages are –>
- Averages hide outliers, so you don’t consider them in your performance analysis.
- Outliers skew averages, so in a system with outliers, the average doesn’t represent typical behavior.
So when you consider the average of a system with erratic behavior, you get the worst of both scenarios: you see neither the typical behavior, since the average is skewed by outliers, nor the unusual behavior, since the outliers themselves are hidden.
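To make this concrete, here is a minimal Python sketch with made-up latency numbers, showing how a single outlier drags the mean away from the typical value:

```python
from statistics import mean, median

# Hypothetical response times in ms: nine normal requests and one outlier.
latencies = [100, 102, 98, 101, 99, 103, 97, 100, 102, 3000]

print(f"mean   = {mean(latencies):.1f} ms")    # 390.2 ms, skewed by the outlier
print(f"median = {median(latencies):.1f} ms")  # 100.5 ms, the typical request
# The mean represents neither the typical request (~100 ms)
# nor the unusual one (3000 ms).
```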
Why do we go for percentiles?
Percentiles are the most common solution to the problems associated with averages. Say we look at the 90th percentile of web page load time in production: we interpret this as 90% of page loads taking less time than the 90th-percentile value, while the remaining 10% of samples take longer. We go for percentiles because –>
- They bound the worst experience. For example, the 90th percentile is the worst experience seen by 90% of all hits.
- They give freedom in handling outliers. We can go for the 90th, 95th, or 99th percentile, and include or exclude outliers according to our optimization goals and sample size.
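As a rough illustration, here is how a 90th percentile of page-load times could be computed and read; the sample values are made up, and numpy is just one convenient way to do it:

```python
import numpy as np

# Hypothetical page-load times in seconds.
loads = np.array([0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2, 0.9, 1.1, 6.5])

p90 = np.percentile(loads, 90)
print(f"90th percentile = {p90:.2f} s")
# Interpretation: ~90% of page loads completed faster than this value;
# the remaining ~10% (here, the 6.5 s load) took longer.
print(f"fraction below p90: {(loads < p90).mean():.0%}")
```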
So now we think: since averages are bad and percentiles are great, let’s calculate percentile values and put them in a time-series database. Right?
Problems with percentiles
Most of the time when we use percentiles with a time-series database, we don’t get what we expect, and that leads to wrong interpretations. A time-series database works by storing aggregated metrics over a time range, not every individual event. It does this to overcome the performance challenges of the big data problem.
- Averaging of data in a time-series database is implicit. Based on your resolution, time range, or zoomed-in area, the database will aggregate data points, so each pixel in a chart can represent the average of many hits. This is implicit and hidden from users.
- When we have a huge amount of data, old data is archived at a lower resolution in the time-series database. Based on our selection, the database returns that lower-resolution data (sketched below).
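A minimal sketch of what such downsampling effectively does: collapse each time bucket into its average. The bucket size and samples below are illustrative, not any particular database’s behavior:

```python
# Hypothetical 1-second samples downsampled into 5-second buckets,
# the way a TSDB consolidates points for storage or display.
samples = [100, 101, 99, 100, 5000,   # first bucket contains a spike
           102, 98, 100, 101, 99]     # second bucket is normal

bucket_size = 5
downsampled = [
    sum(samples[i:i + bucket_size]) / bucket_size
    for i in range(0, len(samples), bucket_size)
]
print(downsampled)  # [1080.0, 100.0] -- the 5000 ms spike is smeared
# into a 1080 ms average; at even lower resolution it may vanish entirely.
```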
Averaging percentiles doesn’t work, since the whole idea of a percentile is to include outliers and capture the unusual behavior of the system along with the typical behavior. There is a complete breakdown in the math here: a percentile must be computed over all raw events, and an average of percentiles is useless.
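A quick demonstration with synthetic numbers that the average of per-interval 90th percentiles is not the 90th percentile of the combined raw data:

```python
import numpy as np

rng = np.random.default_rng(42)
# Two hypothetical 1-minute windows of latency samples with different shapes.
minute1 = rng.exponential(100, 1000)   # mostly fast
minute2 = rng.exponential(400, 1000)   # mostly slow

avg_of_p90s = np.mean([np.percentile(minute1, 90),
                       np.percentile(minute2, 90)])
true_p90 = np.percentile(np.concatenate([minute1, minute2]), 90)

print(f"average of per-minute p90s: {avg_of_p90s:.0f} ms")
print(f"p90 over all raw events:    {true_p90:.0f} ms")
# The two values disagree; only the second one is a real percentile.
```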
Percentiles are also very expensive to compute. The answer lies in the process of computing them: most importantly, we need all raw data points, then we sort them, and then we pick the value at a specific index according to the percentile we want. As this data grows, we face the most common big data problem.
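Sketched naively in Python (using the nearest-rank method), the computation makes the cost visible: every raw event must be kept and sorted before a single value can be picked.

```python
import math

def percentile(values, p):
    """Naive exact percentile: keep ALL raw points, sort them (O(n log n)),
    then index. Memory and CPU grow with the full event stream."""
    ordered = sorted(values)                      # needs every raw event
    rank = math.ceil(p / 100 * len(ordered)) - 1  # nearest-rank index
    return ordered[max(rank, 0)]

print(percentile([15, 20, 35, 40, 50], 90))  # 50
```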
How to use percentiles effectively?
Potential solutions to overcome these problems include –>
- Instead of computing the 90th percentile over a long duration, compute it for chunks of time and plot a histogram instead of a line chart, with each chunk represented by a bar (see the sketch after this list).
- Don’t blindly trust the 90th-percentile metrics reported by monitoring tools. Use their feature of exporting raw metrics where available.
- Always look at the same data at multiple granularity levels to get a better idea.
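As a sketch of the first suggestion, here is one way to compute a per-chunk 90th percentile and plot one bar per chunk; the chunk sizes and synthetic data are assumptions for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
# Six hypothetical 10-minute chunks of raw latency samples.
chunks = [rng.exponential(scale, 600) for scale in (90, 95, 100, 300, 110, 95)]

p90s = [np.percentile(c, 90) for c in chunks]  # one exact p90 per chunk
plt.bar(range(len(p90s)), p90s)
plt.xlabel("10-minute chunk")
plt.ylabel("p90 latency (ms)")
plt.title("Per-chunk 90th percentiles")
plt.show()
# Each bar is a real percentile over that chunk's raw events --
# no averaging of percentiles anywhere.
```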