
Load Tests

Load testing is something that every service needs. While every service is different, I find that many share a similar pattern. If you find yourself building an API or an RPC server, you will probably end up with something that looks like this post. If you haven't run a load test before, give it a read. If you have and want a refresher, give it a read. You know what, give this a read; you might learn something!

First off, load tests are hard. Trying to predict how a service behaves under load is tricky. We can use Little's law to reason about how much traffic each node can take, but that requires estimates of latency, and latency commonly increases under load - making the estimates rather shaky. You can run simulations or build a probabilistic model of responses, but those aren't "real". Luckily, AWS is designed to spin up transient compute quickly (and relatively cheaply), so running a load test that simulates your clients is relatively straightforward. In fact, once you have your set of steps, a load test doesn't take more than a day - so do it before any peak day, and at least once or twice a year.
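To make the Little's law point concrete, here is a rough sketch of that back-of-the-envelope math in Python. The concurrency and latency numbers are made up purely for illustration; the whole point of the post is that the real numbers have to be measured.

```python
import math

# Back-of-the-envelope capacity math using Little's law (L = lambda * W).
# All of the numbers below are made-up assumptions for illustration; real
# values come from measurement, and latency grows under load, which is
# exactly why the load test matters.

def max_tps_per_host(concurrent_capacity: float, avg_latency_s: float) -> float:
    """Little's law rearranged: lambda = L / W."""
    return concurrent_capacity / avg_latency_s

def hosts_needed(target_tps: float, per_host_tps: float, headroom: float = 0.5) -> int:
    """Hosts required to serve target_tps while only using `headroom` of each host."""
    return math.ceil(target_tps / (per_host_tps * headroom))

if __name__ == "__main__":
    per_host = max_tps_per_host(concurrent_capacity=20, avg_latency_s=0.100)  # ~200 TPS/host
    print(f"estimated per-host TPS: {per_host:.0f}")
    print(f"hosts for 1000 TPS at 50% utilization: {hosts_needed(1000, per_host)}")
```

The estimate falls apart as soon as latency starts climbing under load, which is exactly what the load test below is for.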

The ultimate goal of the load test is assurance. Most systems fail in one of two ways: as throughput increases, goodput drops to 0 at some threshold (a blackout), or as throughput increases, goodput degrades (a brownout). The goal of the load test is to know where those occur and prevent them. A common misconception is that there is a single "maximum" TPS that a service can handle. That is true to some extent, but your service likely has two different maximums. It makes sense if you reason about it. How much traffic can your service serve with 10 hosts? How much with 100 hosts? If you had 10 hosts and tried to serve 100 hosts' worth of traffic, what would happen? Sure, you have auto-scaling enabled on some semi-reasonable metric, but auto-scaling takes time.

Traffic shaping is what glues everything together. Our system has limits: how much traffic a node can take, how long it takes to add a node, how many nodes can handle traffic, and so on. Traffic shaping lets us … well … shape traffic so we do not brown out or black out. We can control how fast any customer can call us, how frequently new customers onboard, how many customers we have, and the maximum TPS across all customers. By imposing these limits on our customers, we actually keep our customers happy. Customers like stable systems. Would you rather commute to work in a car that does 0-60 in 2.8s but starts once a week, or a car that starts reliably every day but doesn't go very fast? Sure, the 2.8s car is fun to drive, but unless driving it is your job, it is not reliable enough.

Traffic shaping can be implemented in a few different ways, but the how does not really matter. What matters is that you understand how your system works and its limits so that you can design a traffic shaping policy around your system. Some traffic shaping will set throughput limits, some will use token buckets, some use special backoffs - the technique doesn’t matter so long as you configure it based upon your learnings.
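For a concrete flavor of one of those techniques, here is a minimal token-bucket sketch in Python. The rate and burst values are placeholders; in a real system they come out of the load test described below.

```python
import time

class TokenBucket:
    """A minimal token-bucket limiter: refill at `rate` tokens/sec up to `burst`.

    This is just one of the traffic-shaping techniques mentioned above; the
    numbers you configure it with should come from your load test, not from
    this sketch.
    """

    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at the bucket size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should respond with a throttle (e.g. HTTP 429)

# Example: one bucket per customer, with limits assumed for illustration.
per_customer = TokenBucket(rate=25.0, burst=50.0)
if not per_customer.allow():
    pass  # throttle the request
```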

Okay, enough rambling. Here is a loose framework that will get you the data points you need.

Step 1: Set the minimum number of hosts to 10 (1 onepod + 9 fleet). It will make your math easy. This isn't strictly required, but 1 host isn't a large enough sample size and 100 hosts is too expensive - I find 10 is a nice compromise. Also, make sure your configuration matches your actual environment: if your prod runs across 3 AZs, then your load test should not run within 1 AZ.

Step 2: Run a static TPS generator to fill out a chart like the one below. The key point here is a static TPS generator: you want to go from 0 to 100 TPS immediately, without a ramp. If it ramps (a really fast ramp is okay), it will skew your numbers and your analysis. You are primarily looking for two pairs of adjacent rows - the pair where scaling first kicks in and the pair where availability starts to drop - that are no more than, say, 50 TPS apart. Other steps can be as large as you like. These tests do not need to be long; I find 10-15 minutes is generally enough time for something to break if it was going to. Hint: if 100 TPS generates 20% CPU, then 200 TPS will probably generate 40% CPU. This does not need to be a dense table; a few rows will tell you a lot about your system.

| TPS | Number of hosts | Availability % (over duration) | p50 latency | p90 latency | Avg CPU % (scaling criterion) | Scaling? | Picture |
|-----|-----------------|--------------------------------|-------------|-------------|-------------------------------|----------|---------|
| 100 | 10              | 100                            | 90          | 100         | 20                            | No       |         |
| 250 | 10              | 100                            | 90          | 100         | 45                            | No       |         |
| 275 | 10              | 100                            | 90          | 100         | 55                            | Yes      |         |
| 500 | 10              | 100                            | 90          | 100         | 85                            | Yes      |         |
| 525 | 10              | 98                             | 120         | 100         | 90                            | Yes      |         |

We learn that we can safely handle 250 TPS on 10 hosts (25 TPS per host) without scaling. We know that handling 500 TPS on 10 hosts (50 TPS per host) is okay, but 525 TPS (about 52 TPS per host) is not.
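If you want to roll your own static generator to produce rows like these, something like the following sketch is enough for small tests. It assumes a hypothetical call_service() function standing in for whatever client call you are exercising; a dedicated load tool or a small worker fleet scales better, but the important part is the open-loop, no-ramp shape.

```python
import threading
import time

def run_static_load(call_service, tps: float, duration_s: float) -> None:
    """Fire `call_service` at a constant rate with no ramp-up.

    `call_service` is a placeholder for whatever client call you are testing.
    Each request runs on its own thread so slow responses don't drag the send
    rate down (which would quietly turn this into a closed-loop test).
    """
    interval = 1.0 / tps
    deadline = time.monotonic() + duration_s
    next_send = time.monotonic()
    while time.monotonic() < deadline:
        threading.Thread(target=call_service, daemon=True).start()
        next_send += interval
        sleep_for = next_send - time.monotonic()
        if sleep_for > 0:
            time.sleep(sleep_for)

# Example: hold each static step for 10-15 minutes while you watch the dashboards.
# run_static_load(my_client_call, tps=250, duration_s=15 * 60)
```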

Step 3: Measure how long it takes from auto-scaling triggering a new host until that new host is serving traffic. Let's assume 10 minutes for the purposes of this test.
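A rough way to measure that lag is to poll however you count in-service hosts and time the gap, something like the sketch below. The count_in_service callable is a placeholder for your own ASG/ECS describe call or healthy-host metric; treat the whole thing as an assumption-laden sketch rather than the way to do it.

```python
import time

def measure_scale_out_lag(count_in_service, baseline: int, poll_s: float = 15.0) -> float:
    """Rough timer for Step 3: time from 'scaling triggered' until a new host serves.

    `count_in_service` is a placeholder for however you count serving hosts
    (an ASG/ECS describe call, a load-balancer healthy-host metric, etc.).
    Call this right after you push the load that trips the scaling alarm.
    """
    start = time.monotonic()
    while count_in_service() <= baseline:
        time.sleep(poll_s)
    return time.monotonic() - start

# Example (hypothetical helper): lag = measure_scale_out_lag(get_healthy_host_count, baseline=10)
# print(f"scale-out lag: {lag / 60:.1f} minutes")
```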

Step 4: Run a max-throughput test until you reach a peak (10,000 TPS) or something breaks. You can use whatever step sizes you like, but I prefer to step by the safety number (250 TPS from above) to simulate new clients onboarding. Here we found that the system breaks at 1,000 TPS by dropping availability below the desired threshold. This is our service max across all clients. Then run the same steps in reverse to replicate descaling and make sure it behaves as expected.

| Time | TPS  | Hosts | Availability % | p50 latency | p90 latency | Picture |
|------|------|-------|----------------|-------------|-------------|---------|
| 0:00 | 250  | 10    | 100            |             |             |         |
| 0:10 | 500  | 20    | 100            |             |             |         |
| 0:20 | 750  | 30    | 100            |             |             |         |
| 0:30 | 1000 | 40    | 98             |             |             |         |
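A sketch of that schedule, assuming a run_static_load-style generator like the Step 2 sketch, might look like the following; the step size and hold times mirror the example numbers above.

```python
def run_step_test(run_step, start_tps: int, step_tps: int, max_tps: int,
                  hold_s: float, reverse: bool = True) -> None:
    """Drive a stepped throughput test: hold each level, then move to the next.

    `run_step(tps, duration_s)` is any static generator, e.g. the Step 2
    sketch above. With reverse=True the same steps are replayed downward to
    check that descaling behaves as expected.
    """
    steps = list(range(start_tps, max_tps + 1, step_tps))
    if reverse:
        steps += steps[-2::-1]  # walk back down to replicate descaling
    for tps in steps:
        print(f"holding {tps} TPS for {hold_s:.0f}s")
        run_step(tps, hold_s)

# Example with the assumed numbers from the table above:
# run_step_test(lambda tps, d: run_static_load(my_client_call, tps, d),
#               start_tps=250, step_tps=250, max_tps=1000, hold_s=10 * 60)
```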

Step 5: Repeat for different sets of APIs. I find that GETs and PUTs/CREATEs tend to behave a little differently. If you have a batch API that is simply a scaled version of a PUT, there is no need to test it separately; just scale the limits appropriately.

Now you know a lot about your service! You know that it starts degrading at 1,000 TPS, so that should probably be your maximum TPS across all clients. If you start with a default of 6 hosts, you know that you can safely handle 25 TPS per host - so your baseline TPS is 150. If that isn't high enough, start with more hosts. The absolute worst-case scenario is your system sitting right below the scaling threshold (25 TPS/host) when a new client maxes out its limits (50 TPS/host). You normally want a little wiggle room there, so if we assume the 6-host default with a safety margin of 2, you can safely handle a new customer maxing out at ((50 - 25) TPS / 2 * 6 → 75 TPS). Obviously, once you have a lot more traffic and a floor dictated by scaling rather than by default hosts, you can adjust these parameters accordingly.
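Written out as code, the arithmetic from the example numbers above looks like this; the 6-host default and safety margin of 2 are the assumptions stated in the text, not universal values.

```python
# The same arithmetic as the paragraph above. Every number comes from the
# example measurements in the earlier steps (safe 25 TPS/host, max 50 TPS/host,
# service-wide break at 1,000 TPS) plus the assumed 6-host default fleet.

safe_tps_per_host = 25    # per-host TPS before scaling kicks in (Step 2)
max_tps_per_host = 50     # per-host TPS before availability drops (Step 2)
service_max_tps = 1_000   # where the whole service broke (Step 4)
default_hosts = 6         # assumed baseline fleet size
safety_margin = 2

# Traffic the default fleet absorbs before it even needs to scale.
baseline_tps = default_hosts * safe_tps_per_host  # 150 TPS

# Worst case: the fleet sits just below the scaling threshold and a brand-new
# client immediately maxes out its limit before new hosts arrive.
new_client_limit = (max_tps_per_host - safe_tps_per_host) / safety_margin * default_hosts  # 75 TPS

print(f"service-wide max: {service_max_tps} TPS")
print(f"baseline (no scaling) capacity: {baseline_tps} TPS")
print(f"safe limit for a brand-new client: {new_client_limit:.0f} TPS")
```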

After you have run a load test, you will probably want to repeat it with a few different settings. If you are running an RPC server, how many threads serve traffic? If you are on Fargate, what is your CPU/memory configuration? If you are using auto-scaling, how aggressive is your scale-out/scale-in policy? How many baseline hosts do you want? You can play with all of these configurations and start tuning not just the limits of your system, but its performance as well.

Your initial effort is a load test to establish these limits, but I recommend going further. Dig into the charts and see if you spot anything anomalous. Are the 500s correlated with nodes scaling out or scaling in? Are clients getting throttled way too early or way too late? Investigating the breakages can be a fun adventure. The top three things I see break people are CloudWatch limits (how you write metrics/logs to CloudWatch), logging itself consuming too much memory/disk and crashing the host, and hot partitions in DynamoDB.

Fun little story time. I have built a bunch of these APIs, but most were not low-latency, high-availability, high-impact APIs. I finally worked on one of those systems, and we had two clients at the start. Luckily they were internal Amazon customers, but they called us in the critical path of customer requests - so we were pretty much customer facing. The first service onboarded and a week or two went by; then the second service onboarded and another week or two went by. We had traffic and everything looked good. Then, on some random Friday afternoon, we got a ticket that our service was throwing a lot of 429s. When we investigated, sure enough, we were throttling service 2 really hard. It turns out they had launched a new feature (without telling us), got into an infinite retry storm, and blacked themselves out. That was fun because they actually managed to max out their DynamoDB capacity, so their ops tools didn't work, and they maxed out their Lambda concurrency, so they were running into problems with deployments. What happened to our service? Nothing. What happened to our other customer? Nothing. What happened to our on-call? We went out and laughed about it over a drink while our service continued to operate normally.
