Wednesday, November 10, 2010

Big load testing

I am a real fan of using the real client to do load testing. Your QA engineers will spend a lot of time building regression tests that verify the behavior of the game is still the same and no new bugs have been introduced. That entails adding scripting or player behavior "simulation" to the game code, but also includes creating the scripts that test the functionality of the game. Those test cases are really important and ideally cover almost all of the game functionality. And they have to be kept up to date as the code in the game changes.

Why not reuse all that work to help load test the server? The scaffolding, client hooks, and test cases?

One of my favorite ways of doing this is to have a test driver that picks random test cases and throws them at the server as fast as possible. Even if the test case involves sleeping or waiting for something like the character walking across some area of the game, if you run enough of them at the same time, you can generate significant load. And it is going to be more realistic than any other kind of test prior to having zillions of real players. It also saves you from having to reproduce the protocol and behavior of the real client and maintain it as the game team evolves everything.
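To make that concrete, here is a minimal sketch of such a driver in Python. It assumes each regression test case can be wrapped as a plain callable; `walk_across_zone` and `buy_item` are hypothetical stand-ins, not real test cases, and a thread pool is just one way to get many cases in flight at once.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def run_random_tests(test_cases, total_runs, concurrency, seed=None):
    """Run randomly chosen test cases concurrently.

    Returns a list of (case_name, succeeded, elapsed_seconds) tuples.
    """
    rng = random.Random(seed)
    # Pick the whole schedule up front so a given seed reproduces the same mix.
    schedule = [rng.choice(test_cases) for _ in range(total_runs)]

    def run_one(case):
        start = time.perf_counter()
        try:
            case()
            ok = True
        except Exception:
            ok = False  # a failing case is a result, not a reason to stop the run
        return case.__name__, ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(run_one, schedule))


# Hypothetical stand-ins for QA regression cases.
def walk_across_zone():
    time.sleep(0.01)  # simulates waiting for the character to cross an area


def buy_item():
    time.sleep(0.005)


if __name__ == "__main__":
    results = run_random_tests([walk_across_zone, buy_item],
                               total_runs=200, concurrency=50, seed=1)
    failures = sum(1 for _, ok, _ in results if not ok)
    print(f"{len(results)} runs, {failures} failures")
```

Because most cases spend their time sleeping or waiting, the concurrency number can be far higher than the core count of the box running the driver.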

Why not? The main objection is cost. Even if you take the time to make a headless version of the client, it is probably going to be so resource heavy that you will have trouble finding enough machinery to really ramp up. Most games are designed to tick as fast as possible to give the best framerate, but a headless client doesn't draw, so that is a waste of CPU. Some games rely on the timing intrinsic in animations to control walk speed or action/reaction times for interactions, yet you want to strip out as much content as possible to save memory. Clearly there is a bunch of work needed to reduce the footprint of even a headless client. But they really are useful.

One thing you can do to make them more useful is construct a mini server cluster and see how it stands up to as many clients as you can scavenge.

You can get hold of more hardware than you might think by "borrowing" it at night from the corporate pool of workstations. You will need permission, and you will want foolproof packaging so your clients can be installed (and auto-updated) without manual intervention or a sophisticated user. There is nothing like a robot army to bring your server to its knees. IT doesn't like this idea very much because they like to use night time network bandwidth for backups and the like.

Another important trick is to observe the *slope* of performance changes relative to the change in load you throw at the server. If the marginal effect (incremental server load divided by incremental client load) is > 1 you have a problem. Some people call this non-linear or non-scalable performance. Although, to be technical, it is non-unitary. Non-linear means it is even worse than y = a*x + b, e.g. polynomial (y = x^2) or exponential (y = a^x). Generally you can find the low hanging fruit pretty easily. If the first 500 connected clients caused a memory increase of 100 MB, but the second 500 consumed 200 MB, you have a problem. Obviously this applies to CPU, bandwidth and latency too. And don't forget to observe DB latency as you crank up both the number of clients and the amount of data already in the DB. You may have forgotten an index.
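One way to read the slope check is to compare the per-client cost of each load step against the step before it. The sketch below does that in Python; the function names and the `tolerance` threshold are illustrative assumptions, and the sample numbers are the memory figures from the paragraph above.

```python
def marginal_costs(samples):
    """samples: (client_count, resource_used) pairs sorted by client_count.

    Returns (from_clients, to_clients, cost_per_added_client) for each step.
    """
    steps = []
    for (c0, r0), (c1, r1) in zip(samples, samples[1:]):
        steps.append((c0, c1, (r1 - r0) / (c1 - c0)))
    return steps


def flag_growing_cost(samples, tolerance=1.25):
    """Return the steps whose per-client cost grew by more than `tolerance`
    times the previous step's cost (the threshold is an arbitrary assumption)."""
    steps = marginal_costs(samples)
    return [cur for prev, cur in zip(steps, steps[1:])
            if prev[2] > 0 and cur[2] > prev[2] * tolerance]


# Memory in MB: the first 500 clients cost 100 MB, the second 500 cost 200 MB.
memory = [(0, 0), (500, 100), (1000, 300)]
print(marginal_costs(memory))
print(flag_growing_cost(memory))
```

The same two functions work for CPU, bandwidth, or DB latency; only the meaning of `resource_used` changes.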

But you may still not have enough insight, even given all this. The next step could be what I call a light-weight replay client or a wedge-client. The idea is to instrument the headless client or the graphical client and record the parameters being passed into key functions like message send or web service calls. You are inserting a wedge between the game code and the message passing code. The real client can then be used to create a log of all the interesting data that is needed to stress the server. You would then create a replay client that uses only the lower level libraries. It would read the logs, passing the recorded parameters into a generic function that reproduces the message traffic and web requests. It doesn't have to understand what it is doing. The next step is to replace the values of key parameters to simulate a variety of players. You could use random player ids, or spend some more time having the replay client understand the sequences of logs and server responses. E.g. it could copy a session ID from a server response into all further requests.
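A rough sketch of the wedge idea in Python, under several assumptions: the key send functions take keyword arguments that serialize to JSON, the log is one JSON object per line, and the low-level sender returns a dict-like response. The names `wedge`, `replay`, `send_login`, `send_move`, and `fake_send` are all hypothetical.

```python
import functools
import io
import json


def wedge(log, name):
    """Decorator: record each call's keyword arguments as one JSON line,
    then call through to the real function."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(**params):  # keyword-only keeps the log self-describing
            log.write(json.dumps({"call": name, "params": params}) + "\n")
            return fn(**params)
        return wrapper
    return decorate


def replay(log_lines, send, overrides=None, carry=("session_id",)):
    """Re-issue recorded calls through `send(call, params)`.

    `overrides` replaces recorded parameter values (e.g. a new player id);
    keys listed in `carry` are copied from responses into later requests.
    """
    state = dict(overrides or {})
    responses = []
    for line in log_lines:
        record = json.loads(line)
        params = record["params"]
        params.update({k: v for k, v in state.items() if k in params})
        response = send(record["call"], params) or {}
        for key in carry:
            if key in response:
                state[key] = response[key]
        responses.append(response)
    return responses


# Recording side: the real client runs with wedged send functions.
log = io.StringIO()

@wedge(log, "login")
def send_login(player_id):
    return {"session_id": "real-session"}

@wedge(log, "move")
def send_move(session_id, x, y):
    return {}

send_login(player_id="alice")
send_move(session_id="real-session", x=1, y=2)

# Replay side: a stand-in for the low-level library that talks to the server.
sent = []

def fake_send(call, params):
    sent.append((call, dict(params)))
    return {"session_id": "replay-7"} if call == "login" else {}

replay(log.getvalue().splitlines(), fake_send, overrides={"player_id": "bot-42"})
print(sent)
```

The replay side never learns what a login or a move means. It only knows which parameter names to overwrite and which response fields to carry forward.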

Since you are wedging into existing source code, this approach is way easier than doing a network level recording and playback. That would require writing packet parsing code, and creating a state machine to try to simulate what the real client was doing. Very messy.

You might still not be able to replay enough load. Perhaps you don't have enough front end boxes purchased yet, but you want to stress your core server: the DB or event processing system. We use a JMS bus (it is great for publish/subscribe semantics that allow for loose coupling between components) to tie most things together on the back end. We built a record/replay system that pulls apart the JMS messages and does parameter replacement much like the wedge client described above. It is pretty simple to simulate thousands of players banging away. Not every client event results in a back end event that affects the DB.

So what we are planning on doing is:
a) build a mini-cluster with just a few front end boxes
b) use QA's regression test cases to drive them to their knees looking for bad marginal resource usage
c) use wedge recordings and replay if needed for even more load on the front end boxes
d) use the JMS message replay system to drive the event system and DB to their knees, also looking for bad marginal usage
e) do some shady arithmetic to convince ourselves that the simulated client count that resulted in X% utilization of our test cluster will allow us to get to our target client count in the remaining 100-X% utilization available and the new hardware we plan to have in production.
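The arithmetic in step e) might look something like the sketch below. It assumes load scales linearly with client count (which is exactly what the marginal usage checks are meant to establish) and that production capacity can be expressed as a simple multiple of the test cluster. The function name, parameters, and numbers are made up for illustration.

```python
def projected_clients(sim_clients, utilization_pct,
                      target_utilization_pct=100.0, hardware_ratio=1.0):
    """Linear extrapolation from a test-cluster measurement.

    sim_clients            simulated clients that produced utilization_pct
    utilization_pct        measured utilization of the test cluster (0-100]
    target_utilization_pct utilization you are willing to run at in production
    hardware_ratio         production capacity divided by test-cluster capacity
    """
    if not 0 < utilization_pct <= 100:
        raise ValueError("utilization_pct must be in (0, 100]")
    clients_per_pct = sim_clients / utilization_pct
    return clients_per_pct * target_utilization_pct * hardware_ratio


# 2000 simulated clients drove the test cluster to 40%; production is planned
# at 4x the hardware and we are willing to run it at 80%.
print(projected_clients(2000, 40, target_utilization_pct=80, hardware_ratio=4))
```

If the slope checks turned up any resource whose per-client cost grows with load, this estimate is optimistic and should be treated as an upper bound.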


  1. It's always an interesting topic. The more you can test before even a beta, the better. For c), what about using the Microsoft journaling hook to record and play back commands? I worked on it (for another kind of test on Windows), and it could be a good, less intrusive option.

  3. Hi Darrin. I'm glad to see you're keeping your sleeves rolled up. I have a soft spot for load testing as I helped develop large scale load testing products before I got into gaming. A couple of observations FWIW:

    You've effectively reestablished the case that developers should "build testability" into their client/server apps with the particular needs of performance testing in mind -- not just for unit testing. Sadly this message seems all too neglected in the game development community at large.

    For example, by going beyond even headless clients to componentizing the code that provides the remote messaging interface to the game logic so that it can be easily reattached to a separate multithreaded test driver. This "simplifies" the record/replay challenge to one of accurate client state simulation, which is difficult enough without having to worry about the unstable semantics of messaging. This was the technique that we found to yield the highest simulated user count when combined with our distributed logic driver. The problem of course is that this idealistic approach requires commitment early in the design process of the client(/engine) and a lot of discipline to maintain over its lifetime. Neither of which may be applicable to the given title developer.

    You also correctly called out the messy nature of network level record/replay. Our product could do that with a number of published and even some unpublished protocols but it was only effective with established ones. The problem being that between the L3 network protocol state and the business logic state of an application there usually exists some murky application layer protocol state that must also be accurately simulated. Shadowy things like session establishment, key exchange, health monitoring, etc. Aspects of an app that tend to be poorly understood even by the app developers much less the overstretched QA engineer.

    But there is an alternative to both network level record/replay and the almost-as-undesirable source code instrumentation option. Check out a product named Aprobe by OC Systems. It allows you to symbolically inject custom code (probes) into a running executable that then provides runtime hooks to essentially do whatever you want. So for example you could define probes that trap your high level messaging functions to track relevant game state changes for recording purposes, and then directly drive those functions to simulate alternate client logic. With some imagination one could even envision a test harness that integrates a scripting engine (ahem) so that testers could have script access to the client back-end APIs without altering a single line of client source. They also have a similar tool for JVMs which might be useful on the server back end.