Online Game Techniques
Small and large scale online game techniques. Parallel and distributed systems. Technical philosophy. Some software engineering.
By Darrin West.

2019-06-06: Spring Boot Applications in Kubernetes<div class="tr_bq">
<h2>
Goals</h2>
Spring Boot is a very convenient framework for developing Java applications. It enables lots of useful features simply by your mentioning a dependency in your Gradle or Maven build files. </div>
<br />
Building the application into a container and running the image in Docker gives you a very consistent experience in development, testing, and live. It also solves a lot of configuration and networking issues that used to require custom scripting, which usually broke *during* a maintenance window.<br />
<br />
Running a distributed system (like a game server) requires deployment and monitoring (together called orchestration). The best solution for that right now is Kubernetes (K8S). It uses containers to run the app, and watches their health. It can restart failed instances, and can dynamically and automatically scale up new instances based on load. When integrated with a cloud provider like Google or AWS, it can also spin up new servers (and you get charged a bit more, or a bit less when they scale down again). Another awesome benefit of K8S is that you can redeploy your whole server without a maintenance window. It will shut down an old Pod and spin up a replacement with your new build. There are challenges to doing that in some cases that I'll talk through in a later post.<br />
<br />
But there are a few tricks I've had to discover to make all this work well together:<br />
<br />
<ul>
<li>Have the Docker command use variables so you can control things like startup memory usage.</li>
<li>Have the JVM catch a signal and pass it to the application so it can shut down cleanly when K8S decides to scale down or replace a Pod. K8S goes through a two-step process: it sends SIGTERM first, then waits, and sends SIGKILL if the app doesn't shut down on its own.</li>
<li>Get Spring Boot configuration into the application, and be able to override it with cluster-specific settings from a K8S ConfigMap.</li>
</ul>
<div>
<br /></div>
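To make the graceful-shutdown point concrete: the JVM turns the termination signal from K8S into running registered shutdown hooks, and Spring Boot installs a hook that closes the application context (which fires your @PreDestroy methods). Here is a hand-rolled sketch of that mechanism; the class and method names are mine, not from any framework:

```java
// Sketch: how a JVM process can react to K8S's termination signal.
// The JVM runs shutdown hooks when it receives SIGTERM, so cleanup placed
// in a hook runs during K8S's grace period, before SIGKILL arrives.
public class GracefulShutdown {
    private volatile boolean accepting = true;

    // Called from the shutdown hook: stop taking new work, let in-flight work drain.
    public void stopAccepting() { accepting = false; }

    public boolean isAccepting() { return accepting; }

    // Register the hook once at startup.
    public void install() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::stopAccepting));
    }

    public static void main(String[] args) {
        GracefulShutdown gs = new GracefulShutdown();
        gs.install();
        // Simulate what the hook does when SIGTERM arrives:
        gs.stopAccepting();
        System.out.println("accepting=" + gs.isAccepting());
    }
}
```

In a Spring Boot app you rarely need to write this yourself; closing the context on the JVM's hook is the default behavior, and this sketch just shows where your drain logic would sit.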
<h2>
Config Printing</h2>
<div>
<span style="font-family: "times" , "times new roman" , serif;">The first challenge is *knowing* what configuration your app started with. There is no turn-key solution to this. So I include a little code that queries Spring and logs the config that was used to start the app. Very useful for debugging deployments. I also expose it to JMX (not shown):</span></div>
<div>
</div>
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"> <span style="color: #555555; font-weight: bold;">@Autowired</span>
ConfigurableEnvironment env<span style="color: #333333;">;</span>
<span style="color: #333333;">...</span>
MutablePropertySources sources <span style="color: #333333;">=</span> env<span style="color: #333333;">.</span><span style="color: #0000cc;">getPropertySources</span><span style="color: #333333;">();</span>
<span style="color: #888888;">// Find the name of every property...</span>
Set<span style="color: #333333;"><</span>String<span style="color: #333333;">></span> uniqueNames <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> HashSet<span style="color: #333333;"><>();</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span>PropertySource<span style="color: #333333;"><?></span> source <span style="color: #333333;">:</span> sources<span style="color: #333333;">){</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>source <span style="color: #008800; font-weight: bold;">instanceof</span> EnumerablePropertySource<span style="color: #333333;">){</span>
uniqueNames<span style="color: #333333;">.</span><span style="color: #0000cc;">addAll</span><span style="color: #333333;">(</span>Arrays<span style="color: #333333;">.</span><span style="color: #0000cc;">asList</span><span style="color: #333333;">(((</span>EnumerablePropertySource<span style="color: #333333;">)</span> source<span style="color: #333333;">).</span><span style="color: #0000cc;">getPropertyNames</span><span style="color: #333333;">()));</span>
<span style="color: #333333;">}</span>
<span style="color: #333333;">}</span>
<span style="color: #888888;">// Use a TreeMap so the output is sorted.</span>
TreeMap<span style="color: #333333;"><</span>String<span style="color: #333333;">,</span> String<span style="color: #333333;">></span> sortedProps <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> TreeMap<span style="color: #333333;"><>();</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span>String name <span style="color: #333333;">:</span> uniqueNames<span style="color: #333333;">){</span>
<span style="color: #888888;">// Read the property value, using Spring's resolution rules.</span>
String value <span style="color: #333333;">=</span> env<span style="color: #333333;">.</span><span style="color: #0000cc;">getProperty</span><span style="color: #333333;">(</span>name<span style="color: #333333;">);</span>
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #333333;">(</span>name<span style="color: #333333;">.</span><span style="color: #0000cc;">toLowerCase</span><span style="color: #333333;">().</span><span style="color: #0000cc;">contains</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"pass"</span><span style="color: #333333;">)</span> <span style="color: #333333;">||</span> name<span style="color: #333333;">.</span><span style="color: #0000cc;">toLowerCase</span><span style="color: #333333;">().</span><span style="color: #0000cc;">contains</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"secret"</span><span style="color: #333333;">)){</span>
value <span style="color: #333333;">=</span> <span style="background-color: #fff0f0;">"****"</span><span style="color: #333333;">;</span> <span style="color: #888888;">// Don't show passwords.</span>
<span style="color: #333333;">}</span>
sortedProps<span style="color: #333333;">.</span><span style="color: #0000cc;">put</span><span style="color: #333333;">(</span>name<span style="color: #333333;">,</span> value<span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
StringBuilder sb <span style="color: #333333;">=</span> <span style="color: #008800; font-weight: bold;">new</span> StringBuilder<span style="color: #333333;">();</span>
sb<span style="color: #333333;">.</span><span style="color: #0000cc;">append</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"# Merged application properties.\n"</span><span style="color: #333333;">);</span>
<span style="color: #888888;">// TODO: this doesn't really preserve the "file" exactly. E.g. special characters, continuations, comments, ...</span>
<span style="color: #008800; font-weight: bold;">for</span> <span style="color: #333333;">(</span>Map<span style="color: #333333;">.</span><span style="color: #0000cc;">Entry</span><span style="color: #333333;"><</span>String<span style="color: #333333;">,</span> String<span style="color: #333333;">></span> el <span style="color: #333333;">:</span> sortedProps<span style="color: #333333;">.</span><span style="color: #0000cc;">entrySet</span><span style="color: #333333;">()){</span>
sb<span style="color: #333333;">.</span><span style="color: #0000cc;">append</span><span style="color: #333333;">(</span>el<span style="color: #333333;">.</span><span style="color: #0000cc;">getKey</span><span style="color: #333333;">()).</span><span style="color: #0000cc;">append</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"\t="</span><span style="color: #333333;">).</span><span style="color: #0000cc;">append</span><span style="color: #333333;">(</span>el<span style="color: #333333;">.</span><span style="color: #0000cc;">getValue</span><span style="color: #333333;">()).</span><span style="color: #0000cc;">append</span><span style="color: #333333;">(</span><span style="background-color: #fff0f0;">"\n"</span><span style="color: #333333;">);</span>
<span style="color: #333333;">}</span>
<span style="color: #888888;">// String res = "EnvVars: " + envVars.toString() + " SystemProperties: " + systemProperties.toString() + " AppProperties: " + os.toString();</span>
String res <span style="color: #333333;">=</span> sb<span style="color: #333333;">.</span><span style="color: #0000cc;">toString</span><span style="color: #333333;">();</span>
</pre>
</div>
<br />
This grabs all the configuration names, then uses Spring's configuration system to look up the current value of each (based on its <a href="https://docs.spring.io/spring-boot/docs/current/reference/html/boot-features-external-config.html">External Config</a> priorities). It also hides passwords, because you shouldn't log such things.<br />
<br />
<h2>
Config Override</h2>
This is an example ConfigMap definition:<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;"># kubectl apply -f server-config.yml</span>
apiVersion: v1
kind: ConfigMap
metadata:
  name: server-config
data:
  server.port: <span style="background-color: #fff0f0;">"8090"</span>
  api.server.port: <span style="background-color: #fff0f0;">"8090"</span>
  management.server.port: <span style="background-color: #fff0f0;">"8090"</span>
  logging.file: protoserver.log
  logging.level: INFO
  logging.level.com.protag.protoserver: INFO
  <span style="color: #888888;"># The environment variable in the Dockerfile that controls -Xmx${HEAP_SIZE} when the JVM starts in the container.</span>
  HEAP_SIZE: 2400m
</pre>
</div>
<br />
<h2>
Deployment</h2>
<br />
<span style="font-family: "times" , "times new roman" , serif;">This is an example deployment that picks up that configMap:</span><br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;"># kubectl apply -f server-deployment.yml</span>
<span style="color: #888888;"># With "parameter" substitution:</span>
<span style="color: #888888;"># sed -s 's/BUILD_TAG/99/g' &lt; server-deployment.yml | kubectl apply -f -</span>
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: server
spec:
  replicas: 2
  template:
    metadata:
      labels:
        configmap-version: v8 <span style="color: #888888;"># Fake label to force a rolling update upon configMap change. Could use its md5 hash too.</span>
    spec:
      containers:
      - name: server
        image: localhost:5000/ourorg/server:BUILD_TAG
        <span style="color: #888888;"># Pull config data in as environment variables, where Spring Boot will merge them with application.properties.</span>
        <span style="color: #888888;"># There is currently no way to auto-redeploy the pod when this config changes. Some people hash the config.yml</span>
        <span style="color: #888888;"># file, and sed the hash into the deployment.yml, which is recognized as a big enough change to trigger</span>
        <span style="color: #888888;"># a RollingUpdate. Sigh.</span>
        envFrom:
        - configMapRef:
            name: server-config
        <span style="color: #888888;"># Make sure to only use "bash" where needed. "sh" strips dotted env vars, which spoils application.properties overrides</span>
        <span style="color: #888888;"># coming through configMaps, and "env:" settings.</span>
        ports:
        <span style="color: #888888;"># the service</span>
        - containerPort: 8090
          name: web
        - containerPort: 9010
          name: jmx
</pre>
</div>
<br />
This uses envFrom to pull the ConfigMap above into the environment variables of the container. When Spring Boot starts up, those override what is in application.properties.<br /><br />One odd thing here is that "sh" will discard variable names that contain a dot. So use "bash", as you will see in the Dockerfile below.<br /><br /><div>
The label configmap-version is used to force a rolling update when I change the configuration: I update that label and re-apply the deployment, and K8S then restarts each pod so it picks up the new config.<br /><br />The string BUILD_TAG is used to ensure that the image produced by Jenkins (or whatever your build system is) is the actual one that K8S pulls. Using LATEST is not reliable, and with an explicit tag you can see exactly which image was used when running "kubectl describe".</div>
<div>
<h2>
Dockerfile</h2>
This is an example Dockerfile, used to create an image containing your Spring Boot app:<div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #888888;"># Creates ourorg/server</span>
FROM openjdk:11-jre-stretch
# May need a new base image. I don't see it on dockerhub anymore (only a few weeks later)
# Pick up variable passed in from gradle.build file
ARG JAR_FILE
ENV JAR_FILE=${JAR_FILE}
# JVM memory. Default is 128m. Make this less than the resource request in statefulSet.yml
ARG HEAP_SIZE=900m
ENV HEAP_SIZE=${HEAP_SIZE}
# We want the application.properties file to be loose, so it can be seen and edited more easily.
COPY ${JAR_FILE} application.properties /app/
# Run in /app to pick up application.properties, and drop logs there.
WORKDIR /app
# App and jmx ports
EXPOSE 8090/tcp 9010/tcp
# Let the TERM signal hit the java app directly (by exec'ing over the initial shell), but still use a shell to
# expand variables. SIGTERM is what K8S sends to trigger graceful shutdown.
# Don't use "sh"; it strips dotted env vars, and you'd lose the configMap settings that override application.properties.
# Note: since "exec" replaces the shell, nothing after the java command would ever run, so a trailing
# "&& echo ... > /dev/termination-log" would be dead code; use a trap in a wrapper script for that instead.
ENTRYPOINT [ "bash", "-c", \
"exec java -Xmx${HEAP_SIZE} \
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9010 -Dcom.sun.management.jmxremote.local.only=false \
-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false \
-jar ${JAR_FILE}"]
</pre>
</div>
<br />
This Dockerfile allows the build system to specify the jar name and the starting Java heap size. The way we use these variables allows them to be overridden by K8S directly, or in a ConfigMap. Using variables in an ENTRYPOINT is not normally possible in Docker because the exec form doesn't use a shell. Here, however, we explicitly run the command using "bash" as the shell. This allows the dotted environment variables to be passed through.<br /><br />The reason for using "exec" in the ENTRYPOINT is so that the JVM is the process that receives the TERM and KILL signals from K8S. Without that, the wrapping shell would not pass the TERM signal to our application, and that signal is what tells Spring Boot to start shutting down gracefully.<br /><br /><h2>
Summary</h2>
<br />Well, I think that covers all my tricks for getting a Spring Boot app to work properly in Docker and Kubernetes. Let me know if you have questions, or trouble getting this working as I suggest. Or if there are other difficulties you've hit that you'd like help with.<br /></div>
</div>
Darrin West

2015-05-26: My new company: Combine Games; My new game: Rails and Riches

Thought I'd share the news, if you hadn't heard. A few friends and I just started a new company called Combine Games, based in the Seattle area. We are running a Kickstarter to fund a game called Rails and Riches. You can think of it as a successor to Railroad Tycoon, because Franz Felsl, the designer and artist on Railroad Tycoon 2 and 3, is our lead designer and CCO. You can find us at <a href="http://www.railsandriches.com/">http://www.railsandriches.com/</a>, or on Facebook: <a href="https://www.facebook.com/RailsAndRiches">https://www.facebook.com/RailsAndRiches</a>. I just posted a long blog over there: <a href="http://www.railsandriches.com/blog/view/6">http://www.railsandriches.com/blog/view/6</a>. Come make an account, and share or like us on Facebook. And keep an ear out for when the Kickstarter goes live.

Darrin West

2015-02-17: Thread pools, event dispatching, and avoiding synchronization

Writing ad hoc threaded code is almost always more difficult than it is worth. You might be able to get it to work, but the next person in your code will probably break it by accident, or find it so hard to deal with that they will want to rewrite it. So you want some kind of simplifying policy for how to develop threaded code. It needs to perform well (the primary reason for doing threading in the first place, other than dealing with blocking), require little mental overhead or boilerplate code, and ideally be usable by more junior developers than yourself.<br />
<br />
Functional parallelism only extracts a small fraction of available performance, at least in larger, "interesting" problems. Data parallelism is often better at extracting more performance. In the case of large online game servers, the ideal thing to parallelize across are game entities. There are other types of entities that we can generalize to, as well, once this is working.<br />
<br />
Here's what I've come up with...<br />
<br />
Generally, you want to avoid ticking entities as fast as possible. While this makes a certain amount of sense on the client (one tick per graphics frame) to provide the smoothest visual experience, on the server it will consume 100% of the CPU without significantly improving the player's experience. One big downside is that you can't tell when you need to add more CPUs to your cluster. So I prefer event driven models of computing. Schedule an event to occur when there is something necessary to do. In some cases, you may want to have periodic events, but decide what rate is appropriate, and schedule only those. In this way, you can easily see the intrinsic load on a machine rise as more work (entities) are being handled. So the core assumption is: we have a collection of entities, handling a stream of events.<br />
<br />
We built an event dispatcher. It is responsible for distributing events (that usually arrive as messages) to their target entities. I've discussed a number of means to route those messages and events in other posts. When an event is available, the system has the entity consume the event, and runs an event handler designed for that type of event.<br />
<br />
You don't want to spawn a thread per entity. There could be thousands of entities per processor, and that would cause inefficient context switching, and bloat memory use for all those threads sitting around when only a few can run at a time. You also don't want to create and destroy threads every time an entity is created or destroyed. Instead, you want to be in control of the number of threads spawned regardless of the workload. That allows you to tune the number of threads for maximum performance and adjust to the current state of affairs in the data center. You create a thread pool ahead of time, then map the work to those threads.<br />
<br />
It would be easy enough to use an Executor (java), and Runnables. But this leads to an unwanted problem. If there are multiple events scheduled for a single entity, this naive arrangement might map two events for the same target to different threads. Consequently, you would have to put in a full set of concurrency controls, guarding all data structures that were shared, including the entity itself. This would counteract the effect of running on two threads, and make the arrangement worthless.<br />
<br />
Instead, what I did was to create a separate event queue per entity, and make a custom event dispatcher that pulled only a single event off the queue. The Runnable and Executor in this better setup manage entities (ones with one or more events on their private queues). Any entity can be chosen to run, but each entity will only work on a single event at a time. This makes more efficient use of the threads in the pool (no blocking between them), and obviates the need for concurrency control (other than in the queues). Event handlers can now be written as if they were single-threaded. This makes game logic development easier for more junior programmers, and for those less familiar with server code or with Java (e.g. client programmers).<br />
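Here is a minimal Java sketch of that arrangement (class and method names are mine, not from the actual system): each entity owns a private queue, and at most one Runnable per entity is ever on the shared pool, so each handler runs one event at a time.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// One event queue per entity, dispatched over a shared thread pool, with at
// most one event per entity in flight at a time. Handlers can therefore be
// written as if single-threaded; only the queue itself is concurrent.
public class SerialEntity<E> {
    private final Queue<E> events = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean scheduled = new AtomicBoolean(false);
    private final Executor pool;
    private final Consumer<E> handler;

    public SerialEntity(Executor pool, Consumer<E> handler) {
        this.pool = pool;
        this.handler = handler;
    }

    // Safe to call from any thread, e.g. the message router.
    public void post(E event) {
        events.add(event);
        trySchedule();
    }

    private void trySchedule() {
        // Only one Runnable per entity may be queued on the pool at a time.
        if (scheduled.compareAndSet(false, true)) {
            pool.execute(this::drainOne);
        }
    }

    private void drainOne() {
        try {
            E event = events.poll();
            if (event != null) {
                handler.accept(event); // single-threaded per entity, no locks needed
            }
        } finally {
            scheduled.set(false);
            // An event may have arrived while we were running; re-check and reschedule.
            if (!events.isEmpty()) {
                trySchedule();
            }
        }
    }
}
```

Because each Runnable handles exactly one event before releasing the entity, per-entity ordering is preserved while the pool stays free to run other entities in parallel.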
<br />
Obviously, this applies to data that is owned by a single entity. Anything shared between entities still needs to be guarded. In general, you want to avoid that kind of sharing between entities, especially if you have a large-scale game where entities might be mapped across multiple machines. You can't share data between two entities if they are in different processes on different machines (well, there is one way, but it requires fancy hardware or device drivers; that is another story). So you will be building for the distributed case anyway. Why not do it the same way when the entities are local?<br />
<br />
I found a number of posts about executors and event dispatching that touched on these ideas, but there was nothing official, and there seemed to be a lot of debate about good ways to do it. I'm here to say it worked great. I'd love to post the code for this. Again, maybe when Quantum is available, you'll see some of this.<br />
<br />
<br />
<br />Darrin West

2015-02-16: Dynamic Serialization of Messages

Call it message marshalling, streaming, or serialization: you'll have to convert from a message data structure to a series of bytes and back if you want to send a message between two processes. You might cast the message pointer to a byte pointer, take the size, and slam the bytes into the socket, but then you won't be able to deal with different CPU architectures (byte ordering), there may be hidden padding between fields, or unwanted fields, and there may be tighter representations on the wire than you have in memory. There are many reasons to have some kind of serialization system. The following version has some of the best trade-offs I've come across. It was partially inspired by simple JSON parsers like LitJSON, and partially by a library called MessagePack, both of which discover the fields of a data structure automatically.<br />
<br />
So I kind of hate the boilerplate and manual drudgery of ToStream/FromStream functions added to each message subclass. It always seemed like there should be an automatic way of implementing that code. Google Protocol Buffers, Thrift, and others make you specify your data structures in a separate language, then run a compiler to generate the message classes and serialization code. That always seemed clumsy to me, and was extra work: excluding the generated files from source control, more custom stuff in your Maven, Makefile, or MSBuild project files. Plus I always think of the messages as *my* code, not something you generate. These are personal preferences that led to the energy needed to come up with the idea, not necessarily full justification for what resulted. In the end, the continued justification is that it is super easy to maintain, and has a very desirable side effect of fixing a long-standing problem of mismatched protocols between client and server. So here's the outline (maybe I'll post the code as part of the Quantum system some day).<br />
<br />
We have a large number of message classes, but are lazy, and don't want to write serialization code, and we always had bugs where the server's version of a class didn't match the client's. The server is in Java, and the client is in C#. Maybe the byte orders of the client and server differ. Ideally, we could just say Send(m), where m is a reference to any message object, and the message would be sent. Here's how:<br />
- Use introspection (Java calls it reflection) to determine the most derived class of m. If you have a reflection-based serializer constructed for that type, use it; else create one.<br />
- To create one, walk each field of the type and construct a ReaderWriter instance for that field's type, appending it to a list for the message type. Do this recursively. ReaderWriter classes are created for each atomic type, and can be created for any custom type (like lists, dictionaries, or application classes that you think you can serialize more efficiently). Cache the result so you only do this once. You may want to do this for all message types ahead of time, but that is optional; you could do it on the first call to Send(m).<br />
- As you send each message, find its serializer and hand the message instance in. The system will walk the list of ReaderWriters and serialize the whole message. To make this work, the ReaderWriter classes must use reflection to access the field (reading during send, writing during message arrival). This is pretty easy to do in Java and C#, or in interpreted languages like Python, Lua, and Ruby. C++ would be a special case where code generation or template and macro tricks would be needed. Or good old-fashioned boilerplate. Sigh.<br />
- As a message arrives, you allocate an instance of the type (again, using reflection), then look up the serializer, and fill it in from the byte buffer coming in.<br />
<br />
This works fine across languages and architectures. There is a small performance hit in reading or setting the fields using reflection, but it is not too bad, since you only scan the type once at startup, and you keep the accessor classes around so you don't have to recreate anything for each message.<br />
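As a concrete illustration of the scheme (a stripped-down sketch, not the original code), here is a Java version that walks fields reflectively, sorts them alphabetically ignoring case as described later, caches the field list per class, and handles only int, long, and String fields. All class and field names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Reflection-driven serializer sketch. The field list is built once per class
// and cached. Note: the cache here is not thread-safe; a real system would use
// a ConcurrentHashMap, and would dispatch to per-type ReaderWriter objects
// instead of this if/else chain.
public class ReflectSerializer {
    private static final Map<Class<?>, List<Field>> CACHE = new HashMap<>();

    private static List<Field> fieldsOf(Class<?> type) {
        return CACHE.computeIfAbsent(type, t -> {
            List<Field> fields = new ArrayList<>(Arrays.asList(t.getDeclaredFields()));
            // Sort ignoring case so client/server field order and naming
            // conventions don't matter.
            fields.sort(Comparator.comparing(f -> f.getName().toLowerCase()));
            for (Field f : fields) f.setAccessible(true);
            return fields;
        });
    }

    public static byte[] toBytes(Object msg) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos); // big-endian on the wire
        for (Field f : fieldsOf(msg.getClass())) {
            Class<?> ft = f.getType();
            if (ft == int.class) out.writeInt(f.getInt(msg));
            else if (ft == long.class) out.writeLong(f.getLong(msg));
            else if (ft == String.class) out.writeUTF((String) f.get(msg));
            else throw new IOException("unsupported field type: " + ft);
        }
        return bos.toByteArray();
    }

    public static <T> T fromBytes(byte[] data, Class<T> type) throws Exception {
        T msg = type.getDeclaredConstructor().newInstance();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        for (Field f : fieldsOf(type)) {
            Class<?> ft = f.getType();
            if (ft == int.class) f.setInt(msg, in.readInt());
            else if (ft == long.class) f.setLong(msg, in.readLong());
            else if (ft == String.class) f.set(msg, in.readUTF());
        }
        return msg;
    }

    // Tiny example message (hypothetical).
    public static class MoveMsg {
        public int entityId;
        public long timestamp;
        public String action;
    }
}
```

Because both sides walk the same sorted field list, a round trip through toBytes/fromBytes reproduces the message without any hand-written per-class code.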
<br />
Once you have all this metadata about each message type, it is easy to see how you can make a checksum that will change any time you modify the message class definition. That checksum can be sent when the client first connects to the server. If you connect to an old build, you will get a warning or be disconnected. It can include enough detail that the warning will tell you exactly which class doesn't match, and which field is wrong. You may not want this debugging info in your shipping product (why make it easy for hackers by giving them your protocol description), but the checksums could be retained. The checksum would include the message field names, and types, so any change will trigger a warning. We chose to sort the field serialization alphabetically, and ignored capitalization. That way differences in field order on the client and server didn't matter, and capitalization differences due to language naming conventions were ignored. And atomic types were mapped appropriately.<br />
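The checksum idea can be sketched like this: reduce each message class to a canonical signature (field names and types, sorted, ignoring case), then hash it. Client and server compare hashes at connect time. A real cross-language system would also map types to canonical names shared by both sides (int vs Int32, and so on); the class names here are hypothetical.

```java
import java.lang.reflect.Field;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.CRC32;

// Protocol-version check sketch: any change to a message's field names or
// types changes the signature, and therefore the checksum.
public class ProtocolChecksum {
    public static String signature(Class<?> type) {
        List<String> parts = new ArrayList<>();
        for (Field f : type.getDeclaredFields()) {
            // Lowercase the name so capitalization conventions don't matter.
            parts.add(f.getName().toLowerCase() + ":" + f.getType().getSimpleName());
        }
        Collections.sort(parts); // declaration order must not matter
        return String.join("|", parts);
    }

    public static long checksum(Class<?> type) {
        CRC32 crc = new CRC32();
        crc.update(signature(type).getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Hypothetical message versions for illustration.
    public static class V1 { public int hp; public String name; }
    public static class V2 { public int hp; public String name; public int armor; }
}
```

Keeping the human-readable signature around in debug builds lets the mismatch warning name the exact class and field; a shipping build would send only the checksum.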
<br />
Another consideration was to delay deserialization as long as possible. That way intermediate processes (like an Edge Server) didn't have to have every message class compiled in. Message buffers could be forwarded as byte buffers without having to pay deserialization/serialization costs. This also allowed deserialization to occur on the target thread of a multi-threaded event handling system.<br />
<br />
One necessary code overhead in this system is that the class type of each arriving message has to be pre-registered with the system, otherwise we can't determine which constructor to run with reflection, and we don't know which reflection based serializer to use (or construct, if this is the first use of it since the app started). We need a mapping between the message type identifier in the message header, and the run time type. This registration code allows message types to be compact (an enum, or integer), instead of using the type name as a string (which could be used for reflection lookup, but seemed too much overhead per message send). It has a nice side effect, which is we know every type that is going to be used, so the protocol checking system can make all the checksums ahead of time and verify them when a client makes a connection (instead of waiting for the first instance of each message to detect the mismatch). We might have been able to make use of message handler registration to deduce this message type list, and ignore any messages that arrived that had no handlers.<br />
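The registration step might look like this minimal sketch (names are illustrative): a compact integer id goes in the message header, mapped both ways so the sender can write an id and the receiver can run the right constructor via reflection.

```java
import java.util.HashMap;
import java.util.Map;

// Message-type registry sketch: id -> class for arrival (reflective
// construction), class -> id for sending (compact header).
public class MessageRegistry {
    private final Map<Integer, Class<?>> byId = new HashMap<>();
    private final Map<Class<?>, Integer> byType = new HashMap<>();

    public void register(int id, Class<?> type) {
        if (byId.putIfAbsent(id, type) != null) {
            throw new IllegalArgumentException("duplicate message id " + id);
        }
        byType.put(type, id);
    }

    public int idOf(Class<?> type) {
        Integer id = byType.get(type);
        if (id == null) throw new IllegalArgumentException("unregistered type " + type);
        return id;
    }

    // Run on message arrival, before the serializer fills in the fields.
    public Object newInstance(int id) throws Exception {
        Class<?> type = byId.get(id);
        if (type == null) throw new IllegalArgumentException("unknown message id " + id);
        return type.getDeclaredConstructor().newInstance();
    }

    public static class Ping {} // hypothetical message type
}
```

Since every type is registered up front, the protocol checker can also iterate this registry to verify all checksums at connect time.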
<br />
Some of these features exist in competing libraries, but not all of them were available when we built our system. For example, MessagePack didn't have checksums to validate type matching.<br />
<br />
<br />Darrin West

2012-08-15: Time Management and Synchronization

A key principle of online games is synchronization. Keeping the various players in synch means giving them practically identical and consistent experiences of the virtual world. The reason this is hard is that they are separated by network delays, and worse, those delays are variable.<br />
<br />
The ideal situation would be that all players saw exactly the same thing, and that there were no delays. How do we approach that ideal? Consider a use case: there is an observer that is watching two other players (A and B) racing toward a finish line. The observer should see the correct player cross first, and see that happen at the correct time delay from the beginning of the race. To make that sentence make sense, we need a consistent notion of time. I call this Virtual Time, and my views are drawn from computer science theory of the same name, and from the world of Distributed Interactive Simulation.<br />
<br />
Let's say the start of the race is time zero. Both racers' clients make the start signal go green at time zero, and they begin accelerating. They send position updates to each other and the observer throughout the race, and try to render the other's position "accurately". What would that look like? Without any adjustment, on A's screen, he would start moving forward. After some network latency, B would start moving forward, and remain behind A all the way to the finish line. But on B's screen, B would start moving forward. After some network latency, B would start receiving updates from A, and would render him back at the start line, then moving forward, always behind. Who won? The Observer would see A's and B's updates arrive at approximately the same time, and pass the finish line at approximately the same time (assuming they accelerate nearly equally). Three different experiences. Unacceptable.<br />
<br />
Instead, we add time stamps to every update message. When A starts moving the first update message has a virtual time stamp of zero. Instead of just a position, the update has velocity, and maybe acceleration information as well as the time stamp. When it arrives at B, some time has passed. Let's say it is Virtual Time 5 when it arrives. B will dead reckon A using the initial position, time stamp, and velocity of that time zero update to predict A's position for time 5. B will see himself and A approximately neck and neck. Much better. The Observer would see the same thing. As would A (by dead reckoning B's position).<br />
<br />
This approach still has latency artifacts, but they are much reduced. For example, if A veered off from a straight-line acceleration to the finish line, B would not know until after a little network delay. In the meantime, B would have rendered A "incorrectly". We expect this. When B gets the post-veering update, it will use it, dead reckon A's position to B's current Virtual Time, and compute the best possible approximation of A's current position. But the previous frame, B would have rendered A as being still on a straight-line course. To adjust for this unexpected turn, B will have to start correcting A's rendered position smoothly. The dead reckoned position becomes a Goal or Target position that the consumer incrementally corrects toward. If the delta is very large, you may want to "pop" the position to the correct estimate and get it over with. You can use the vehicle's physics limits (or a little above) to decide how rapidly to correct its position. In any case, there are lots of ways to believably correct the rendered position toward the estimated position.<br />
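A sketch of that goal-position correction (the clamp and snap thresholds here are made-up illustrative numbers; in practice you would derive them from the vehicle's physics limits):

```python
def correct_toward_goal(rendered: float, goal: float,
                        max_step: float, snap_delta: float) -> float:
    """Move the rendered position toward the dead-reckoned goal.
    Small errors are closed by a bounded step per frame; errors larger
    than snap_delta are 'popped' straight to the goal."""
    error = goal - rendered
    if abs(error) > snap_delta:
        return goal                        # too far off: teleport and get it over with
    step = max(-max_step, min(max_step, error))
    return rendered + step                 # otherwise close the gap smoothly

print(correct_toward_goal(0.0, 1.0, max_step=0.25, snap_delta=5.0))   # 0.25
print(correct_toward_goal(0.0, 20.0, max_step=0.25, snap_delta=5.0))  # 20.0 (pop)
```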
<br />
Note that at the start of the race, B would still see A sitting still for a period of time, since it won't get the first update until after a network delay. But once that first update arrives, B will see that the estimated position is up near where B has already moved itself. The correction mechanism will quickly match A's rendered position to the dead reckoned one. This may be perceived as impossibly large acceleration for that vehicle. But incorrect acceleration is less easily detected by a person than incorrect position, or teleporting.<br />
<br />
Another use case is A and B firing at each other. Imagine that B is moving across in front of A, and A fires a missile right when B appears directly in A's path. Let's assume that the client that owns the target always determines whether a missile hits. If we were not using dead reckoning, by the time the missile update arrived back at B, B would have moved two network delays off to the side, and the missile would surely miss. If we are using dead reckoning, A would be firing when B's estimated position was on target. There might be some discrepancy if B was veering from the predictable path. But there is still one more network latency to be concerned about. By the time the firing message arrives at B, B will have moved one network latency off to the side. B could apply some dead reckoning to the missile, flying it out a few Virtual Time ticks, but the angle from the firing point might be unacceptable, and would definitely not be the same as what A experienced.<br />
<br />
A better, more synchronized, more fair experience is to delay the firing of the missile on A while sending the firing message immediately. In other words, when A presses the fire button, schedule the execution of the fire event 5 Virtual Time units in the future. Send that future event to B, and schedule it locally on A as well. On both A and B, the fire event will occur simultaneously. I call this a "warm up" animation. The mage winds up, the missile smokes a bit before it fires, or whatever. The player has to learn to account for that little delay if they want the missile to hit.<br />
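Here's one way the scheduled fire event could look (a toy Virtual Time event queue; the names and the 5-tick warm-up are illustrative):

```python
import heapq

class VirtualTimeScheduler:
    """Events are scheduled at a virtual-time stamp. Each peer runs the
    same queue, so an event sent as 'fire at VT 105' executes at the
    same virtual time on A, on B, and on the Observer."""
    def __init__(self):
        self._queue = []
        self._seq = 0  # tie-breaker so equal-time events stay FIFO
    def schedule(self, vt, event):
        heapq.heappush(self._queue, (vt, self._seq, event))
        self._seq += 1
    def run_until(self, vt):
        """Pop and return every event due at or before virtual time vt."""
        fired = []
        while self._queue and self._queue[0][0] <= vt:
            t, _, event = heapq.heappop(self._queue)
            fired.append((t, event))
        return fired

# A presses fire at VT 100; the event is scheduled for VT 105 locally,
# and the same (105, "fire_missile") message is sent to B.
sched = VirtualTimeScheduler()
sched.schedule(105, "fire_missile")
print(sched.run_until(104))  # [] -- warm-up animation still playing
print(sched.run_until(105))  # [(105, 'fire_missile')]
```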
<br />
If you have a game where firing is instantaneous (a rifle, or a laser), this is a pretty much impossible problem, and you have to put in custom code for it. The shooter decides ahead of time where the bullet will hit, and tells the target. By the time the target is told, both shooter and target have moved, so the rendered bullet fly-out is not in the same location in space, and may well pass through a wall. The point of impact is also prone to cheating, of course. I don't like games that have aiming for this reason. They are very difficult to make fair and cheat proof. I prefer target-based games: you select a target, then use your weapons on it, and the distributed system can then determine whether it worked. Aiming, determining hits, shooting around corners, and so on are tough problems. FPS engines are really tough. You should seriously consider whether your game needs to bite off that problem, or if you can make it fun enough using a target-based system and a little randomness. Then add decorative effects that express what the dice said.<br />
<br />
The key takeaway with Time Management, Virtual Time, and synchronization is that whichever client an action occurs on (A, B, or the Observer), it occurs on the same time axis. There are nice time synchronization algorithms (look up NTP) that rely on periodically exchanging time stamps using direct messages. And you can solve a lot of problems using events that are scheduled to occur in the future based on Virtual Time timestamps. One challenge, though, is how to keep the simulation that runs on Virtual Time in sync with animation and physics that you might be tempted to run on real time. Or what you think is real time. You may have a game that you can pause. How does that work? Do you freeze the animation or not? You are already dealing with different concepts of time. Consider adding Virtual Time. And the notion of Goal positions.<br />
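For the synchronization piece, the classic NTP-style offset estimate from a single timestamp exchange looks like this (a simplified sketch; real NTP filters many such samples, and the formula assumes roughly symmetric network delay):

```python
def estimate_offset(t0, t1, t2, t3):
    """One NTP-style exchange: client sends at t0, server receives at t1
    and replies at t2, client receives at t3 (t0/t3 on the client clock,
    t1/t2 on the server clock)."""
    offset = ((t1 - t0) + (t2 - t3)) / 2.0   # server clock minus client clock
    round_trip = (t3 - t0) - (t2 - t1)       # time actually spent on the wire
    return offset, round_trip

# Client clock is 10 units behind the server; one-way delay is 3 units.
off, rtt = estimate_offset(t0=100, t1=113, t2=114, t3=107)
print(off, rtt)  # 10.0 6
```

Once each client knows its offset from the shared clock, it can stamp outgoing updates and schedule future events on the common Virtual Time axis.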
<br /><h2>Let's Call it MLT (2012-05-29)</h2>I need a name for the Ultimate Online Game Framework we've been discussing, so people know what I'm referring to (crazy idea, or something for that project). I want it to be the greatest thing in the world. And as everyone knows, that is a nice MLT (Mutton, Lettuce and Tomato sandwich). I was toying with calling it Quantum, because I like the thought of Quantum Entanglement representing entity state replication. So maybe the Quantum Distributed Entity System (QDES), and later the Quantum this and that.<br />
<br />
I'm also really stuck on what language to write it in: C++ or Java? I think it is critical that a game development team use the same language on client and server. Otherwise the team gets split in half, and winds up not talking enough. Being able to swap people onto whatever needs work is really important for keeping to a schedule. And hiring from two piles of expertise is just that much harder. Java has so many great libraries. But C++ is so much more popular in the game industry. On mobile, you've got Objective-C. If you're using Unity, you've got C#.<br />
<br />
To be honest, most of your game logic should probably be written (at least initially) in your favorite scripting language. Java and C++ can both embed Python, Lua, or JavaScript, for example. IKVM appears to allow Mono to be embedded in Java, so you could maybe even do C# on both sides. This argument (<a href="http://www.mono-project.com/Scripting_With_Mono">http://www.mono-project.com/Scripting_With_Mono</a>) is certainly compelling. Ever heard of Gecko? It embeds a C++ compiler so you can compile and load/run C++ "scripts" at run time. The reason I bring up all these options is this: you could implement client and server in different languages, but embed the same scripting engine in both sides, so game developers get the benefit of reusing their staff on both sides.<br />
<br />
If the architecture and design of QDES is awesome enough, someone might reimplement it in their favorite language anyway. Or port it to a very different platform that doesn't support whatever language choice we make. So maybe the choice I'm stuck on isn't that serious right now. Maybe I should just start with the architecture and design.<br />
<br />
I'm drafting some high level requirements and am thinking I'll post them here on a "page" instead of a blog article. Not sure how comments and such would work with that however. But I wanted something stickier than a post and something I could update and have a single reference to. Same with the designs as they start to come.<br />
<br />
Anyway. Thanks for being a sounding board. Maybe some of you and your friends will have time to contribute to (or try out) all this. (And I haven't completely forgotten about doing that tech survey to see what other similar projects are out there in the open source world. I found a few. Some are a little dusty. Some are a little off track from what I'm thinking.)<br /><h2>Essential Components of the New Framework (2012-05-25)</h2>What are the components that are absolutely essential to get right in the Ultimate MMO Framework (still need a catchy name)? If you had those components today, you'd be using them in your current game, even if you had to integrate them with a bunch of other stuff to get a full solution. If we could build those essential pieces, and they were free, I think they would slowly take over the industry, shouldering less effective solutions out of the way. If each piece can be made independent of the others, there is a better chance the system will be adopted even if someone thinks it isn't all a perfect fit.<br />
<br />
It is tempting to start with message passing. After all, it isn't an online game without that. But there are a lot of great message passing systems. There might even be some decent event handling systems. And there are definitely some good socket servers that combine the two. While I see problems with these that could be improved, and I see missing features, I think message passing is a second priority. You could make do with an existing system, and swap it out later. Although I don't see how you get away from the assumption that there is a publish/subscribe API the higher levels can use.<br />
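For reference, the publish/subscribe surface I have in mind could be as small as this (an in-process sketch; a real transport such as sockets, JMS, or shared memory would sit behind the same two calls):

```python
from collections import defaultdict

class MessageBus:
    """Minimal publish/subscribe API the higher layers would code
    against. Swapping the implementation (local dispatch, sockets,
    a broker) should not change this surface."""
    def __init__(self):
        self._subscribers = defaultdict(list)
    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)
    def publish(self, topic, message):
        for handler in self._subscribers[topic]:
            handler(message)

bus = MessageBus()
seen = []
bus.subscribe("entity.moved", seen.append)
bus.publish("entity.moved", {"id": 42, "pos": (1, 2)})
print(seen)  # [{'id': 42, 'pos': (1, 2)}]
```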
<br />
More essential is the Entity System. It is where game play programmers interact with your system, and when done well, much of the rest of the system gets abstracted away. The decisions made here are what enable a server to scale, a client to stay in sync, content to get reused in interesting ways, and game play to be developed efficiently. By the way, what I mean by an Entity is a game object: something that represents a thing or concept in the game world. The term comes from discrete event simulation. Jumping ahead a bit, the Entity System needs to be Component-oriented, such that an Entity is aggregated from a collection of Components. The Entity State system is the basis of replication (to support distributed computation and load balancing) and of persistence. Done well, the Entity System can be used for more than just visible game objects; it could support administrative and configuration objects as well.<br />
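A bare-bones sketch of the Component-oriented Entity idea (the `Position` and `Mover` components are invented for illustration; a real system would add replication and persistence hooks here):

```python
class Component:
    def update(self, entity, dt):
        pass  # default: components may be pure state

class Position(Component):
    def __init__(self, x=0.0):
        self.x = x

class Mover(Component):
    def __init__(self, vel):
        self.vel = vel
    def update(self, entity, dt):
        # Components reach each other through the owning entity.
        entity.get(Position).x += self.vel * dt

class Entity:
    """An entity is nothing but an id plus an aggregation of components."""
    def __init__(self, eid, *components):
        self.eid = eid
        self._components = {type(c): c for c in components}
    def get(self, ctype):
        return self._components[ctype]
    def update(self, dt):
        for c in self._components.values():
            c.update(self, dt)

e = Entity(1, Position(0.0), Mover(vel=3.0))
e.update(dt=2.0)
print(e.get(Position).x)  # 6.0
```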
<br />
Related, but probably separable, is the Behavior system. How do you get Entities to do something, and how do they interact with one another? I don't mean AI; I mean behavior in the OO encapsulated-state-and-behavior sense. It will be interesting to distinguish between, and tie together, AI "plans", complex sequences of actions, and individual behavior "methods". And, of course, the questions of languages, scripting and debugging land right here. (A second priority that relates to Behaviors is time management. How do you synchronize client and server execution? Can you pause? Can you persist an in-flight behavior?)<br />
<br />
Content Tools are the next big ticket item. If the Framework is stable enough, a new game could be made almost exclusively using these tools. Close to 100% of your budget would be spent pushing content through them in some imaginary perfect world. These tools allow for rapid iteration and debugging across multiple machines, and multiple platforms. It is not clear how much of the tools can be made independent of Entity state and behavior choices.<br />
<br />
What other systems are absolutely essential? Ones that drive everything else, that don't already exist, and that you feel like you always have to rebuild?<br />
<br />
What do you think? Wouldn't it be cool to have a standalone Entity System that you could use in your next game? What if it came with a collection of Components that solved a lot of major problems like movement, collision, visibility, containment?<h2>Why I don't like threads (2012-05-19)</h2>People say I'm crazy because I don't like threads in this day and age of multicore processors. But I have good reason based on years of experience, so hear me out.<br />
<br />
Doing threading well is harder than most people think. A decision to multithread your application imposes a burden on all developers, including those less experienced with such things than the ones normally making that decision. I like to think that framework and server developers are much more experienced with threading than the game developers who might be plugging stuff into the server. And framework developers are building their framework to be used by those game play programmers. Play to your audience: wouldn't it be better if they never had to know about threading issues? I believe it is possible to build a framework that insulates regular developers from such concerns (the communicating sequential processes model).<br />
<br />
Now, I'm a big fan of using threading for what it is good for: dealing with interrupts that should be serviced ASAP, without polluting your main application with lots of polling and checking. Examples would be a background thread for servicing the network (you want to drain the network buffers quickly so they don't overflow, back things up, and cause retransmissions); background file or keyboard I/O, which needs to rarely wake up and service an incoming or outgoing I/O buffer; and remote requests that block for a long time and would otherwise stall the app (like a DB request, or an HTTP request). Note in particular that none of these are high-performance computations; they are dominated by the blocking/waiting time. The use of a thread in this case is really all about an easier programming model: the background thread can be written as a loop that reads or writes using blocking calls, and since it is not done on the main thread, the main thread doesn't have checks and polling sprinkled in.<br />
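That background-thread-plus-queue shape looks like this (sketched with a queue standing in for a real blocking socket, so the example is self-contained):

```python
import queue
import threading

inbound = queue.Queue()

def network_reader(sock_like, q):
    """Background thread: block on the 'socket', drain each packet into
    a queue, and never touch shared game state directly."""
    while True:
        packet = sock_like.get()   # stands in for a blocking recv()
        if packet is None:
            break                  # shutdown sentinel
        q.put(packet)

# A fake blocking source standing in for a real socket.
fake_socket = queue.Queue()
t = threading.Thread(target=network_reader, args=(fake_socket, inbound), daemon=True)
t.start()
fake_socket.put(b"move 1 2")
fake_socket.put(None)
t.join()

# The main loop consumes at its own pace, with no polling sprinkled through it.
print(inbound.get())  # b'move 1 2'
```

The only shared state is the thread-safe queue, which is exactly the kind of encapsulated threading I'm arguing for.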
<br />
Most of the time, when you have some heavy computation to do, it will eventually scale up to require more than a single machine. So you are going to have to write your system to be distributed, communicating between machines using messages, anyway. Once you have done that work, you can easily use it within a single machine that has many cores. If instead you build one system that uses the many cores within a machine via threads, and another that solves the distributed case, you've doubled your work and maintenance effort. One of the best ways to decompose a problem to be solved by worker threads is to deliver the work in a task queue and have them contend for it; as each task is pulled off, it is processed by a handler function. This is exactly the same approach you would use for the distributed case. So the only difference is whether messages pass between processes, possibly on the same machine, or task-messages pass between threads. Yes, I understand the performance difference. But if your app is that performance sensitive, the inter-process message passing can be implemented using shared memory, avoiding the kernel switches when delivering a message within the current machine. The intent here is to save you from the double implementation and to save your framework users from having to deal with thread programming; the performance difference is pretty small.<br />
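The task-queue decomposition might look like this. Note that the handler-dispatch shape would be identical if the workers were processes on other machines pulling task-messages off the wire; the thread version is shown only for brevity:

```python
import queue
import threading

def worker(tasks, results):
    """Workers contend for tasks from a shared queue. Each task names a
    handler and carries a payload, exactly as a network task-message
    would in the distributed version of the same system."""
    while True:
        task = tasks.get()
        if task is None:
            break                       # one shutdown sentinel per worker
        kind, payload = task
        results.put(HANDLERS[kind](payload))

HANDLERS = {"square": lambda n: n * n}  # illustrative handler table

tasks, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(tasks, results)) for _ in range(2)]
for w in workers:
    w.start()
for n in (2, 3):
    tasks.put(("square", n))
for _ in workers:
    tasks.put(None)
for w in workers:
    w.join()
print(sorted(results.get() for _ in range(2)))  # [4, 9]
```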
<br />
There is also a big problem with heavily threaded apps in production: the tools for diagnosing thread-related performance problems are pretty lousy. When you are only dealing with background interrupt-handling threads, there won't be serious performance problems unless one of the background threads starts polling wildly and consuming 100% CPU. But if a highly multithreaded app starts using too much CPU, or starts being unresponsive, how do you tell what is actually happening among the various threads? They don't have names, and the kernel isn't very good about helping you keep track of what work happens on each thread. You wind up having to build your own instrumentation into the application every time there is such a problem, and doing a new build and getting it into production is a lot of effort. On the other hand, if you follow the distributed model, you can easily see which process is spiking CPU. You can easily instrument the message flows between processes to see if there is too much or too little inter-process communication. Often you wind up logging all such traffic for post-mortem analysis anyway. Remember, you are not likely to have the luxury of attaching a debugger to a production process to grab stack traces or whatnot. So you are left staring at a monolithic multithreaded app, trying to guess what is going on inside.<br />
<br />
Problems in threaded apps tend to be subtle, they wind up being hard to debug, and often only show up after the app has been running at production loads for quite a while. Writing good threaded software is the responsibility of every programmer that touches an app that adopts it, and the least experienced programmer in that app is the one you have to worry about. Operating and debugging the app in production is not easy. These are the reasons I don't like threads. I'm not afraid of them. I understand them all too well. I think there are better ways (CSP) that are actually easier and faster to develop for in the first place. And you are likely to have to adopt those ways in any case as you scale beyond a single machine.<br />
<br />
More thoughts on this subject here: <a href="http://onlinegametechniques.blogspot.com/2009/02/manifesto-of-multithreading-for-high.html">http://onlinegametechniques.blogspot.com/2009/02/manifesto-of-multithreading-for-high.html</a><br />
<br />
(Any position statement like this is going to sound a little nuts if you try to apply it to every conceivable situation. What I was thinking about when I wrote this was a server application; in particular, something large scale and event oriented. If, for example, you are writing a graphical client on a 360, using multiple processes would be loony. Multiple processes listening on the same socket can be a problem (with some workarounds). You might not be able to abide even a shared-memory message passing delay between components, as in a rendering pipeline. Interestingly, these examples are all amenable to the same analysis: what response time is required, what resources are being shared, how much will the computation have to scale up, and is the physical hardware intrinsically distributed anyway? My point is that the default answer should be to encapsulate any threading you *have to* do, so that the bulk of your development doesn't pay the daily overhead of always asking: is that line of code accessing anything shared? Is this data structure thread safe? It slows down important conversations, and it leaves shadows of doubt everywhere.)<br />
<br /><h2>Ultimate Open Source Game Development System (2012-05-15)</h2>I'm between jobs again. It happens a lot in this industry. One of the frustrating things about that is you leave behind investments you've made in building infrastructure that was supposed to save you effort on future projects. If the company you're leaving fails, that infrastructure investment won't help anyone. So what do you do? Just build another one for the next company? Rinse, repeat. Buy something? Can't influence that much. Build a company that sells infrastructure? Pretty hard to make a profit selling to us picky developers. Open source? Let's think about that...<br />
<br />
If I was to build an open source game development system, what would I focus on? I'm not a graphics whiz, but I know about simulation, OO, distributed systems, MMOs, and picky developers. Let's see here...<br />
<ul>
<li>Pick a primary development language. But realize that not all developers will love it. Is there a way to support multiple languages?</li>
<li>Rapid iteration is key, so game logic must be able to be written in a scripting language. But it must also be possible to hard code parts of it for performance.</li>
<li>Tools are key. In a good game development project the majority of the effort of the team should feed through the content tools, not the compiler. Wouldn't it be ideal if you could build a great game without any programmers?</li>
<li>It must be debuggable. In a distributed system, this requires some thought.</li>
<li>The world size must be able to scale. A lot of projects are bending their game design to avoid this problem, and that is definitely the least expensive approach. But if your infrastructure supports large scale on day one, what could you do?</li>
<li>You want reuse of game logic, realizing different developers/designers have different skills. This means you want a hierarchy of game elements that are developed by appropriate experts, and snapped together by others. This game object and level design effort should be efficient and fun.</li>
<li>The team size must be able to scale; both up and down. This is a matter of content management. You don't want central locked files everyone contends for, and you don't want burdensome processes if you are a small team.</li>
<li>It should be runtime efficient in space and time. This enables use on games that have huge numbers of game object instances.</li>
<li>It is easy to ask for dynamic object definition, but that can work against performance. How often is it actually used? And what other ways are there to realize that effect?</li>
<li>The framework should be intuitive to as many people as possible. This means being very careful about terminology.</li>
<li>Now we enter an area I don't know much about. How do you structure the infrastructure development project itself? You want buy in from lots of people, but you also need a single vision so the result is consistent. You need a means to adjust the vision without blowing up the whole effort, and without encouraging people to fork the effort.</li>
<li>What will be the relationship to other open source projects? We can choose to use various utility libraries. But what about using other game projects for graphics? Issues will arise at the boundaries.</li>
<li>The system should be useful and get some adoption even without being finished. Because it will never be finished.</li>
<li>It should be modular enough. That allows a developer to use the parts they want, and replace the parts they must. It allows broken parts to be replaced.</li>
<li>We will need a demo game. How ambitious should it be?</li>
<li>We need a name. And a vision statement.</li>
</ul>
Is this kind of thing really needed by our industry? Why is so much infrastructure rewritten all the time? Is it just the projects I've hit? Do we always push the envelope such that if this existed today we'd spend our entire engineering budget extending it? Or is it a fundamental truth that there can be no Ultimate infrastructure, because every game is too different? Lost cause, or an idea whose time has come?<br />
<br />
I think the first thing to do is to survey the current state of open source game systems.<h2>Techniques for Handling Cheating (Part 1) (2011-06-02)</h2>Cheating is fun for some people. It is a game on top of your game: "Can I find a path through the maze of security mechanisms you have laid in my path?"<br />
<br />
First, why does a developer care about cheating in online games?<br />
<ul><li>They spent a lot of effort making content so they want to make sure players experience it instead of skipping over it and "stealing" the reward. The idea being that the players will have more fun facing the challenges and beating them. They'll appreciate it more if they have to work for it. Maybe. Some people are weird, and get a sense of appreciation out of working through the cheats.</li>
<li>Cheating can directly interfere with other players' enjoyment of the content. E.g. griefing, stealing their stuff, ...</li>
<li>The perception of unfairness (everyone else has all the goodies, and you don't; you can't win PvP without also cheating; ...). Players can get frustrated by this and leave, and the developer loses money.</li>
<li>It can interfere with the operation of the servers, and that interferes with other players' enjoyment of the game.</li>
<li>Cheaters can actually steal something of value. If they sell it (e.g. gold farming), that can affect in game economy, or more directly, affect the profitability of the company. </li>
</ul>If players cheat and no one else notices but them, you probably don't care; let them have their fun. But if they cheat and stop paying you money, it matters, even if they don't bother anyone else. That might happen if they get bored because they've maxed out their account easily, or if they get everything they need without a subscription (e.g. with a free account).<br />
<br />
The interaction between cheaters and developers has been called an arms race. And there are a lot more players than developers. Developers can't really hope to keep up and close every possible issue. So at some point it becomes a cost benefit thing. There will always be some cheating. You'll want to hit the big ones, and pick your battles.<br />
<br />
There are a number of aspects to consider:<br />
<ul><li>Detection: what is a cheat? Maybe it is gaining XP or loot too quickly. Test for this on the fly by adding logic to the game server? Run metrics queries against the DB or event logs periodically?</li>
<li>Reporting: put something in the server logs; send an alert email; weekly report out of the metrics system?</li>
<li>Mitigation: take away what they gained? ban them (and lose their subscription money)? Reimburse other players that have been harmed?</li>
<li>Prevention: do your best to secure the attack points of your system; check all client requests for sanity; do summary level real time rate limiting (detects your own bugs cheaters might exploit, speed hacks, bots/farming, aim-bots...); don't trust the client</li>
</ul>Because this is an arms race, the enemy will find the edges of your detection and prevention system. E.g. they will fake a head shot just often enough not to get caught; they will farm gold just below the detection rate; and so on. So what you as a developer need to do is decide what rate of cheating is acceptable and still meets the goals of not letting cheaters ruin the fun of your game, or make you broke. Some titles have capped progress per day.<br />
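A sliding-window rate check of the kind described above could be sketched like this (the thresholds are invented for illustration; real ones come from your game's metrics on what legitimate play can produce):

```python
from collections import deque

class RateLimiter:
    """Flag a player whose XP (or loot, or kills) per time window
    exceeds what legitimate play could produce."""
    def __init__(self, max_amount, window):
        self.max_amount = max_amount
        self.window = window
        self.events = deque()   # (timestamp, amount) pairs inside the window
        self.total = 0
    def record(self, now, amount):
        """Record a gain; returns False when the windowed total looks suspicious."""
        self.events.append((now, amount))
        self.total += amount
        # Evict events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, old = self.events.popleft()
            self.total -= old
        return self.total <= self.max_amount

limiter = RateLimiter(max_amount=1000, window=60.0)
print(limiter.record(0.0, 400))   # True
print(limiter.record(10.0, 400))  # True
print(limiter.record(20.0, 400))  # False -- 1200 XP in 20s, flag for review
```

A cheater farming just below `max_amount` still gets through, which is exactly the edge-finding behavior described above; the threshold is a cost-benefit decision, not a cure.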
<br />
I think one of the best mitigation strategies is public shaming. It leaves cheaters thinking that "everyone" is watching them, and it lets non-cheaters see that you as a developer are paying attention. You can let players report on other players. Ban the egregious cheaters, especially if they are griefing other players. Of course, they will be back with a different email address if their goal in life is to cause trouble. But this is a slippery slope, susceptible to gaming as well. If you provide a means for the community to use social pressure against perceived cheaters, it can also be exploited by cheaters for griefing. E.g. if you show the community the number of reports against a player, you might think it would highlight those that should be avoided. But some might consider it a badge of honor (among thieves), or, worse, will use it for extortion against unempowered innocents.<br />
<br />
You will want some form of "ignore", however, that each player can apply to those they consider a cheater. It could be used to make sure a player never gets matched into a dungeon instance or PvP match with someone, or never has to listen to their obnoxious chat. Ideally, it would stop them from interacting with your character at all, and make them invisible. Just imagine being in kindergarten, and all the other kids ignoring you. You aren't kicking them out of the game, but almost. Again, this might be exploited. What if someone ignored every player that was better than them at PvP? It would artificially inflate their win rating, and your leaderboards would be unfair.<br />
<br />
But let's talk about the technical aspects of cheat prevention. (Let's ignore server intrusion problems.) Ultimately, the way a player manipulates the system is through the messages their client sends to the server. If your client is bug free, and has not been tampered with, all is well: the messages are the result of a human operating the UI as the designers intended, and the difference between two players is their skill and knowledge of the game. But how can the server be sure all is well? It can only look at the messages and try to differentiate between an untampered client and one that has been tampered with or replaced with a script.<br />
<br />
I'll post this and come back later with a discussion of different kinds of attacks and ways to deal with them.<h2>Super Hero Squad (our latest title) is now live (2011-05-15)</h2>Things have been quiet here because all my attention was focused on Super Hero Squad (<a href="http://www.heroup.com/">www.heroup.com</a>). It is a Marvel title developed at The Amazing Society in Seattle, a studio of Gazillion. It is a lightweight MMO, using the Unity graphics engine, Smartfox, Apache, some Java apps on the back end, and MySQL. It is shardless, and the architecture scales horizontally with the number of concurrent players, including the database. The back-end components are loosely coupled via JMS publish/subscribe.<br />
<br />
It has definitely been a fun project, and I'm working with a team with lots of deep experience. Load is ramping up, but not yet near the load tests we ran ahead of time. So I'm paying attention, but not anxious about it.<br />
<br />
Along the way, we found ways to ship early and still have a fun and stable game. But as with all MMOs that actually launch, there is a lot of work left to do when you are "done". The context switch is challenging right now: going from "we have to ship; we are not going to do that" to "remember those things we cut to simplify things? It's time to put them back on the table". Now we have the fun of changing things without breaking a running service. And monitoring and fixing the service cuts into development. So things slow down at the same time they get more reactionary.<h2>Running branches for continuous publishing (2011-02-27)</h2>I am a very strong proponent of what are called running branches for development of software, and for the stabilization and publication of online games. One of the more important features of large-scale online games is that they live a long time, with new content, bug fixes, and new features added over time. It is very difficult to manage that much change in a relatively large body of code and content. And since you continue to develop after any release, you will want your developers to be able to continue working on the next release while the current one is still baking in QA and rolling toward production.<br />
<br />
I will skip the obvious first step of making the argument that version control systems (aka source code change control, revision control) are a good idea. I like Perforce. It has some nice performance advantages over Subversion for large projects, and has recently incorporated ease of use features like shelving and sandbox development. I like to call the main line of development mainline. I also like to talk about the process of cutting a release and deploying it into production as a "train". It makes you think about a long slow moving object that is really hard to stop, and really difficult to add things to and practically impossible to pull out and pass. And if you get in the way, it will run you down, and someone will lose a leg. Plus it helps with my analogy of mainline and branch lines.<br />
<br />
So imagine you are preparing your first release. You make a build called Release Candidate 1 (RC1), and hand it off to QA. You don't want your developers to go idle, so you have two choices, they can pitch in on testing, or they can start working on release 2. You will probably do a bit of each, especially early in the release cycle, since you often dig up some obvious bugs, and can keep all your developers busy fixing those. But at some point they will start peeling off and need something to do. So you sic them on Release 2 features, and they start checking in code.<br />
<br />
Then you find a bug. A real showstopper. It takes a day to find and fix. Then you do another build and you have RC1.1. But you don't want any code from Release 2 that has been being checked in for several days. It has new features you don't want to release, and has probably introduced bugs of its own. So you want to use your change control system to make a branch. And this is where the philosophy starts. You either make a new branch for every release, or you make a single Release Candidate branch and for each release, branch on top of it.<br />
<br />
Being prepared ahead of time for branching can really save you time and confusion, especially during the high stress periods of pushing a release, or making a hotfix to production. So I'm really allergic to retroactive branching, where you only make a branch if you find a bug and have to go back and patch something.<br />
<br />
Here's why: the build system has to understand where this code is coming from, or you will be doing a lot of manual changes right when things are the most stressed. If you have already decided to make branches, you will also have your build system prepared and tested to know how to build off the branch. You will also have solved little problems like how to name versions, prepare unambiguous version strings so you can track back from a build to the source it came from, and many more little surprises.<br />
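To make "unambiguous version strings" concrete, here's a tiny sketch (the class and format are my own invention, not from any particular build system) that bakes the branch name and the source control changelist into the version, so any binary can be traced back to the exact code it was built from:

```java
// Hypothetical sketch: encode branch + changelist into the version string
// so every build can be traced back to its source, even during a hotfix.
public class BuildVersion {
    public final String branch;     // e.g. "RC" or "ML"
    public final int release;       // e.g. 2 for RC2
    public final int changelist;    // the source control revision the build came from

    public BuildVersion(String branch, int release, int changelist) {
        this.branch = branch;
        this.release = release;
        this.changelist = changelist;
    }

    @Override
    public String toString() {
        // e.g. "RC-2.48113"
        return branch + "-" + release + "." + changelist;
    }

    public static BuildVersion parse(String s) {
        String[] dash = s.split("-");
        String[] dot = dash[1].split("\\.");
        return new BuildVersion(dash[0], Integer.parseInt(dot[0]), Integer.parseInt(dot[1]));
    }
}
```

With a running branch, only the `changelist` part churns between releases; the build configuration itself never has to change.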
<br />
The build system is another reason why I prefer running branches as opposed to a new branch per release. You don't have to change any build configuration when a new release comes along. The code for RC2 is going to be in exactly the same place as RC1. You just hit the build button. That kind of automation and repeatability is key to avoiding "little" mistakes. Like accidentally shipping the DB schema from last release, or wasting time testing the old level up mechanism, or missing the new mission descriptions.<br />
<br />
And then there is the aesthetic reason. If you cut a branch for every release, your source control depot is going to start looking pretty ugly. You are planning on continuous release, right? Every month. After 5 years that would be 60 complete copies of the source tree. Why not just 2: ML and RC (and maybe LIVE, but let's save that for another time).<br />
<br />
Finally, as a developer, if you are lucky enough to be the one making the hotfix, you will want to get a copy of the branch onto your machine. Do you really want another full copy for each release that comes along? Or do you just want to do an update to the one RC branch you've prepared ahead of time? It sure makes it easier to switch back and forth.<br />
<br />
An aside about labels: You might argue you could label the code that went into a particular build, and that is a good thing. But one problem with labels that has always made me very nervous is that labels themselves are not change controlled. Someone might move a label to a different version of a file, or accidentally delete it or reuse it, and then you would lose all record of what actually went into a build. You can't do that with a branch. And if you tried, you would at least have the change control records to undo it.<br />
<br />
One more minor thought: if you want to compare all the stuff that changed between RC1 and RC2, it is much easier to do in a running branch. You simply look at the file history on the RC branch and see what new stuff came in. To do that when using a branch per release requires a custom diff each time you want to know: e.g. drag a file from one branch onto the same file on the other. Pretty clumsy.<br />
<br />
Also note that these arguments don't apply as well for a product that has multiple versions shipped and in the wild simultaneously. An online game pretty universally replaces the previous version with the new one at some point in time. The concurrency of their existence is only during the release process.<br />
<br />
Summary:<br />
<ul><li>You want to branch so you can stabilize without stopping ongoing work for the next release</li>
<li>You want a branch so you are ready to make hot fixes</li>
<li>You want a running branch so your build system doesn't have to get all fancy, and so your repo looks simpler.</li>
</ul><br />
<br />
I may revisit the topic of branching in the form of sandbox development which is useful for research projects and sharing between developers without polluting the mainline.Darrin Westhttp://www.blogger.com/profile/08387670564219526228noreply@blogger.com0tag:blogger.com,1999:blog-1757856588756891210.post-29063681061874482062011-01-16T00:50:00.000-08:002011-01-16T00:50:00.790-08:00Topics are not Message TypesI periodically have an unproductive conversation about how to use Topics/Categories vs how to use Message Types. Hopefully this time will be better.<br />
<br />
Both things appear to be used to "subscribe", and both wind up filtering what a message handler has to process and gets to process. If they can be used for exactly the same purposes, it is "just" policy as to what you use each one for. That has to be wrong, otherwise there would not be *two* concepts. Thus, there has to be a useful distinction. So let's define what they are and what their responsibilities are.<br />
<br />
First a definition or two:<br />
<ul><li>Hierarchical: a name is defined hierarchically if the parent context is needed to ensure the child is distinct from children of other parents when the children have the same name. The parents provide the namespace in which the child is defined.</li>
<li>Orthogonal: names are independent of one another, like dimensions or axes in mathematics.</li>
</ul><br />
Categories are names (or numbers) that are used to decompose a stream of messages into groups. In JMS they are called Topics, but I'm going to avoid that term in case the implementation of Topics in JMS implies something I don't mean. A message is sent on, or "to" a single Category. A consumer subscribes to one or more Categories. Sophisticated message publish/subscribe or producer/consumer implementations can support wildcards or bitmasking to optimize subscription to large sets of Categories. (While not very germane to this discussion, I believe JMS can only have wildcards at the end of a Topic, and only at a dot that separates portions of the Topic. My view of wildcards and Category masking does not have that limitation. But that shouldn't affect my arguments.)<br />
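To illustrate what I mean by wildcards at any level (a toy matcher of my own, not the JMS API), here's a sketch of matching a subscription pattern against a dot-separated Category name:

```java
// Sketch: dot-separated Category names with "*" allowed at any level,
// unlike JMS Topics, which only allow wildcards on trailing segments.
public class CategoryMatcher {
    // Does a subscription pattern like "game.*.chat" match "game.room42.chat"?
    public static boolean matches(String pattern, String category) {
        String[] p = pattern.split("\\.");
        String[] c = category.split("\\.");
        if (p.length != c.length) return false;
        for (int i = 0; i < p.length; i++) {
            if (!p[i].equals("*") && !p[i].equals(c[i])) return false;
        }
        return true;
    }
}
```

A bitmask version would do the same job with integer Categories and an AND instead of string compares; the point is that the filtering happens before any handler code runs.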
<br />
It is critical to have a mechanism that efficiently filters network messages so that a consuming process is not "bothered" by messages arriving that are immediately discarded. Running the TCP stack, for example, can wind up consuming large fractions of the CPU, and if the message is discarded, even after a simple inspection by your message framework, that is totally wasted processing. Further, if the messages are traveling over a low bandwidth link to a player, for example, it can badly affect their experience as it steals network resources from more important traffic. So we want the sender, or some intermediary to filter the messages earlier.<br />
<br />
Early distributed simulation implementations (DIS) used multicast groups, and relied on the Network Interface hardware to filter out any messages in groups that the consumer had not subscribed to. Ethernet Multicast tends to broadcast all messages, and rely on the NIC of each host to inspect and filter unwanted messages. That is better than having the kernel do it. Switches get into the picture, but are very simplistic when it comes to multicast. When there are more than a few groups, switches and NICs will become promiscuous, and all messages get broadcast anyway, and wind up in each destination's kernel. They are filtered there, but much of the network stack has already executed. To get around that, physically segmented networks with intelligent bridges were built to copy messages from one segment to another. The bridge or rebroadcaster or smart-router would crack open each message and send it into another segment based on configuration, or a control protocol (subscription request messages).<br />
<br />
Ancient history. However, it formed the origin of the concept of numeric Categories. A message is sent to a single Category. A consumer subscribes. The Channel/Category/Subscription manager maintains the declared connectivity and routes the messages.<br />
<br />
So. Categories are used to optimize routing. They minimize the arrival of a message to a process. So far, this has nothing to do with what code is run when it arrives.<br />
<br />
Message types are also names but are used to identify the meaning of a message; what the message is telling or requesting of the destination; what code should run when the message arrives (or what code should not run). Without a message type, there would be only one generic handler. In the old days, that master-handler would be a switch statement, branching on some field(s) of the message (let's call that field the message type, and be done with it).<br />
<br />
There is some coded, static binding of a message type to a piece of code; the message handler. Handler X is for handling messages of type Y. A piece of code cannot process fields of a message different than what it was coded for. There is little reason to make that binding dynamic or data-driven. Static binding is "good". It leads to fewer errors, and those errors can be caught much earlier in the development cycle. Distributed systems are hard. You don't really want to catch a message-to-code mismatch after you've launched. One way to think about this static binding is as a Remote Procedure Call. You are telling a remote process to run the code bound to message Y. In fact, you can simplify your life by making the handler have the same name as the message type, and not even register the binding.<br />
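Here's a minimal sketch of that static binding (all names are hypothetical): one handler per message type, bound once at startup, and note that the dispatcher never mentions Categories anywhere:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Sketch: message-type-to-handler binding, established statically at startup.
// Dispatch looks only at the message type; Categories never appear here.
public class Dispatcher {
    private final Map<String, Consumer<Map<String, Object>>> handlers = new HashMap<>();

    public void bind(String messageType, Consumer<Map<String, Object>> handler) {
        // duplicate bindings are a coding error, caught early, not at runtime in live
        if (handlers.put(messageType, handler) != null)
            throw new IllegalStateException("duplicate handler for " + messageType);
    }

    public void dispatch(String messageType, Map<String, Object> body) {
        Consumer<Map<String, Object>> h = handlers.get(messageType);
        if (h == null) throw new IllegalArgumentException("no handler for " + messageType);
        h.accept(body);
    }
}
```

The RPC flavor comes from the convention: a message of type `PlayerMoved` runs the handler bound to `PlayerMoved`, regardless of which Category carried it in.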
<br />
A message can be sent to any Category regardless of the message's type. There is no checking in code that a choice is "legal". The Category can be computed, and the message is bound to that value dynamically. Instances of the same message type can be sent to one of any number of Categories. Consumers can subscribe to any Category whether they know how to process all the message types it contains or not.<br />
<br />
So. Back to the distinction. When code is declared to be able to handle messages of type Y, that does not imply that all message instances of type Y should arrive at the process with that handler. You may want to do something like load balancing where half the messages of type Y go to one process, and the other half go to a tandem process. So message types are independent of Categories. The two concepts are orthogonal.<br />
<br />
When a process is subscribed to a Category, there is no guarantee to the subscriber about the message types that a producer sends to that Category. It is easy to imagine a process receiving messages it does not know how to handle. The sender can't force the receiver to write code, but the sender can put any Category on a message it wants. So Categories are independent of message types. The two concepts are orthogonal.<br />
<br />
Now. With respect to hierarchy. Message type names can be declared within a hierarchical namespace. That can be pretty useful. At the end of the day, however, they are simply some strings, or bit strings. In a sophisticated system that maps message types to message classes (code), the class hierarchy may mirror the type name hierarchy, and have interesting semantics (like a handler for a base message class being able to handle a derived message class). But mostly, message type name hierarchy is useful to avoid collisions.<br />
<br />
In systems like JMS, Categories (Topics) are also hierarchical. This is also done to avoid collisions in the topic namespace, and for organization. But it is also useful for wildcard subscription.<br />
<br />
Now "the" question: are Categories within the Message Type Hierarchy, or are Message Types within the Category hierarchy? Or are they orthogonal to one another? I submit that a message of a given type means the same thing no matter which Category it arrived on. Further, the same message type can be sent to any Category and a Category can transport any number of different message types.<br />
<br />
Since there is only one message exchange system, Categories cannot be reused for two purposes without merging the message streams. That leads to inefficiency. If you reuse a message type name for two different purposes, you run the risk of breaking handler code with what appears to be a malformed message. That leads to crashes. You could permit that kind of reuse, and institute policy and testing to keep those things from mingling (e.g. reuse message types, but only on different topics), but it is a looming disaster. I would put in some coordination mechanism or name spacing to keep the mingling from happening at all.<br />
<br />
So what are the consequences:<br />
<ul><li>There is no need to include Category when registering a message handler. </li>
<li>Category subscription occurs separately from handler-to-message-type mapping, and affects the entire process.</li>
<li>There is no need to build a message dispatcher that looks at Categories.</li>
</ul>Well. That was pretty long winded. For those of you still here, I have an analogy. I haven't thought it through a lot, but it looks like it fits (although it is about a pull system, not a push system). URLs. The hostname and domain name represent a hierarchical Category or Topic. The path portion is the message type and identifies the handler (web service), and is also hierarchical. You can host your web site on any host on any domain, and the functionality would be the same. You can host any web site on your host. You can host any number of web sites on your host, provided the paths don't collide. If they do collide, you are going to get strange behavior as links refer to the wrong services, or pass the wrong parameters. One would need more hierarchy. Or you don't host the colliding web sites together. You put them on different addresses. But the service code doesn't care what address you choose.<br />
<br />
Unless you talk about virtual hosts, or virtual processes, multiple independent connections to the message system, thread-local subscriptions. You can do *anything* in software. But should you?Darrin Westhttp://www.blogger.com/profile/08387670564219526228noreply@blogger.com0tag:blogger.com,1999:blog-1757856588756891210.post-37353308824378504452010-12-22T23:58:00.000-08:002010-12-22T23:58:12.401-08:00The Real Priorities of Online Game EngineeringI was trying to communicate to management that server developers have different priorities than game developers. As a means to show the importance of laying in administrative infrastructure, and other software engineering "overhead", I put this list together. Hope it helps you to think about making the right investment in making the system sustainable, and make those points to the powers that be.<br />
<br />
<br />
This is a list of priorities in absolute order of importance. While it is good to address all of them, if we don’t have one of the higher priority requirements solved to a reasonable degree, there is not much point in having the lower ones.<br />
<br />
I made this to help us focus on what is important, what order to do things, and what we might cut initially. I’d love to debate this over lunch with anyone. I’m hoping others think of more of these kind of driving requirements.<br />
<ol><li>Don’t get sued. In particular, always deal with child safety. We also need to abide by our IP contract obligations (sometimes including the shipping date). Better to not ship than get sued into oblivion or go to jail.</li>
<li>Protect the game’s reputation. Even if the game is awesome, if the public thinks it isn’t or the service is poor, then we lose. This is especially important early in the lifecycle. This implies not shipping too early.</li>
<li>Be able to collect money. Even if there is no game.</li>
<li>Be able to roll out a new version periodically. Even if the game is broken or not finished, this means we can fix it. This implies:</li>
<ol><li>You can make a build</li>
<li>You can QA it</li>
<li>You can deploy it without destroying what is already there, or at least roll back</li>
</ol><li>Support effort is sustainable. If the game and system are perfect, but it needs so much handholding that our staff burns out, or we don’t have the resources to extend it, we still fail. This implies lots of stuff:</li>
<ol><li>It is stable enough that staff is not working night and day to hold its hand.</li>
<li>There is enough automated maintenance to limit busy work</li>
<li>There is enough automated logging, metrics and alarms to limit time spent hovering</li>
</ol><li>The cost of operating is not excessive. I.e. it sits lightly enough on the hardware that we don’t need massive amounts, or exotic types. (Special warning to engineers: it is all the way down here before we start to care about performance. And the only reason to care about performance is operating cost.)</li>
<li>Enough players can connect. This implies lots of stuff:</li>
<ol><li>The cluster hardware exists at all, the network is set up, etc</li>
<li>There is a web site</li>
<li>Key platform and login features exist</li>
<li>There are enough related server and game features</li>
</ol><li>The server is sufficiently stable that players can remain connected long enough. This implies lots of stuff:</li>
<ol><li>It stays up.</li>
<li>There are no experience-ruining bugs or tuning problems.</li>
<li>Not too much lost progress when things crash.</li>
<li>The load is not allowed to get too high (population caps)</li>
</ol><ul><ul><li><b>This is probably about where we need to get before Closed Beta.</b></li>
</ul></ul><li>Revenues exceed cost of operation. And eventually, cost of development. This implies not shipping too late. Note that you don't *have* to get to this point immediately. And that this is more important than having a fun game.</li>
<li>The game is fun. This implies so much stuff, I won’t write it all down. Note that the requirement to not ruin the game's reputation can move some of this stuff earlier. But don't fool yourself. If you are making money on a game that is not fun, is that bad? I'm sure you can think of some examples of this. Here are some server-specific implications: </li>
<ol><li>You aren’t put in a login queue for too long. You don’t have trouble finding a good time to play.</li>
<li>You aren’t dropped out of the game too often.</li>
<li>The feeling of lag is not that bad.</li>
<li>You can find people to play with. It is an online game, after all.</li>
</ol><li>Players cannot ruin one another’s fun. Note that making the game cheat proof is not the requirement here. The only reason you care about cheating is if other players perceive it badly enough (reputation), or if the players are keeping you from making money.</li>
<ol><li>They cannot grief one another, especially newbs.</li>
<li>They cannot bring down the server</li>
<li>They cannot ruin the gameplay or economy, making swaths of gameplay pointless or boring.</li>
</ol><li>The server can scale to very large numbers of players. This is the profit multiplier.</li>
</ol>Be honest with yourself. Are you over engineering? Solving fun technical problems that don't actually address any of these Real Priorities? Doing things in the right order? Remember, as the online engineer, you represent these priorities to management. They may not (yet) understand why this order is important.Darrin Westhttp://www.blogger.com/profile/08387670564219526228noreply@blogger.com1tag:blogger.com,1999:blog-1757856588756891210.post-57652012251845593452010-12-14T23:30:00.000-08:002010-12-14T23:30:10.860-08:00"Restraint in what you ask for is the key to success" - Arnold's AphorismA good friend of mine sent me an email about our current sprint (we use a relatively formal Scrum process), encouraging us to define our stories such that the team could experience a success. He is also a great student of military history. I assumed the quote below was from some past general or philosopher. He claims not. So I've named it after him and am spreading the word.<br />
<br />
"Restraint in what you ask for is the key to success" - Arnold's Aphorism<br />
<br />
I can picture Napoleon fighting with himself over asking his men to achieve a little too much and fail. Or ask less and wind up with better morale. Applying this to software engineering teams makes a lot of sense. If you always ask for more than can be successfully completed, the team may feel unsuccessful/underachieving, and you may feel disappointed. When in fact, they are doing almost the same work whether you expected more or not.<br />
<br />
Here's the thing, though. It's been said before. In all things, moderation. Time management, and interpersonal relationship books talk about expectation management. "Under promise and over deliver". The pessimist is never disappointed and often surprised. Live beneath your means. "Let go". Thou shalt not covet.<br />
<br />
This is probably more true of things we ask of ourselves: get this, do that, convince them...<br />
<br />
What struck me in the phrasing of this old-new aphorism is how clearly it shows that you have the *choice* in setting the expectations. And yet, that goal is the very thing that defines success. And for many goal-oriented folks like us, that success defines our happiness. Ergo, we choose to be happy. Or not.<br />
<br />
But it requires self discipline. Not so much in the exercise of effort, but in the restraining of wanting.<br />
<br />
You may have heard analyses of the great American marketing machine. It generates unrequited desires, while offering to fulfill them (for a price). The promise of happiness. Oddly, it is more commonly the restraint of those desires that leads to happiness, not the fulfillment.<br />
<br />
I think you are going to see many more instances of this wisdom over the next while. See if you can't identify the aspect that you control.Darrin Westhttp://www.blogger.com/profile/08387670564219526228noreply@blogger.com0tag:blogger.com,1999:blog-1757856588756891210.post-20967772641711257932010-11-10T20:12:00.000-08:002010-11-10T20:12:31.615-08:00Big load testingI am a real fan of using the real client to do load testing. Your QA engineers will spend a lot of time building regression tests that verify the behavior of the game is still the same and no new bugs have been introduced. That entails adding scripting or player behavior "simulation" to the game code, but also includes creating the scripts that test the functionality of the game. Those test cases are really important and ideally cover almost all of the game functionality. And they have to be kept up to date as the code in the game changes.<br />
<br />
Why not reuse all that work to help load test the server? The scaffolding, client hooks, and test cases?<br />
<br />
One of my favorite ways of doing this is to have a test driver that picks random test cases and throws them at the server as fast as possible. Even if the test case involves sleeping or waiting for something like the character walking across some area of the game, if you run enough of them at the same time, you can generate significant load. And it is going to be more realistic than any other kind of test prior to having zillions of real players. It also saves you from having to reproduce the protocol and behavior of the real client and maintain it as the game team evolves everything.<br />
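A minimal sketch of that kind of driver (all names are mine, and a real one would run actual scripted client test cases rather than `Runnable`s): a pool of simulated clients, each looping and picking a random test case to run as fast as it can:

```java
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: N simulated clients, each repeatedly picking a random regression
// test case and running it against the server, back to back.
public class LoadDriver {
    public static int run(List<Runnable> testCases, int clients, int iterationsPerClient) {
        ExecutorService pool = Executors.newFixedThreadPool(clients);
        AtomicInteger completed = new AtomicInteger();
        for (int c = 0; c < clients; c++) {
            pool.submit(() -> {
                Random rng = new Random();
                for (int i = 0; i < iterationsPerClient; i++) {
                    // pick a random scripted test case, just as QA would run it
                    testCases.get(rng.nextInt(testCases.size())).run();
                    completed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }
}
```

Even when individual test cases sleep or wait, enough of these running concurrently adds up to significant and realistically shaped load.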
<br />
Why not? Even if you take the time to make a headless version of the client, it is probably going to be so resource heavy that you will have trouble finding enough machinery to really ramp up. Most games are designed to tick as fast as possible to give the best framerate, but a headless client doesn't draw, so that is a waste of CPU. Some games rely on the timing intrinsic in animations to control walk speed or action/reaction times for interactions. But you want to strip out as much content as possible to save memory. Clearly there is a bunch of work needed to reduce the footprint of even a headless client. But they really are useful.<br />
<br />
One thing you can do to make them more useful is construct a mini server cluster and see how it stands up to as many clients as you can scavenge.<br />
<br />
You can get hold of more hardware than you might think by "borrowing" it at night from the corporate pool of workstations. You will need permission, and you will want a foolproof packaging so your clients can be installed (and auto-updated) without manual intervention or a sophisticated user. There is nothing like a robot army to bring your server to its knees. IT doesn't like this idea very much because they like to use night time network bandwidth for doing backup and stuff.<br />
<br />
Another important trick is to observe the *slope* of performance changes relative to the change in load you throw at the server. If the marginal effect (incremental server load divided by incremental client load) is > 1 you have a problem. Some people call this non-linear or non-scalable performance. Although, to be technical, it is non-unitary. Non-linear means it is even worse than y = a*x + b. E.g. polynomial (x^2), or exponential (y = a^x). Generally you can find the low hanging fruit pretty easily. If the first 500 connected clients caused a memory increase of 100 MB, but the second 500 consumed 200 MB, you have a problem. Obviously this applies to CPU, bandwidth and latency. And don't forget to observe DB latency as you crank up both the number of clients and the amount of data already in the DB. You may have forgotten an index.<br />
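The slope check is just arithmetic. Here's a sketch (my own helper names) using the memory example above: the first 500 clients cost 0.2 MB each, the second 500 cost 0.4 MB each, so the marginal cost doubled between increments, which is the warning sign:

```java
// Sketch of the "watch the slope" check: compare the marginal resource
// cost of successive load increments.
public class MarginalCost {
    // resource consumed per added client over one load increment
    public static double perClient(double addedClients, double addedResource) {
        return addedResource / addedClients;
    }

    // > 1.0 means the second increment cost more per client than the first:
    // worse-than-unitary scaling, and a bug to go hunt down
    public static double degradation(double firstPerClient, double secondPerClient) {
        return secondPerClient / firstPerClient;
    }
}
```

The same two functions apply unchanged to CPU, bandwidth, latency, or DB query time; only the units differ.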
<br />
But you may still not have enough insight, even given all this. The next step could be what I call a light-weight replay client or a wedge-client. The idea is to instrument the headless client, or graphical client and record the parameters being passed into key functions like message send, or web service calls. You are inserting a wedge between the game code and the message passing code. The real client can then be used to create a log of all the interesting data that is needed to stress the server. You would then create a replay client that uses only the lower level libraries. It would read the logs, passing the recorded parameters into a generic function that reproduces the message traffic and web requests. It doesn't have to understand what it is doing. The next step is to replace the values of key parameters to simulate a variety of players. You could use random player ids, or spend some more time having the replay client understand the sequences of logs and server responses. E.g. it could copy a session ID from a server response into all further requests.<br />
<br />
Since you are wedging into existing source code, this approach is way easier than doing a network level recording and playback. That would require writing packet parsing code, and creating a state machine to try to simulate what the real client was doing. Very messy.<br />
<br />
You might still not be able to replay enough load. Perhaps you don't have enough front end boxes purchased yet, but you want to stress your core server. The DB or event processing system. We use a JMS bus (it is great for publish/subscribe semantics that allows for loose coupling between components) to tie most things together on the back end. We built a record/replay system that pulls apart the JMS messages and does parameter replacement much like the wedge client described above. It is pretty simple to simulate thousands of players banging away. Not every client event results in a back end event that affects the DB.<br />
<br />
So what we are planning on doing is:<br />
a) build a mini-cluster with just a few front end boxes<br />
b) use QA's regression test cases to drive them to their knees looking for bad marginal resource usage<br />
c) use wedge recordings and replay if needed for even more load on the front end boxes<br />
d) use the JMS message replay system to drive the event system and DB to its knees, also looking for bad marginal usage.<br />
e) do some shady arithmetic to convince ourselves that the simulated client count that resulted in X% utilization of our test cluster will allow us to get to our target client count in the remaining 100-X% utilization available and the new hardware we plan to have in production.Darrin Westhttp://www.blogger.com/profile/08387670564219526228noreply@blogger.com2tag:blogger.com,1999:blog-1757856588756891210.post-54317730589552988122010-10-28T00:16:00.000-07:002010-10-28T00:16:18.417-07:00Use of public/private keys to avoid a race conditionI inherited an interesting technique when I took over the server of SHSO. The use of signatures to avoid a race condition that can occur when a player is handed off between hosts. The use case is: the client requests that the Matchmaker connect them to a game "room" server that has a new game session starting. That room needs correct information about the player, what level they are, what they own, etc. How do you get it to the room before the client connects to that room and the data is needed? A couple of approaches:<br />
<br />
<ol><li>You don't. You just leave the client connected but in limbo until their data arrives through the back end asynchronously. This is a pretty normal approach. Sometimes the data arrives before the client makes the connection. So you have to cover both of those cases.</li>
<li>You don't. You leave the client in limbo while the room fetches the data through the back end synchronously. This is also pretty normal, but blocks the thread the room is running on, which can suck, especially if other rooms on the same host and process also get blocked. Yes, you could multithread everything, but that is not easy (see my manifesto of multithreading!). Or you could create a little state machine that tries to remember what kind of limbo the client is in: just-connected, connected-with-data-fetch-in-progress, etc. Personally, I don't allow the processes that run the game logic to open a connection directly to the DB and do blocking queries. DB performance is pretty erratic in practice, and that makes for uneven user experience.</li>
<li>Or have the data arrive *with* the connection. From the client. Interesting idea. But can you trust the client? That is where signed data comes in.</li>
</ol><br />
A quick review of cryptography. Yes, I am totally oversimplifying it, and skipping a lot of the interesting bits, and optimization and stuff. However, it is fun to talk about...<br />
<br />
A public/private key works by taking advantage of mathematics that is easy to compute in one direction, but really hard to compute in the other direction. The most common example is factoring very large integers that are the product of two very large prime numbers. There is only one such factorization, but to find it you have to try pretty much every prime number up to the square root of the product, and that can take a looong time.<br />
<br />
A public key can be used to encrypt plain text, and the private key is the only thing that can be used to unencrypt it. That means only the owner of the private key can read the plain text; anyone else that had access to the public key cannot.<br />
<br />
On the other hand, a signature is created by *unencrypting* the plain text using the private key. The public key can then be used to *encrypt* the signature and test if the result equals the plain text again, thereby verifying the signature, and proving that the signature and plain text came from the owner of the private key exactly as they sent it.<br />
<br />
Back to the story...the player data is signed using the private key by the Matchmaker and delivered to the client when the Matchmaker directs the player to the correct room. The client cannot tamper with their data without getting caught. The client then sends the data to the game room with the signature when it connects. The room server checks the signature using the public key, and can tell that the data is unmodified and came indirectly from the Matchmaker.<br />
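Here is a runnable sketch of that sign-then-verify handoff, using textbook RSA with deliberately tiny primes. Everything below is illustrative: real systems use 2048+ bit keys and a proper padding scheme (e.g. RSA-PSS) rather than raw modular exponentiation of a hash.

```python
import hashlib
import json

# Toy RSA key pair. P and Q are the 10,000th and 100,000th primes -- small
# enough to factor instantly, which is exactly why real keys must be enormous.
P, Q = 104729, 1299709
N, E = P * Q, 65537                      # public key: (N, E)
D = pow(E, -1, (P - 1) * (Q - 1))        # private key: (N, D)

def _digest(player_data: dict) -> int:
    # Canonical serialization so both sides hash identical bytes.
    blob = json.dumps(player_data, sort_keys=True).encode()
    return int.from_bytes(hashlib.sha256(blob).digest(), "big") % N

def matchmaker_sign(player_data: dict) -> int:
    # "Unencrypt" the hash with the private key, as described above.
    return pow(_digest(player_data), D, N)

def room_verify(player_data: dict, signature: int) -> bool:
    # "Encrypt" the signature with the public key; it must equal the hash.
    return pow(signature, E, N) == _digest(player_data)

data = {"player_id": 42, "level": 9, "inventory": ["sword-of-uberness"]}
sig = matchmaker_sign(data)   # delivered to the client along with the data
assert room_verify(data, sig)         # untampered: the room accepts it
data["level"] = 99                    # the client tries to cheat...
assert not room_verify(data, sig)     # ...and the room rejects the data
```

Note the client only ever carries `data` and `sig`; neither key ever reaches it, which matches the scheme in the post.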
<br />
Why not just encrypt the data? The room server could have been set up to be able to unencrypt. Answer: The client wants to read and use its player's data. It wouldn't be able to do that if it were encrypted. And sending it twice (plain, and encrypted) is a waste.<br />
<br />
One interesting thing to note is that the client never saw either the public key or the private key.<br />
<br />
OK. I know it seems pretty wasteful to send all this data down to the client, just to have it sent back up to a different host in the server. After all, we are talking about bandwidth in and out of the data center, and more importantly, bandwidth in and out of the client's home ISP connection. Only half bad. The client needed the data anyway. It is not the greatest approach, but it is what we have. As Patton is quoted: <b>A good solution applied with vigor now is better than a perfect solution applied ten minutes later.</b> <br />
<br />
BTW, another reason this was built this way was that originally the system couldn't predict which room the player would wind up in, so the client needed to bring their data with them.<br />
<br />
And that is a segue to the topic of scaling across multiple data centers. It might start to make sense of that extra bandwidth question.Darrin Westhttp://www.blogger.com/profile/08387670564219526228noreply@blogger.com2tag:blogger.com,1999:blog-1757856588756891210.post-45097527598746134852010-10-25T00:05:00.000-07:002010-10-25T00:05:24.833-07:00Architect it for Horizontal DB ScalabilityThe performance of the database in an MMO tends to be the most common limiting factor in determining how many simultaneous players an individual Cluster can support. Beyond that number, the common approach is to start creating exact copies of the Cluster, and call them Shards or Realms. The big complaint about sharding is that two friends may not be able to play together if their stuff happens to be on different shards. Behind the scenes, what is going on is that their stuff is in separate database instances. Each shard only accesses one DB instance because the DB engine can only handle so much load.<br />
<br />
There are a couple of approaches that can almost completely get rid of these limits, both of which depend on creating many DB instances, and routing requests to the right instance.<br />
<br />
<br />
The first is an "automated" version of a character transfer. When a player logs in, they are assigned to an arbitrary cluster of machines, and all their stuff is transferred to the DB for that cluster. The player has no idea which cluster they were connected to; the clusters don't even have names. There are a couple of problems with this: <br />
1) you can't do this if you want the state of your world around the player to be dynamic and persist; the stuff you throw on the ground; whether that bridge or city has been burned. A player would be confused if each time they logged in, the dynamic state of the world was different. This isn't all that common these days, however. Interesting problem, though.<br />
2) the player might not be able to find their friends. "Hey I'm beside the fountain that looks like a banana-slug. Yeah, me too. Well, I can't see you!"<br />
You might be able to deal with this by automatically reconnecting and transferring the friends to the same cluster and DB, but that gets tricky. In the worst case, you might wind up transferring *everyone* to the same cluster if they are all friendly.<br />
<br />
Another approach provides horizontal scalability and is one that doesn't assume anything about how you shard your world, do dungeon instancing, what DB engine you use, or many other complications. That is a nice property, and makes the DB system loosely coupled, and useful across a broad spectrum of large scale persistent online games.<br />
<br />
What dynamic data are you persisting anyway? The stuff you want to come back after a power-failure. Most games these days only care about the state of the player, and their stuff. They may have multiple characters, lots of data properties, inventory of items, skills learned, quests in progress, friend relationships, ... If you sort through all your data, you'll find that you have "tuning" or design data that is pretty much static between releases. And you have data that is "owned" by a player.<br />
<br />
To a DB programmer, that fact means that the bulk of the data can be indexed with the player_id as at least part of its primary key. So here is the obvious trick:<br />
Put all data that belongs to a given player into a single DB. It doesn't matter which DB or how many there are. You have thousands or millions of players. Spreading them horizontally across a large number of DB instances is trivial. You could use modulus (even player_ids to the left, odd to the right). Better would be to put the first 100,000 players in the first DB, then the next 100,000 into a second. As your game gets more successful, you add new DB instances.<br />
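The range-based assignment above might look like the sketch below. The instance names and block size are made up; the point is that routing is a pure function of player_id, and growing capacity is just appending an instance, whereas modulus routing would reshuffle every existing player whenever an instance is added.

```python
# Range-based shard routing: each block of 100,000 player_ids maps to the DB
# instance that was provisioned when those accounts were created, so existing
# players never move when capacity grows.

PLAYERS_PER_DB = 100_000
db_instances = ["db-0", "db-1"]          # grows as the game gets successful

def db_for_player(player_id: int) -> str:
    index = player_id // PLAYERS_PER_DB
    if index >= len(db_instances):
        raise LookupError(f"no DB provisioned for player {player_id}")
    return db_instances[index]

assert db_for_player(12) == "db-0"
assert db_for_player(150_000) == "db-1"
db_instances.append("db-2")              # success! provision another instance
assert db_for_player(250_000) == "db-2"
```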
<br />
A couple of simple problems to solve:<br />
1) you have to find the right DB. Given that any interaction a player has is with a game server (not the DB directly), your server code can compute which DB to consult based on the player_id it is currently processing. Once it decides, it will use an existing connection to the correct DB. (It is hard to imagine a situation where you would need to maintain connections to 100 DB instances, but that isn't really a very large number of file descriptors in any case.)<br />
2) If players interact and exchange items, or perform some sort of "transaction", you have to co-persist both sides of the transaction, and the two players might be on different DB instances. It is easy to solve the transactional exchange of items using an escrow system. A third party "manager" takes ownership of the articles in the trade from both parties. Only when that step is complete will the escrow object hand each party's articles over to the other party. The escrow object is persisted as necessary, and can pick up the transaction after a failure. The performance of this system is not great. But this kind of interaction should be rare. You could do a lot of this sort of trade through an auction house, or in-game email where ownership of an item is removed from a player and their DB and transferred to a whole different system.<br />
3) High-speed exchange of stuff like hit points, or buffs, doesn't seem like the kind of thing players would care about if there was a catastrophic server failure. They care about whether they still have the sword-of-uberness, but not whether they are at full health after a server restart.<br />
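The escrow idea in (2) can be sketched as a tiny state machine. The names are illustrative; a real implementation would persist each state transition to the escrow manager's own DB so it can resume (or refund) after a crash, as the post describes.

```python
# Escrow trade across two DB instances: the manager takes ownership of both
# sides' items first, and only hands them across once it holds everything.

class EscrowTrade:
    def __init__(self, trade_id, left_player, right_player):
        self.trade_id = trade_id
        self.players = {left_player: None, right_player: None}
        self.state = "collecting"

    def deposit(self, player, items):
        # Step 1: each player's own DB transfers ownership TO the escrow.
        assert self.state == "collecting"
        self.players[player] = list(items)
        if all(v is not None for v in self.players.values()):
            self.state = "ready"   # persist this transition before settling

    def settle(self):
        # Step 2: escrow owns both sides, so it can hand the items across.
        assert self.state == "ready"
        a, b = self.players
        self.state = "done"
        return {a: self.players[b], b: self.players[a]}

trade = EscrowTrade("t1", "alice", "bob")
trade.deposit("alice", ["sword"])
trade.deposit("bob", ["100 gold"])
assert trade.settle() == {"alice": ["100 gold"], "bob": ["sword"]}
```

Because each deposit is a single-DB ownership transfer, neither player DB ever needs a distributed transaction with the other; only the escrow record spans the trade.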
<br />
Some people might consider functional-decomposition to get better DB performance. E.g. split the DB's by their function: eCommerce, inventory, player state, quest state, ... But that only gets you maybe 10 or 12 instances. And the inventory DB will have half the load, making 90% of the rest of the hardware a waste of money. On the other hand, splitting the DB with data-decomposition (across player_id), you get great parallelism, and scale up the cost of your hardware linearly to the number of players that are playing. And paying.<br />
<br />
Another cool thing about this approach is that you don't have to use expensive DB tech, nor expensive DB hardware. You don't have to do fancy master-master replication. Make use of the application knowledge that player to player interaction is relatively rare, so you don't need transactionality on every request. Avoid that hard problem. It costs money and time to solve.<br />
<br />
There is a phrase I heard from a great mentor thirty years ago: "embarrassingly parallel". You have an incredible number of players attached, and most actions that need to be persisted are entirely independent. Embarrassing, isn't it?<br />
<br />
Now your only problem is how big to make the rest of the cluster around this monster DB. Where is the next bottleneck? I'll venture to say it is the game design. How many players do you really want all stuffed into one back alley or ale house? And how much content can your team produce? If you admit that you have a fixed volume of content, and a maximum playable density of players, what then?Darrin Westhttp://www.blogger.com/profile/08387670564219526228noreply@blogger.com1tag:blogger.com,1999:blog-1757856588756891210.post-71879062763562779752010-10-21T13:16:00.000-07:002010-10-21T13:16:21.924-07:00A lightweight MMO using web techI've been really heads-down on <a href="http://www.heroup.com/" linkindex="101">Super Hero Squad Online</a>. It has been getting really great reviews, and is very fun to play. Beta is coming up, and we are expecting to go live early next year.<br />
<br />
So I thought I'd try to cut some time free to record my thoughts about its architecture. But not today. We are getting ready for an internal release tomorrow.<br />
<br />
But here are some topics I'd like to pursue:<br />
<ul><li>scaling across multiple data centers</li>
<li>horizontal DB scalability</li>
<li>an approach that avoids back end race conditions when handing off a client between front end boxes</li>
<li>big load testing</li>
</ul>Darrin Westhttp://www.blogger.com/profile/08387670564219526228noreply@blogger.com2tag:blogger.com,1999:blog-1757856588756891210.post-53686410622410155952010-02-14T17:08:00.000-08:002010-02-14T17:09:43.730-08:00Not seamless and not funA lot of the techniques I've discussed have been intended to address the difficult challenges in creating seamless worlds: interest management, load balancing, coordinate systems, consistency, authoritative servers,... But I want to take a moment to say: just because you can, and perhaps have solved these hard technical problems doesn't make you successful. You can avoid the problems by making other game design choices (compromises?) like instancing.<br />
<br />
In fact, something like instancing may be inevitable in your game, if you want it to be fun. For example, a fun quest might require no one else be able to spoil things by stealing kills or your hard won loot.<br />
<br />
And even if you have a nice seamless world, and you've squeezed out difficult latency issues in the seamless areas, you can still screw up the game so badly people drop out. I played DDO the other day. Turbine is famous for having a pretty good architecture. But boy, is it a pain to play that game. Line up on the crate; hit the crate; walk forward; click on the coins; repeat; repeat; repeat. Line up on a monster, press the mouse and hold it, and hold it, and hold it. Wander around in circles in a dungeon and get lost. Run from one corner of the dungeon to the other pulling levers. Do a quest, report back to the same dude. Gah! Didn't they ask a newb if all that was fun? No wonder people are micro-trans-ing their way up levels.<br />
<br />
The newb experience requires repeated entering of instances. It happens two or three times in the first 15 minutes. And it takes 20 seconds or so to load (I'm trying not to exaggerate). But 20 seconds in an "action" game is forever. Ok. Your server architecture and your game design require instances. Fine. But what did the newbs say about how fun the experience was? I, for one, didn't like it. Did the developers sit around and say: well, tough, that's the way it has to work. Maybe. But you *can* fix the experience.<br />
<br />
Have auto-loot turned on by default. Don't hide the loot in 50 crates; give out more on a kill or a quest completion. Pre-load the instances in the background whenever the player gets near the entrance. It doesn't have to be seamless, but at least *try* to make it fun. Or less teeth grating-ly aggravating. Sorry, I'm not patient enough for it to "start getting fun after level 20". Even if it is free.<br />
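The pre-loading suggestion can be sketched with a background worker: start the expensive load as soon as the player gets near the entrance, so by the time they click "enter" the result is usually already there. The radius, names, and stand-in load function below are all made up for illustration.

```python
# Hide the instance load time by kicking it off in the background when the
# player approaches the entrance, instead of when they step through it.

from concurrent.futures import ThreadPoolExecutor

PRELOAD_RADIUS = 50.0
_pool = ThreadPoolExecutor(max_workers=2)
_preloads = {}   # instance_id -> Future for the in-flight load

def load_instance(instance_id):
    # Stand-in for the real 20-second asset/entity load.
    return f"{instance_id} loaded"

def on_player_moved(player_pos, entrance_pos, instance_id):
    dx = player_pos[0] - entrance_pos[0]
    dy = player_pos[1] - entrance_pos[1]
    near = (dx * dx + dy * dy) ** 0.5 < PRELOAD_RADIUS
    if near and instance_id not in _preloads:
        _preloads[instance_id] = _pool.submit(load_instance, instance_id)

def on_enter_instance(instance_id):
    # Best case: the future finished while the player walked over, and this
    # returns instantly; worst case it blocks for whatever load time remains.
    fut = _preloads.get(instance_id) or _pool.submit(load_instance, instance_id)
    return fut.result()

on_player_moved((10.0, 0.0), (0.0, 0.0), "crate-warehouse")   # within radius
assert on_enter_instance("crate-warehouse") == "crate-warehouse loaded"
```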
<br />
--end first rant--<br />
<br />
I heard someone the other day say "hey we can't give away the good stuff, the players should have to earn it (or pay for it)." Why? Give the newbs the good stuff right away. Let them get hooked. I think it was Warhammer 40k where the first mission let you play some of the best vehicles and blow a lot of stuff to scrap. Pretty fun. Then the story line took it away. But at least you knew it was going to be a fun game. After only 5 minutes.<br />
<br />
Let's get creative. Just because there appear to be insurmountable limitations (instancing, giving away the good content...) doesn't mean they really *are* insurmountable.<br />
<br />
Good luck out there.Darrin Westhttp://www.blogger.com/profile/08387670564219526228noreply@blogger.com1tag:blogger.com,1999:blog-1757856588756891210.post-37410904347414953602010-01-18T16:39:00.000-08:002010-01-18T20:04:16.437-08:00Dynamic Load Balancing a Large Scale Online GameWhy bother designing your server to support dynamic load balancing? You can load test, measure and come up with a static load balance at some point before going live, or periodically when live. But...<br /><ul><li>Your measurements and estimates will be wrong. Be honest, the load you simulated was at best an educated guess. Even a live Beta is not going to fully represent a real live situation.<br /></li><li>Hardware specs change. It takes time to finish development, and who knows what hardware is going to be the most cost effective by the time you are done. You definitely don't want to have to change code to decompose your system a different way just because of that.<br /></li><li>Your operations or data center may impose something unexpected, or may not have everything available that you asked for. You might think "throw more hardware at the problem". But if they are doing their jobs, they won't let you. And if you are being honest with yourself, you know that probably wouldn't have worked anyway.<br /></li><li>Hardware fails. You may lose a couple of machines and not be able to replace them immediately. Even if you shut down, reconfigure, and restart a shard, the change to the load balance must be trivial and quick. The easiest way is to have the system itself adjust.<br /></li><li>Your players are going to do many unexpected things. Like all rush toward one interesting location in the game. Maybe the designers choose to do this on purpose using a holiday event. Maybe they would really appreciate if your system could stand up to such a thing so they *could* please the players with such a thing.<br /></li><li>The load changes in interesting sine waves. 
Late at night and during weekdays, the load will be substantially less than at peak times. That is a lot of hardware just idling. If your system can automatically migrate load to fewer machines, and give back leased machines (e.g. emergency overload hardware) you might be able to cut a deal with your hosting service to save some money. Anybody know whether the "cloud" services support this? What if you are supporting multiple titles whose load profiles are offset? You could reallocate machines from one to another dynamically.</li><li>Early shards tend to have quite a lot higher populations than newly opened ones. As incentives to transfer to new shards start having effect, hardware could be transferred so that responsiveness can remain constant while population and density changes.<br /></li><li>The design is going to change. Both during development and as a result of tuning, patches and expansions. If you want to let the designers make the game as fun (and successful) as possible, you don't want to give them too many restrictions.</li><li>It may be painful to think about but your game is going to eventually wind down. There is a long tail of committed players that will stay, but the population will drop. If you can jettison unneeded hardware, you can save money and make that time more profitable. (And you should encourage your designers to support merging of shards.) I am convinced that there is a lot of money left on the table by games that prematurely close their doors.<br /></li></ul>So what can you actually "balance"? You can't decompose a process and reallocate its objects, computation, and data structures between processes. Not without reprogramming. Or programming it in from the beginning. So load balancing entire processes is not likely to cut it.<br /><br />The best way is to design for parallelism up front. Functional parallelism only goes so far. E.g. 
if you have only one process that deals with all banking, you can't split it when there is a run on the bank.<br /><br />So what kinds of things are heavy users of resources? What things are relatively easy to decompose into lots of small bits (then recombine into sensibly sized chunks using load balancing)?<br /><br />Here are some ideas:<br /><ul><li>Entities. Using interest management (discussed in other blog entries), an Entity can be located in any simulator process on any host. There are communication overheads to consider, but those are within the data center. If you are creative, many of the features that compose a server can be represented as an Entity, even though we often limit our thinking to them as game Entities. E.g. a quest, a quest party/group, a guild, an email, a zone, the weather, game and system metrics, ... And of course, characters, monsters, and loot. The benefit of making more things an Entity is that you can use the same development tools, DB representation, execution environment, scripting/behavior system and SDK, ... And of course, load balancing. There are usually a very large number of Entities in a game, making it pretty easy to find a good balance (e.g. bin-packing). Picking up an Entity is often a simple matter of grabbing a copy of its Properties (assuming you've designed your Entities like this to begin with; with load balancing in mind). This can be fast because Entities tend to be small. Another thing I like about Entity migration is that there are lots of times when an Entity goes idle, making it easy to migrate without a player being affected at all. Larger "units" of decomposition are likely to never be dormant, so when a migration occurs, players feel a lag.<br /></li><li>Zones. This is a pretty common approach, often with a number of zones allocated to a single process on a host. As load passes a threshold, the zone is restarted on another simulator on another machine. 
This is a bigger chunk of migration than an Entity, and doesn't allow for an overload within one zone. The designers have to add game play mechanisms to discourage too much crowding together. The zone size has to be chosen appropriately ahead of time. Hopefully load-balancing-zone is not the same as game-play-zone, or the content team will really hate you. Can you imagine asking them to redesign and lay out a zone because there was a server overload?<br /></li><li>Modules. You will decompose your system design into modules, systems, or functional units. Making the computation of each of these be able to be mapped requires little extra work. Although there are usually a limited number of systems (functional parallelism), and there is almost always a "hog" (See <a href="http://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's law</a>). Extracting a Module and moving it requires quite a bit more unwiring than an Entity. Not my first choice. But you might rely on your fault tolerance system and just shut something down in one place, and have it restart elsewhere.<br /></li><li>Processes. You may be in a position where your system cannot easily have chunks broken off and compiled into another process. In this case, only whole processes can be migrated (assuming they do not share memory or files). Process migration is pretty complicated and slow, given how much memory is involved. Again, your fault tolerance mechanism might help you. If you have enough processes that you can load balance by moving them around, you may also have a lot of overhead from things like messages crossing process boundaries (usually via system calls).<br /></li><li>Virtual Machines. Modern data centers provide (for a price) the ability to re-host a virtual machine, even on the fly. Has anyone tested what the latency of this is? Seems like a lot of data to transmit. 
The benefit of this kind of thinking is that you can configure your shard in the lab without knowing how many machines you are going to have, and run multiple VMs on a single box. But you can't run a single VM on multiple boxes. So you have that tradeoff of too many giving high overhead, and too few giving poor balancing options.<br /></li></ul>Remember, these things are different from one another:<br /><ul><li>Decomposition for good software engineering.</li><li>Decomposition for good parallel performance.</li><li>Initial static load balancing.</li><li>Creating new work on the currently least loaded machine.</li><li>Dynamically migrating work.</li><li>Dynamically migrating work from busy to less loaded machines.</li><li>Doing it efficiently, and quickly (without lags).</li><li>And having it be effective.</li></ul>I think balancing load is a hard enough problem that it can't really be predicted and "solved" ahead of time. So I like to give myself as much flexibility as possible ahead of time, and good tools. Even if you don't realize full dynamic migration at first, at least don't box yourself into a corner that requires rearchitecting.Darrin Westhttp://www.blogger.com/profile/08387670564219526228noreply@blogger.com3tag:blogger.com,1999:blog-1757856588756891210.post-67088566804570650412009-12-20T10:18:00.000-08:002009-12-20T13:30:15.089-08:00(Real) Science IS political, sorry to disillusion youThere is the big "climate gate" hoo-haw in the media right now. Reporters are acting surprised that some leading scientists were caught manipulating scientific literature to silence skeptics and dissenters. They convinced peers to knock skeptics' articles, journals to reject skeptics' papers, remove peers from paper review committees if they passed skeptics' papers, and even shut down journals that published dissenting views. Maybe even fudged their data.<br /><br /><irony> Hey! Science is supposed to be awesome. It always eventually gets it right. 
Real science is about reproducible experiments and validated results. </irony> Actually, no, it is political. Like most other human endeavors.<br /><br /><ul><li>Galileo and other historical scientists were shut down by their community. Granted, their scientific peers were not basing their views on empirical data. But many modern scientists still base their views on what they've been taught, not what they've measured. This is understandable; you have to stand on the shoulders of giants to have time to make advances.</li><li>Views are often validated by currently understood standards. If you plot the published speed of light against the year of publication, you will see sequences of flat spots where a value is almost identical to what was previously "measured". And then there is a jump; followed by another flat period of many years. Is this because they were using the same equipment and experimental procedure? Or because anyone that tried to publish a different "answer" was considered a skeptic and shut down? Again, this is understandable. Humans tend to try to be consistent, and not be antisocial and go against the crowd. It requires extra diligence to disprove a well respected master in their field. Like the great Einstein when quantum physics popped up. (Wait! Maybe God changes the real speed of light periodically as a joke!)</li><li>Scientists are funded. Based on whether they get published or referenced. Or if they agree with the "sponsoring" corporation. So the system manipulates them into agreeing with the crowd. But the underbelly of the scientific community is less pretty than what we might have thought of as this kind of indirect pressure. </li><li>Journals are funded. If the larger community of scientists don't subscribe to what that journal offers, or schools or corporations pull their funding (or advertising!), a venue of dissent/questioning dies.</li><li>I've seen public "attacks" during a presentation of research. In the form of a "question". 
Being honest, these questions are self-aggrandizing. E.g. "what makes you think that you are right when I have already published the opposite". It can embarrass a young scientist and discourage them from disagreeing in the future. Only the thick-skinned "crazies" keep at it. Like Tesla.</li><li>I've been on paper review committees where papers are summarily discarded. There are so many that only one or two of the reviewers are assigned to read a given paper before the meeting. If they didn't understand it, or disagreed with the findings (based on their own experience/bias), it can get tossed very quickly. There are a lot to get through. Even when it is "marginal", the shepherding process can be taxing, discouraging a reviewer from volunteering. After all, they are contributing their expertise, but don't get their name on the paper. (Suggestion: maybe they should. If they pass something that proves incorrect, they lose points. As an incentive to get it right. Or would that discourage participation?). For "workshops" (not full Journals) an author's name is sometimes on the paper being reviewed, so their reputation is considered.</li><li>As science gets more "fine", some experiments cannot be reproduced except on the original equipment or by the original experts. Think CERN and supercolliders. (Or cold fusion?) Who has another billion dollars just to *reproduce* a result? Unless the larger community thinks a result is hogwash and feels motivated to pool their resources and dump on the results. So who is going to disagree? It almost sounds like a religion at that point.</li></ul><br />I bet you've done the same things. Maybe at work. Tried to shut down "the competition". Competing for attention, or a raise, or recognition... You know what would be better? Listen to those that disagree and make sure they know you have heard them. They are trying to make the best decisions they can given their background. No one tries to make dumb decisions. 
If they are wrong, I'm sure they would appreciate learning something they don't know. Or, maybe they have something to teach you.<br /><br />Imagine! Learning something from someone that disagrees with you and you find irritating. A dissenter. A skeptic. Seems like those that shut down dissent are not just closed minded, but unwilling to learn. Such a scientist should be embarrassed for themselves. Isn't IDEAL science supposed to be about discovery? Too bad that in reality there is so little ideal science, and so much science influenced by the politics of "real-life science".Darrin Westhttp://www.blogger.com/profile/08387670564219526228noreply@blogger.com0tag:blogger.com,1999:blog-1757856588756891210.post-2595166079717832102009-12-08T13:21:00.000-08:002009-12-08T15:46:30.132-08:00Data Driven Entities and Rapid IterationIt is clearly more difficult to develop content for an online game than a single player game. (For one, sometimes the entities you want to interact with aren't in your process.) So starting with the right techniques and philosophies is critical. Then you need to add tools, a little magic and shake.<br /><br />There are several hard problems you hit when developing and debugging an online game:<br /><ul><li>Getting the game to fail at all. A lot of times bugs are timing related. Of course, once you ship it, to players it will seem like it happens all the time.<br /></li><li>Getting the same failure to happen twice is really hard. E.g. if the problem is caused by multiplayer interaction, how are you going to get all players or testers to redo exactly the same thing? And in the spirit of Heisenbugs, if you attach a debugger or add logging, good luck getting it to fail under those new conditions.<br /></li><li>Testing a fix is really hard, because you want to get the big distributed system back into the original state and test your fix. Did you happen to snapshot that state?<br /></li><li>Starting up a game can take a long time (content loading). 
Starting an online game takes even longer because it also includes deployment to multiple machines, remote restarting, loading some entities from a DB, logging in, ...<br /></li><li>If you are a novice content developer plagued by such a bug or a guy in QA trying to create repro steps to go along with the bug report, it will probably end badly.<br /></li></ul>Consequently, what do you need to do to make things palatable?<br /><ul><li>Don't recompile your application after a "change". Doing that leads (on multiple machines) to shutdown, deploy, restart, "rewind" to the failure point. You'd like to have edit and continue of some sort. To do that, almost certainly you'd need a scripted language (or at least one that does just in time compilation, and understands edit and continue).<br /></li><li>Don't even restart your application. Even if you can avoid recompilation, it can take a loooong time to load up all the game assets. Especially early in production, your pipeline may not support optimized assets (e.g. packed files). For a persistent world, there can be an awful lot of entities stored in the database to load and recreate. Especially if you are working against a live or a snapshot of a live shard. At the very least, only load the assets you need.</li><li>Yes, I'm talking about rapid iteration and hot loading. When you can limit most changes to data, there is no reason you can't leave everything running, and just load the assets that changed. In some cases when things change enough you might have to "reset" by clearing all assets from memory and reloading, but at least you didn't have to restart the server.<br /></li><li>Rapid iteration is particularly fun on consoles, which often have painfully long deployment steps. Bear in mind that you don't have an editor in-game on a console, it is too clumsy. So the content you see in your editor on your workstation is just a "preview". 
You would swivel-chair to the console to see what it really looks like on-target.<br /></li><li>Try to make "everything" data driven. For example, you can specify the properties and behaviors of your entities in a tool and use a "bag of properties" kind of Entity in most cases. After things have settled down, you can optimize the most used ones, but during content development, there is a huge win to making things data-driven. Of course, there is nothing stopping you from doing both at once.</li><li>Another benefit of a data-driven definition of an Entity is that it is so much easier to extract information needed to automatically create a database schema. Wouldn't you rather build a tool to do this than to teach your content developers how to write SQL?</li></ul>Don't forget that most of the time and money in game development pours through the content editor GUIs. The more efficient you can make that, the more great content will appear. If you want to hire the best content developers, make your content development environment better than anyone else's.Darrin Westhttp://www.blogger.com/profile/08387670564219526228noreply@blogger.com0tag:blogger.com,1999:blog-1757856588756891210.post-20834898244247088912009-11-19T13:00:00.000-08:002009-12-08T15:23:50.534-08:00That's not an MMO, it's a...Some people call almost anything "big" an MMO. Is Facebook an MMO? Is a chess ladder an MMO? Is Pogo? Not to me.<br /><br />What about iMob and all the art-swapped stuff by The Godfather on the iPhone? You have an account with one character. Your level/score persists, along with your money and which buttons you've pressed on the "quest" screen. As much as they want to call that an MMO, it is something else. Is it an RPG? Well, there is a character. But you don't even get to see him. Or are you a her?<br />
If you are all into optimizing a problem that scalable web services solved years ago, cool. Your company probably even makes money. But it doesn't push the envelope. Someone is going to eat your lunch.<br /><br />Maybe I should have called this blog "Interesting Online Game Technologies".<br /><br />Me? I want to build systems that let studios build the "hard" MMOs, like large seamless worlds. I don't want a landscape that only has WoW. If that tech were already available, we'd be seeing much lower budgets, more experimentation, games that are more fun, lower subscription fees, more diversity, and better content. All at the same time. I certainly don't want to build the underlying infrastructure over and over.<br /><br />Of course, I'd love it if the tech solved all the <a href="http://onlinegametechniques.blogspot.com/search?q=hard+problems">hard problems</a> so my team could build something truly advanced while still landing the funding. Unfortunately, today, people <a href="http://www.eldergame.com/2009/04/smartfoxserver-the-mmo-engine-for-indies/">have to compromise</a>.
But maybe not for long.<br /><br />Darrin Westhttp://www.blogger.com/profile/08387670564219526228noreply@blogger.com0tag:blogger.com,1999:blog-1757856588756891210.post-55746028590524178582009-08-21T08:10:00.000-07:002009-08-21T08:29:58.333-07:00What is software architecture (in 15 pages or less)?Kruchten's "4+1 views of software architecture" is one of my favorite papers of all time. It shows four views of software architecture (logical/functional, process, implementation/development, and physical). The plus one is use-cases/scenarios.<br /><br />Being 15 pages long, it is an incredibly efficient use of your time if you want to disentangle the different aspects of designing complex systems. And it gives you terminology for explaining which view of the system you mean, and which you have temporarily abstracted away.<br /><br />Don't get hung up on the UML terminology:<br /><ul><li>Logical is what is commonly referred to as a bubble diagram.
What components are in your system and what are they responsible for?</li><li>Process is which OS processes/threads will be running and how they communicate.</li><li>Implementation is the files and libraries you used to build the system.</li><li>Physical is your hardware and where you decided to map your processes.</li></ul> <a href="http://en.wikipedia.org/wiki/4%2B1_Architectural_View_Model">http://en.wikipedia.org/wiki/4%2B1_Architectural_View_Model</a><br />I prefer the original paper: <a href="http://www.cs.ubc.ca/%7Egregor/teaching/papers/4+1view-architecture.pdf">http://www.cs.ubc.ca/~gregor/teaching/papers/4+1view-architecture.pdf</a>Darrin Westhttp://www.blogger.com/profile/08387670564219526228noreply@blogger.com1