Friday, December 12, 2008

Single Writer Authoritative Simulator

We can now see how an Entity can have a local cache populated with all the remote Entity proxies it needs by using Interest Management and data distribution over a publish and subscribe message system.

One tough problem remains. What do you do about race conditions if two simulators modify the same Property on an Entity at around the same time? Lightly digging into this reveals a decent solution where such writes would be resolved at a central location and distributed from there so all consumers see the same ordering. However, allowing multiple writers to a single Property can lead to inconsistencies that need sophisticated transaction management.

The easier approach is to disallow multiple writers. Ensure that all properties of an Entity are modified only by that Entity. Any other Entity that wants to make a change must send a request. This boils down to "Communicating Sequential Processes", and is a well-understood computer science paradigm. Normally the Entity stays on one host for a good period of time, and the Entity is said to be Owned by the associated simulator.

The owning Simulator is said to be the authoritative simulator. All computation that affects that Entity is performed on the owning simulator. The values it computes are pushed out to other interested simulators where they become proxies/reflections/replicas, and are read-only.

The single writer paradigm allows a junior game content developer to remain blissfully unaware of concurrency. They think about one Entity at a time. An interaction with another Entity is not trying to read or write to a concurrently evolving set of variables. Instead it is sending a request to the other Entity, which will eventually get around to handling the request sequentially. The developer can think in single-threaded terms. Yay! In fact, the simulator is also made single-threaded so there can be no mistakes (note this still leaves ways to make good use of multiple cores).

The behavior that is running on an Entity is able to immediately read any of the Properties of Entities in which it has already expressed an interest. Since the simulator is single threaded, this can be done without locks. The properties of the proxies are only updated when the message system is ticked, and since the simulator is single-threaded, that is done after the Entities are done executing. Note that because we use state push, the property values of the proxies have the lowest latency *possible*. We can also apply latency hiding techniques to further improve the proxy's estimate of the value on the authoritative simulator.

All this results in a very accurate and familiar representation of a computing environment that appears to have all Entities on the same machine. But since it is actually distributed, its performance will scale. The distributed nature is abstracted away without impacting the developer.

If you are thinking about multi-entity transactions, you'll have to wait for it...

Monday, December 8, 2008

Publish/Subscribe Message Delivery

In a previous post, I argued that publish/subscribe was the only tricky thing needed for a totally flexible interest management based online game system.

Publish/subscribe (producer/consumer) based message systems give semantics similar to multicast. A producer sends a message to a channel. All current consumers on that channel receive a copy of that message. To avoid becoming broadcast (where every consumer receives every message sent), the messages are decomposed into channels using a Category, one per channel (so you can think of it as a channel id). A Category is an integer so they are trivial to deal with at the lower level (as opposed to strings or something). For simplicity, each message is only sent to one Category.

This system is very loosely coupled giving it a lot of flexibility and extensibility. A producer does not need to know the existence of any of the consumers. The set of consumers and their implementation can change without touching the producer. For example, a logging system could be attached to a channel without affecting the system, and would give good data for debugging.

To implement the publish/subscribe system efficiently, we must manage the producer and consumer subscription requests efficiently. Broadcasting that a consumer is interested in some Category to each producer is too inefficient. So we introduce the notion of a channel manager that keeps track of the interests of all producers and consumers.

The channel manager is responsible for redistributing each data message. A producer sends a message to the channel manager. The channel manager maintains the list of interested consumers, and forwards a copy of the producer's message to each consumer. We have exchanged the non-scalable broadcast of subscription messages for an extra hop of latency for each data message.

The channel manager can easily be made scalable. The simplest approach is to use the integer Category value and a simple a modulus operation to load balance across any number of channel manager processes. Both producers and consumers use the same computation. And all subscriber messages and all data messages on one Category travel through a single channel manager.

This architecture is the obvious one. There are more sophisticated approaches that can reduce the two hop latency by using a direct connections between producers and consumers. The subscription messages still need to route through a channel manager, but the producers need to maintain the list of interested consumers. This adds the requirement that producers subscribe to produce, and adds more subscription messages and more latency on a subscription. There are also subtle data message ordering problems.

If you want to go nuts, you could use real multicast. The challenge there is that there are limited numbers of multicast groups. So you have to solve the problem of multiple channels sharing one multicast group.

So you get to choose. Easy implementation or optimized but tricky implementation. Like most code. In this case I argue that the simple approach has good enough performance for the needs of online games. The producers and channel manager live in a data center on hosts attached to a high speed switch, so network latency is minuscule.

The design philosophy of this system is to minimize unnecessary computation due to unwanted messages arriving on a host that are just thrown away. Hosts cost money. Bandwidth inside the data center is free. So good interest management is key.

So. We have sliced off the publish/subscribe problem. All we have left is how to approach interest management policies which are application specific.

Wednesday, December 3, 2008

Where are memory buses going?

In the 80's and 90's almost all supercomputers were distributed memory systems. Hypercubes, meshes, and a very few SIMD machines. Any concept of a shared address space was simulated. If there was limited support for remote memory access it was through slow and complex transport systems (one example is the BBN Butterfly).

Recently we see desktop machines with "many" cores. For ease of use, these are symmetric multiprocessors. Each processor is equally able to access any address. The interconnect is referred to as a bus, even when not physically implemented that way. There are sophisticated cache coherency mechanisms and inter-processor synchronization instructions which "lock the bus", or invalidate remote cache lines to make it possible to have atomic access to a line of memory for at least one operation (e.g. atomic increment or swap).

But these approaches don't scale (in the computer science sense). Even a "bus" that is a token ring or other network-like transport can only scale so far. Maybe 32 processors. I've seen SGI Origin 2000 and Sun Dragon machines (admittedly in the late 90's) that scaled this large and were still (mostly) symmetric. They used what amounted to a packet switched network and distributed systems techniques to provide atomicity and coherency.

Regardless, the most efficient use of these machines determined by emperical study (and common sense) was to segregate the memory disjointely among the processors. This made the caches more effective since they accessed less of the address space (avoiding issues with Translation Lookaside Buffers), and avoided synchronization issues. One must keep in mind that doing an atomic operation even when there is no current contention can dramatically affect N-1 other processors because the operation flushes the bus or remote cache lines, etc. In the end, we tended to not make use of the symmetry aspects.

So people now talk a lot about Non Uniform Memory Access. For example, blocks of RAM are tightly associated with a processors or small # of cores, but there is also a "global" bus that allows access to the entire address space of the machine. So you have the appearance of a multiprocessor machine, but some addresses are a lot slower to access. The right way to use these machines is identical to what we used to do. Have disjoint blocks of memory per processor (or tightly couple set of cores).

What is interesting about this evolution is that you can see it is moving hardware architecture toward a distributed computing model. The memory "buses" themselves are networks that can have multiple in-flight packets containing memory access requests. Some will have routing or bridging between different disjoint buses/networks within the one machine. But to effectively use this architecture it must be programmed as a distributed system.

Fortunately, we know how to do that. You use a collection of processes (distinct address spaces), and pass messages (optimized to use the high speed memory access bus/net). Communicating Sequential Processes. The beauty here is that such a software system can much more easily be tuned and reconfigured than a "monolithic" multithreaded application as hardware specs change (more processors, different local/remote memory access speeds...).

If you step back another step and think about physics, you can also easily convince yourself that this evolution is permanent. How much compute power can fit into a cubic block of space? It is limited by distance, heat, complexity density... The only way to "grow" that computing power will eventually be increasing the amount of space consumed. In terms of computer science scalability (i.e. taking it to the extreme), space grows as N^3. So we can see that at best, communication speed (the distance/radis to the furthest part of the one computer) and delay, factoring in the speed of light, will grow linearly while computing power will grow as the cube. Thus the scaling eventually is dominated by the communication. Direct communication and *no* synchronization would give the best performance. So we can conclude that distributed memory systems connected with network (even if they act like memory buses) will provide optimal performance.

That is where we are going. I say develop our software with that in mind. Threading seems like a good idea, but it is really a cheap hack that allows two separate processes to have a bunch of shared datastructures. Eventually those shared datastructures will have to be synchronized over a longer distance, so lets start doing that (e.g. create duplicates, watch for edits and send out updates). Using application specific knowledge this can be done *much* more efficiently than a symmetric memory system can, which sends every changed byte and more.

The connection to online games? I've just described a distributed object replication system. And that is the basis of the scalable online game architecture I've been outlining the whole time.

Sunday, October 19, 2008

Back to First Principles: Interest Management

When you think of an Entity needing to interact with its environment, you don't tend to think about arbitrary lines running through the geometry. (There is no colored line on the ground between countries). In fact, players delight in trying to find those exceptions and do things while standing on either side, or jumping back and forth as fast as possible.

The way to think of the problem is that geometric decomposition is solely to support load balancing. Stuff on this side runs on this host, stuff on that side runs on that host. Much of the rest of the system just takes advantage of that assumption. (And it is not such a great assumption.)

But what if we ignore load balancing, and just think of the Entities all over the place trying to interact? At the extreme, each Entity would be on its own host. Now we have classical distributed systems problem, and can tap into that knowledge.

Distributed object technologies, like CORBA, hide the fact that some objects are remote by using a local smart-proxy. Interactions by a locally owned/executed Entity with the local proxy are forwarded to the remote-original object. The big problem here is that CORBA can block the requestor, and the request has a round-trip latency.

The better way to solve this is to ensure the local proxy is already up to date before the local Entity starts interacting. This allows the proxy's state to be as accurate as physically possible (the local proxy is at most out of date by a one-way network latency unit of time).

Now we have to solve the interest management problem. The system wouldn't scale if we broadcast Entity updates (in both network consumption and in space for storing the proxies). Here we rely on a few restrictions that we think are not too onerous. The Entity must declare what it is interested in, and it must never write directly to a local proxy.

The simplest interest management approach is to break the world into tiles. If an Entity can see into a tile at all, it is interested in all of that tile. If another Entity is currently located in that tile, it publishes its state updates to that tile. Using a publish-subscribe communication mechanism, all interested Entities consume every Entity's state that they can see. (There are much more interesting interest management approaches we will discuss later).

The result is, we don't have the nasty load balancing problems of other systems. The host on which an Entity is running doesn't matter. Two Entities can interact with each other no matter where they are hosted. And the simulation operates the same way it would in other systems.

The only remaining technical challenge is building a publish/subscribe system that is reliable and efficient.

Thursday, September 18, 2008

Forced immediate migration is nasty

Let's say we have a solution that somehow doesn't suffer from poor decomposition of space (too many small pieces, or no way to break the load into balanceable pieces -- i.e. having overloads). There is still another very difficult technical problem. When an Entity just crosses a geometric boundary, some of these systems will require the Entity to migrate onto the new host *immediately*. That is because there are assumptions built in elsewhere about where to look for an Entity, or how far an Entity can see/be seen.

The problem with a forced immediate migration is that an Entity might be working on something tricky just then. Some use cases and consequences:
  • Engaged in battle; delays/hitches during marshal/transmit/unmarshal during one of the most critical gameplay experiences
  • Running a script; how to pack up an in-flight Lua or Python script? Turns out Stackless Python supports in-flight script pickling. Another option is to write your own language, and build pickling in your Virtual Machine. Pretty complicated, and possibly slow. What about temporary or global variables; external references. I believe Eve does this, but I'm not sure if they do in-flight migration.
  • Being persisted to the DB, running a financial transaction; anything considered critical and fault sensitive should not be made even more complex by injecting a synchronous but distributed action. You are asking for deadlocks and race conditions in what is by definition the most critical aspect of the system.
There are two possible solutions that both allow a simpler migration system, but as a very beneficial side-effect allow Entities to interact across host boundaries:
  • Do everything event-oriented. This means that there are never any Behaviors outstanding at the end of ticking an Entity. When the migration service runs, each Entity has become just a set of Properties. The problem with this is that content developers find event oriented programming confusing and complicated. They have to explicitly manage a logical context (what is this Entity doing over a period of several events?), or fall back to a state-machine mechanism that adds a different kind of complexity (and more tools). To make it worse, you can still have race conditions (who opened that chest first, vs. who pulled out the loot?).
  • Don't migrate immediately. I think this is the silver bullet, and it is possible to realize. Even more interestingly, it is possible to *never* migrate, and that opens the door to using Entity Behavior technology that is not possible to migrate (e.g. C/C++ running on a Posix thread with pointers hanging out all over the place; computation that is only sensible to run on certain kinds of hardware). And the thought of not paying a migration performance penalty is kind of tantalizing. I'm going too far. In practice you would want to do the migration; there are tons of benefits.
Again, I'm going to make you wait a bit longer before I tell you how all this can work.

More on Geometric Decomposition

I need to amplify something in my previous comments about geometric decomposition. I said "there is a practical minimum size limit to even the dynamic splitting of a region. What does that mean?

Decomposition doesn't affect the distance that a character can see, or the number of other Entities that it can see or interact with. That is controlled by tuning and the game design. If someone wants every Entity in the game to be within 5 m of each other, then every Entity will see every other one.

However, that doesn't have anything to do with *load leveling*. Load leveling is the decision about where to execute the behavior for an Entity. There is no reason that the same host must execute all the Entities in one geometric area. There are many fairly easy ways to do that computation and get the same answer whether the two Entities are owned/executed on the same or different hosts.

In our flash-crowd example, we have actually made things worse by decomposing into smaller pieces of space, and mapping each to a different host. Now when an Entity moves just a little way, it may have to migrate to a new host because it cross a boundary. So our Entity migration rate goes through the roof, costing computation and communication overhead. I can show that it is unnecessary overhead, and unfortunately is applied at the point in the virtual world that is the most busy.

Someone may argue that Entities that are near one another *must* be on the same host to interact. That assumes they are directly reading *and* writing one another's state variables. The problem is that this approach precludes the direct interaction of any two Entities that are across a border (and maybe that border is shifting around if you do this dynamically). A designer, and a player wouldn't understand why there were some places they couldn't do some things. Poor ease of use, mental models of computation that are too complicated.

It turns out that there are pretty simple approaches that allow Entities to interact transparently across hosts. That fact makes all the difference.

Off Topic: Recycle Aluminum Cans?



A quick Google search will show a ton of pages that claim something like "recycling one aluminum can will save enough energy to power a 100 watt light bulb for almost 4 hours". That is actually an awful lot of energy. You could go burn your hand on it any time for 4 hours. Such statistics (all statistics?) make me go "what? really?".

Probably what they mean, since it would be the easiest to measure/compute is power-to-create-from-bauxite minus power-to-create-from-a-can. That doesn't even begin to answer the question.

What about mining and delivery of the bauxite, and other overheads on that end? What about the energy used to collect or separate the cans, and bring them to the plant? I assume things like cleaning and sterilizing are included in power-to-create-from-a-can. And I would discount the human cost of picking them out of the trash, or dropping them in the recycling bin.

Just try to find that information with Google! Or even determine which data is included in the quoted stat. The numbers and wording make me wonder whether all the pages are quoting each other. Maybe the first guy to say it made it up.

Here is a skeptic:
http://www.perc.org/pdf/ps28.pdf

Back to stuff I know something about...

Thursday, September 11, 2008

Geometric decomposition is a bad idea

The computational load in a game scales with the number of Entities. The more NPC's and players, the more the servers have to do.

Load balancing is a good thing because having idle hosts is a waste of money. You'd rather not have bought the hardware, and paying for power, A/C, and maintenance on an unused host is pointless. Ideally, you would have the same (full) load on each host.

Consequently, load balancing is all about mapping Entities to hosts. Dynamic load balancing is about migrating Entities to new hosts.

Many MMOs use the naive approach of decomposing their world into chunks/zones, and then mapping those to different server hosts to provide some load balancing. Smarter ones break the world into many more pieces than there are hosts, and rely on probabilities to provide some kind of load distribution. Really smart ones dynamically decompose and coalesce pieces of geometry as they fill up and empty out.

However, these systems will still face the "flash crowd" problem. "Hey, dudes, there's a blue dragon downtown! Let's go see Thresh get thrashed!". And suddenly 500 players are standing within 100m of each other. Even the dynamically adjusted systems have a lower limit on the useful decomposition of geometry. If you slice down to 10m pieces, you will still be interacting with all the pieces within 100m and all the hosts running them.

It would be better to load balance based on load. Distribute the Entities evenly among all available hosts. The challenge with this approach is how do you interact with Entities that are nearby in the game world if they are running off on some random host? More on that...

Tuesday, September 9, 2008

Failed Replicated Computing on The Sims Online

Another example of how Replicated Computing didn't work in a large scale client/server game...

The Sims content is implemented in a custom scripting language referred to as Edith script. We needed a way to migrate the tons of single-player script content online. Normally you would develop scripts for a client/server architecture by separating logical actions that need an authoritative result out from client actions that add decorative, interactive display. But the mass of existing single player content had them co-mingled.

The other aspect of gameplay was the user would select a game Entity and choose one of several actions that they wanted to perform.

Some of the lead engineers reasoned that we had identical initial state (in the form of a save file), we could route the events requested by a user through the server and have each client play the associated script to result in the same final state (rinse-repeat). Of course you couldn't play graphical actions on the server, so the idea was to make those script builtins nop's on the server, and only do something client-side. Since we had control of the script VM we should be able to make the computation deterministic. Right? Uh. No.

The first test of this approach resulted in drift within seconds. In a level that was empty. The character began choosing "fidget" actions randomly, and wound up heading in different directions. To synchronize the random number generators the seed had to start the same, making it now part of the initial state. But the number of calls to the generator was determined by frame rate, OS scheduling and other client-side environmental issues that couldn't be controlled.

So the slippery-slope began. We found butterfly effects all over the place. Actions were run in different orders. Action requests had side-effects before they were routed through the server. We didn't initially disable game-pause, and buffers backed up and overflowed. ...

The result was the design team could not work. They tried doing development single-player, but this was an online game. No online content was working. We built a manual resync mechanism so the playtesters could get a full state snapshot sent down from the server (ctrl-L; like in emacs!). And we noticed they would hit ctrl-L every 10 seconds "just in case". But that reset every client, and other playtesters got upset when their workflow was interrupted (every 10 seconds).

So we built an automatic resync that detected drift. But for large levels the state snapshot was bigger than the message system could handle. And on and on. Drift-fix. Sync-fix. Timing-fix. Side-effect fix...

We actually *shipped* with a resync that grabbed a state snapshot for each Entity involved in each action and applied it before each action was played out on the client. The only thing that allowed this to work was that the interaction rate was so much lower than a first person shooter that we didn't swamp the server to client network connection.

It was hard, time-consuming, and technically embarassing. So what is the "right" way to do it? More on that...

Monday, September 8, 2008

Replicated Computing (10k Archers)

It is essential for fairness that each player in a multiplayer game sees the same data. A game designer may choose to have some hidden state (see Game Theory), but all the public state must be kept in sync to have a fair shared experience. No matter whether latency is different for each player. No matter if they have different peak bandwidth available.

Some data doesn't matter and is only decorative. Where the gibs fall usually doesn't affect later gameplay. There is only a small loss of shared experience if one player experiences some awesome or amusing effect, but the others don't.

About 4 years ago I heard a GDC talk [reference] that explained how a Microsoft (I think) dev team built a multiplayer RTS and kept all the player's games in sync. They used "replicated computing". They assumed that two clients having an identical initial state, and applying a repeatable/deterministic operation/state change that it would result in both clients having the same resulting state. While this is true in computing theory, it is almost never true in real life.

Why?
* The state are *not* identical. The operation/event is *not* deterministic.
* The timing of the event is not the same (due to network latency issues), and somehow that timing affects the repeatability of the event (e.g. the event is applied during the next "turn" for one client).
* The machines have different processors. In particular, floating point processors do *not* always return the same results as one another. You can get different results when the computation happens in registers vs. in memory, since they tend to have more bits of precision in registers. This leads to the butterfly/chaos effect. A little drift, a little more, and suddenly you are talking about real money!
* Any interaction with an outside system (I/O, time of day, keyboard input, kernel operations...) can return radically different results on the two clients.
* (Pseudo) Random number generation sequences take on radically different values even if you only call it one extra time. Keeping the seeds in sync call by call is hugely expensive, and so is controlling the replicated execution so exactly the same number of calls.
* And many other reasons that are too painful to control.

And that is the moral of the story. They got it working (miraculously), but spent an admittedly *huge* amount of time finding all the reasons things would drift, and finding workarounds for them.

They argued that there was no way to synchronize the state of all the game Entities, because there were so many, and the state size would swamp the network.

Even so, nobody wants to do that kind of cleanup or heroic debugging effort each time you ship a game. And all your work is out the window if a novice script writer breaks some rules.

So what is a better way? More on that...

Online Game Techniques

I want to share some basic and advanced techniques that I favor for use in implementing online games. They apply generally to both small scale and large scale online games. The postings reflect my philosophy and I will try to include a justification of each approach.

I also want to techniques and best practices for parallel and distributed systems. It is intrinsically a very hard problem. Imposing some simple constraints can make such systems possible to implement and debug with a reasonable amount of effort, but also will make it possible to hide almost all the complexity from a user of the system. I want to make these systems approachable by novices but also appreciated by experts.