Thursday, September 18, 2008

Forced immediate migration is nasty

Let's say we have a solution that somehow doesn't suffer from poor decomposition of space (too many small pieces, or no way to break the load into balanceable pieces -- i.e. having overloads). There is still another very difficult technical problem. When an Entity just crosses a geometric boundary, some of these systems will require the Entity to migrate onto the new host *immediately*. That is because there are assumptions built in elsewhere about where to look for an Entity, or how far an Entity can see/be seen.

The problem with a forced immediate migration is that an Entity might be working on something tricky just then. Some use cases and consequences:
  • Engaged in battle; delays/hitches during marshal/transmit/unmarshal during one of the most critical gameplay experiences
  • Running a script; how to pack up an in-flight Lua or Python script? Turns out Stackless Python supports in-flight script pickling. Another option is to write your own language, and build pickling in your Virtual Machine. Pretty complicated, and possibly slow. What about temporary or global variables; external references. I believe Eve does this, but I'm not sure if they do in-flight migration.
  • Being persisted to the DB, running a financial transaction; anything considered critical and fault sensitive should not be made even more complex by injecting a synchronous but distributed action. You are asking for deadlocks and race conditions in what is by definition the most critical aspect of the system.
There are two possible solutions that both allow a simpler migration system, but as a very beneficial side-effect allow Entities to interact across host boundaries:
  • Do everything event-oriented. This means that there are never any Behaviors outstanding at the end of ticking an Entity. When the migration service runs, each Entity has become just a set of Properties. The problem with this is that content developers find event oriented programming confusing and complicated. They have to explicitly manage a logical context (what is this Entity doing over a period of several events?), or fall back to a state-machine mechanism that adds a different kind of complexity (and more tools). To make it worse, you can still have race conditions (who opened that chest first, vs. who pulled out the loot?).
  • Don't migrate immediately. I think this is the silver bullet, and it is possible to realize. Even more interestingly, it is possible to *never* migrate, and that opens the door to using Entity Behavior technology that is not possible to migrate (e.g. C/C++ running on a Posix thread with pointers hanging out all over the place; computation that is only sensible to run on certain kinds of hardware). And the thought of not paying a migration performance penalty is kind of tantalizing. I'm going too far. In practice you would want to do the migration; there are tons of benefits.
Again, I'm going to make you wait a bit longer before I tell you how all this can work.

More on Geometric Decomposition

I need to amplify something in my previous comments about geometric decomposition. I said "there is a practical minimum size limit to even the dynamic splitting of a region. What does that mean?

Decomposition doesn't affect the distance that a character can see, or the number of other Entities that it can see or interact with. That is controlled by tuning and the game design. If someone wants every Entity in the game to be within 5 m of each other, then every Entity will see every other one.

However, that doesn't have anything to do with *load leveling*. Load leveling is the decision about where to execute the behavior for an Entity. There is no reason that the same host must execute all the Entities in one geometric area. There are many fairly easy ways to do that computation and get the same answer whether the two Entities are owned/executed on the same or different hosts.

In our flash-crowd example, we have actually made things worse by decomposing into smaller pieces of space, and mapping each to a different host. Now when an Entity moves just a little way, it may have to migrate to a new host because it cross a boundary. So our Entity migration rate goes through the roof, costing computation and communication overhead. I can show that it is unnecessary overhead, and unfortunately is applied at the point in the virtual world that is the most busy.

Someone may argue that Entities that are near one another *must* be on the same host to interact. That assumes they are directly reading *and* writing one another's state variables. The problem is that this approach precludes the direct interaction of any two Entities that are across a border (and maybe that border is shifting around if you do this dynamically). A designer, and a player wouldn't understand why there were some places they couldn't do some things. Poor ease of use, mental models of computation that are too complicated.

It turns out that there are pretty simple approaches that allow Entities to interact transparently across hosts. That fact makes all the difference.

Off Topic: Recycle Aluminum Cans?

A quick Google search will show a ton of pages that claim something like "recycling one aluminum can will save enough energy to power a 100 watt light bulb for almost 4 hours". That is actually an awful lot of energy. You could go burn your hand on it any time for 4 hours. Such statistics (all statistics?) make me go "what? really?".

Probably what they mean, since it would be the easiest to measure/compute is power-to-create-from-bauxite minus power-to-create-from-a-can. That doesn't even begin to answer the question.

What about mining and delivery of the bauxite, and other overheads on that end? What about the energy used to collect or separate the cans, and bring them to the plant? I assume things like cleaning and sterilizing are included in power-to-create-from-a-can. And I would discount the human cost of picking them out of the trash, or dropping them in the recycling bin.

Just try to find that information with Google! Or even determine which data is included in the quoted stat. The numbers and wording make me wonder whether all the pages are quoting each other. Maybe the first guy to say it made it up.

Here is a skeptic:

Back to stuff I know something about...

Thursday, September 11, 2008

Geometric decomposition is a bad idea

The computational load in a game scales with the number of Entities. The more NPC's and players, the more the servers have to do.

Load balancing is a good thing because having idle hosts is a waste of money. You'd rather not have bought the hardware, and paying for power, A/C, and maintenance on an unused host is pointless. Ideally, you would have the same (full) load on each host.

Consequently, load balancing is all about mapping Entities to hosts. Dynamic load balancing is about migrating Entities to new hosts.

Many MMOs use the naive approach of decomposing their world into chunks/zones, and then mapping those to different server hosts to provide some load balancing. Smarter ones break the world into many more pieces than there are hosts, and rely on probabilities to provide some kind of load distribution. Really smart ones dynamically decompose and coalesce pieces of geometry as they fill up and empty out.

However, these systems will still face the "flash crowd" problem. "Hey, dudes, there's a blue dragon downtown! Let's go see Thresh get thrashed!". And suddenly 500 players are standing within 100m of each other. Even the dynamically adjusted systems have a lower limit on the useful decomposition of geometry. If you slice down to 10m pieces, you will still be interacting with all the pieces within 100m and all the hosts running them.

It would be better to load balance based on load. Distribute the Entities evenly among all available hosts. The challenge with this approach is how do you interact with Entities that are nearby in the game world if they are running off on some random host? More on that...

Tuesday, September 9, 2008

Failed Replicated Computing on The Sims Online

Another example of how Replicated Computing didn't work in a large scale client/server game...

The Sims content is implemented in a custom scripting language referred to as Edith script. We needed a way to migrate the tons of single-player script content online. Normally you would develop scripts for a client/server architecture by separating logical actions that need an authoritative result out from client actions that add decorative, interactive display. But the mass of existing single player content had them co-mingled.

The other aspect of gameplay was the user would select a game Entity and choose one of several actions that they wanted to perform.

Some of the lead engineers reasoned that we had identical initial state (in the form of a save file), we could route the events requested by a user through the server and have each client play the associated script to result in the same final state (rinse-repeat). Of course you couldn't play graphical actions on the server, so the idea was to make those script builtins nop's on the server, and only do something client-side. Since we had control of the script VM we should be able to make the computation deterministic. Right? Uh. No.

The first test of this approach resulted in drift within seconds. In a level that was empty. The character began choosing "fidget" actions randomly, and wound up heading in different directions. To synchronize the random number generators the seed had to start the same, making it now part of the initial state. But the number of calls to the generator was determined by frame rate, OS scheduling and other client-side environmental issues that couldn't be controlled.

So the slippery-slope began. We found butterfly effects all over the place. Actions were run in different orders. Action requests had side-effects before they were routed through the server. We didn't initially disable game-pause, and buffers backed up and overflowed. ...

The result was the design team could not work. They tried doing development single-player, but this was an online game. No online content was working. We built a manual resync mechanism so the playtesters could get a full state snapshot sent down from the server (ctrl-L; like in emacs!). And we noticed they would hit ctrl-L every 10 seconds "just in case". But that reset every client, and other playtesters got upset when their workflow was interrupted (every 10 seconds).

So we built an automatic resync that detected drift. But for large levels the state snapshot was bigger than the message system could handle. And on and on. Drift-fix. Sync-fix. Timing-fix. Side-effect fix...

We actually *shipped* with a resync that grabbed a state snapshot for each Entity involved in each action and applied it before each action was played out on the client. The only thing that allowed this to work was that the interaction rate was so much lower than a first person shooter that we didn't swamp the server to client network connection.

It was hard, time-consuming, and technically embarassing. So what is the "right" way to do it? More on that...

Monday, September 8, 2008

Replicated Computing (10k Archers)

It is essential for fairness that each player in a multiplayer game sees the same data. A game designer may choose to have some hidden state (see Game Theory), but all the public state must be kept in sync to have a fair shared experience. No matter whether latency is different for each player. No matter if they have different peak bandwidth available.

Some data doesn't matter and is only decorative. Where the gibs fall usually doesn't affect later gameplay. There is only a small loss of shared experience if one player experiences some awesome or amusing effect, but the others don't.

About 4 years ago I heard a GDC talk [reference] that explained how a Microsoft (I think) dev team built a multiplayer RTS and kept all the player's games in sync. They used "replicated computing". They assumed that two clients having an identical initial state, and applying a repeatable/deterministic operation/state change that it would result in both clients having the same resulting state. While this is true in computing theory, it is almost never true in real life.

* The state are *not* identical. The operation/event is *not* deterministic.
* The timing of the event is not the same (due to network latency issues), and somehow that timing affects the repeatability of the event (e.g. the event is applied during the next "turn" for one client).
* The machines have different processors. In particular, floating point processors do *not* always return the same results as one another. You can get different results when the computation happens in registers vs. in memory, since they tend to have more bits of precision in registers. This leads to the butterfly/chaos effect. A little drift, a little more, and suddenly you are talking about real money!
* Any interaction with an outside system (I/O, time of day, keyboard input, kernel operations...) can return radically different results on the two clients.
* (Pseudo) Random number generation sequences take on radically different values even if you only call it one extra time. Keeping the seeds in sync call by call is hugely expensive, and so is controlling the replicated execution so exactly the same number of calls.
* And many other reasons that are too painful to control.

And that is the moral of the story. They got it working (miraculously), but spent an admittedly *huge* amount of time finding all the reasons things would drift, and finding workarounds for them.

They argued that there was no way to synchronize the state of all the game Entities, because there were so many, and the state size would swamp the network.

Even so, nobody wants to do that kind of cleanup or heroic debugging effort each time you ship a game. And all your work is out the window if a novice script writer breaks some rules.

So what is a better way? More on that...

Online Game Techniques

I want to share some basic and advanced techniques that I favor for use in implementing online games. They apply generally to both small scale and large scale online games. The postings reflect my philosophy and I will try to include a justification of each approach.

I also want to techniques and best practices for parallel and distributed systems. It is intrinsically a very hard problem. Imposing some simple constraints can make such systems possible to implement and debug with a reasonable amount of effort, but also will make it possible to hide almost all the complexity from a user of the system. I want to make these systems approachable by novices but also appreciated by experts.