Tuesday, September 9, 2008

Failed Replicated Computing on The Sims Online

Another example of how Replicated Computing didn't work in a large-scale client/server game...

The Sims content is implemented in a custom scripting language referred to as Edith script. We needed a way to migrate the tons of single-player script content online. Normally you would develop scripts for a client/server architecture by separating the logical actions that need an authoritative result from the client actions that add decorative, interactive display. But in the mass of existing single-player content the two were co-mingled.

The other aspect of gameplay was that the user would select a game Entity and choose one of several actions to perform.

Some of the lead engineers reasoned that, since we had identical initial state (in the form of a save file), we could route the events requested by a user through the server and have each client play the associated script to arrive at the same final state (rinse, repeat). Of course you couldn't play graphical actions on the server, so the idea was to make those script builtins no-ops on the server, and have them do something only client-side. Since we had control of the script VM we should be able to make the computation deterministic. Right? Uh. No.
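To make the idea concrete, here is a minimal sketch of replicated computing as described above. It is not the Edith VM or our actual code; `Replica`, `play_animation`, and `broadcast` are hypothetical names, and the "script" is reduced to a single state change plus one graphical builtin.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of the simulation (server or client). Hypothetical sketch."""
    is_server: bool
    state: dict = field(default_factory=dict)
    drawn: list = field(default_factory=list)

    def play_animation(self, name):
        # Graphical builtin: a no-op on the server, real work only on clients.
        if not self.is_server:
            self.drawn.append(name)

    def apply(self, event):
        # The "script": a logical change and a decorative one, co-mingled.
        entity, action = event
        self.state[entity] = action
        self.play_animation(f"{entity}:{action}")


def broadcast(replicas, ordered_events):
    # The server fixes one order; every replica plays the same events.
    for event in ordered_events:
        for replica in replicas:
            replica.apply(event)


initial = {"sim1": "idle"}  # stands in for the shared save file
server = Replica(is_server=True, state=dict(initial))
client_a = Replica(is_server=False, state=dict(initial))
client_b = Replica(is_server=False, state=dict(initial))
broadcast([server, client_a, client_b], [("sim1", "cook"), ("sim1", "eat")])
```

In this toy version all three replicas end in the same state, because nothing in `apply` depends on anything outside the event stream. The real scripts had no such guarantee.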

The first test of this approach resulted in drift within seconds. In a level that was empty. The character began choosing "fidget" actions randomly, and the copies on different clients wound up heading in different directions. To synchronize the random number generators the seed had to start the same, making it now part of the initial state. But the number of calls to the generator was determined by frame rate, OS scheduling and other client-side environmental issues that couldn't be controlled.
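The random number problem is easy to reproduce in miniature. The sketch below assumes a hypothetical `simulate` loop in which every rendered frame draws from the same generator that later picks the fidget direction; the frame counts and direction names are made up for illustration.

```python
import random

DIRECTIONS = ["north", "south", "east", "west"]


def simulate(seed, frames_per_tick, ticks=20):
    """Return the fidget directions one client picks over `ticks` updates."""
    rng = random.Random(seed)  # same seed on every client
    choices = []
    for _ in range(ticks):
        for _ in range(frames_per_tick):
            rng.random()  # per-frame draw; count depends on frame rate
        choices.append(rng.choice(DIRECTIONS))
    return choices


fast_client = simulate(seed=42, frames_per_tick=3)
slow_client = simulate(seed=42, frames_per_tick=2)
same_rate_client = simulate(seed=42, frames_per_tick=3)
```

Two clients running at the same frame rate agree exactly. Change the number of draws per tick on one of them and the shared seed no longer helps, because the generators are at different positions in the sequence when the gameplay choice is made.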

So the slippery slope began. We found butterfly effects all over the place. Actions were run in different orders. Action requests had side effects before they were routed through the server. We didn't initially disable game-pause, and buffers backed up and overflowed. ...

The result was that the design team could not work. They tried doing development single-player, but this was an online game. No online content was working. We built a manual resync mechanism so the playtesters could get a full state snapshot sent down from the server (ctrl-L; like in emacs!). And we noticed they would hit ctrl-L every 10 seconds "just in case". But that reset every client, and other playtesters got upset when their workflow was interrupted (every 10 seconds).

So we built an automatic resync that detected drift. But for large levels the state snapshot was bigger than the message system could handle. And on and on. Drift-fix. Sync-fix. Timing-fix. Side-effect fix...
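One common way to detect drift is to compare a digest of the state instead of the state itself. The sketch below is an assumption about how such a check could look, not a description of what we built; `state_digest` and `needs_resync` are hypothetical names, and the state is assumed to be JSON-serializable.

```python
import hashlib
import json


def state_digest(state):
    """Hash a canonical serialization so equal states give equal digests."""
    payload = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_resync(server_state, client_digest):
    # The client sends only its digest; the server compares it to its own.
    return state_digest(server_state) != client_digest


server_state = {"sim1": {"x": 4, "y": 2}, "lamp": {"on": True}}
in_sync_client = {"lamp": {"on": True}, "sim1": {"y": 2, "x": 4}}
drifted_client = {"sim1": {"x": 5, "y": 2}, "lamp": {"on": True}}
```

The digest is small no matter how large the level is, so detection is cheap. The fix is not: once drift is detected, the full snapshot still has to fit through the message system.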

We actually *shipped* with a resync that grabbed a state snapshot for each Entity involved in each action and applied it before each action was played out on the client. The only thing that allowed this to work was that the interaction rate was so much lower than a first person shooter that we didn't swamp the server to client network connection.
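In outline, the shipped approach looks something like the sketch below. The message layout and the names `make_action_message`, `play_on_client`, and `move` are hypothetical; the point is only that each action carries a snapshot of the Entities it touches, and the client overwrites its own copies before running the script.

```python
import copy


def make_action_message(server_state, action, entity_ids):
    """Server side: bundle the action with a snapshot of involved Entities."""
    snapshot = {eid: copy.deepcopy(server_state[eid]) for eid in entity_ids}
    return {"action": action, "snapshot": snapshot}


def play_on_client(client_state, message, run_script):
    """Client side: apply the snapshot first, then play the action."""
    client_state.update(copy.deepcopy(message["snapshot"]))
    run_script(client_state, message["action"])


def move(state, action):
    # Stand-in for a script: shift one Entity along x.
    state[action["entity"]]["x"] += action["dx"]


server_state = {"sim1": {"x": 0}, "lamp": {"on": False}}
client_state = {"sim1": {"x": 7}, "lamp": {"on": True}}  # drifted copy
message = make_action_message(server_state, {"entity": "sim1", "dx": 2}, ["sim1"])
play_on_client(client_state, message, move)
```

After the call the client's `sim1` matches what the server would compute, while `lamp`, which was not involved in the action, is still drifted. Every message now carries state as well as the event, which is only affordable when actions are infrequent.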

It was hard, time-consuming, and technically embarrassing. So what is the "right" way to do it? More on that...
