Monday, January 18, 2010

Dynamic Load Balancing a Large Scale Online Game

Why bother designing your server to support dynamic load balancing? You can load test, measure and come up with a static load balance at some point before going live, or periodically when live. But...
  • Your measurements and estimates will be wrong. Be honest, the load you simulated was at best an educated guess. Even a live Beta is not going to fully represent a real live situation.
  • Hardware specs change. It takes time to finish development, and who knows what hardware is going to be the most cost effective by the time you are done. You definitely don't want to have to change code to decompose your system a different way just because of that.
  • Your operations or data center may impose something unexpected, or may not have everything available that you asked for. You might think "throw more hardware at the problem". But if they are doing their jobs, they won't let you. And if you are being honest with yourself, you know that probably wouldn't have worked anyway.
  • Hardware fails. You may lose a couple of machines and not be able to replace them immediately. Even if you shut down, reconfigure, and restart a shard, the change to the load balance must be trivial and quick. The easiest way is to have the system itself adjust.
  • Your players are going to do many unexpected things. Like all rush toward one interesting location in the game. Maybe the designers choose to do this on purpose using a holiday event. Maybe they would really appreciate it if your system could stand up to such an event so they *could* please the players that way.
  • The load changes in interesting sine waves. Late at night and during weekdays, the load will be substantially less than at peak times. That is a lot of hardware just idling. If your system can automatically migrate load to fewer machines, and give back leased machines (e.g. emergency overload hardware), you might be able to cut a deal with your hosting service to save some money. Anybody know whether the "cloud" services support this? What if you are supporting multiple titles whose load profiles are offset? You could reallocate machines from one to another dynamically.
  • Early shards tend to have quite a lot higher populations than newly opened ones. As incentives to transfer to new shards start having effect, hardware could be transferred so that responsiveness can remain constant while population and density changes.
  • The design is going to change. Both during development and as a result of tuning, patches and expansions. If you want to let the designers make the game as fun (and successful) as possible, you don't want to give them too many restrictions.
  • It may be painful to think about but your game is going to eventually wind down. There is a long tail of committed players that will stay, but the population will drop. If you can jettison unneeded hardware, you can save money and make that time more profitable. (And you should encourage your designers to support merging of shards.) I am convinced that there is a lot of money left on the table by games that prematurely close their doors.
So what can you actually "balance"? You can't decompose a process and reallocate its objects, computation and data structures between processes. Not without reprogramming, or programming it in from the beginning. So load balancing entire processes is not likely to cut it.

The best way to get there is to design for parallelism from the beginning. Functional parallelism only goes so far. E.g. if you have only one process that deals with all banking, you can't split it when there is a run on the bank.

So what kinds of things are heavy users of resources? What things are relatively easy to decompose into lots of small bits (then recombine into sensibly sized chunks using load balancing)?

Here are some ideas:
  • Entities. Using interest management (discussed in other blog entries), an Entity can be located in any simulator process on any host. There are communication overheads to consider, but those are within the data center. If you are creative, many of the features that compose a server can be represented as an Entity, even though we often limit our thinking to them as game Entities. E.g. a quest, a quest party/group, a guild, an email, a zone, the weather, game and system metrics, ... And of course, characters, monsters, and loot. The benefit of making more things an Entity is that you can use the same development tools, DB representation, execution environment, scripting/behavior system and SDK, ... And of course, load balancing. There are usually a very large number of Entities in a game, making it pretty easy to find a good balance (e.g. bin-packing). Picking up an Entity is often a simple matter of grabbing a copy of its Properties (assuming you've designed your Entities like this to begin with; with load balancing in mind). This can be fast because Entities tend to be small. Another thing I like about Entity migration is that there are lots of times when an Entity goes idle, making it easy to migrate without a player being affected at all. Larger "units" of decomposition are likely to never be dormant, so when a migration occurs, players feel a lag.
  • Zones. This is a pretty common approach, often with a number of zones allocated to a single process on a host. As load passes a threshold, the zone is restarted on another simulator on another machine. This is a bigger chunk of migration than an Entity, and doesn't allow for an overload within one zone. The designers have to add game play mechanisms to discourage too much crowding together. The zone size has to be chosen appropriately ahead of time. Hopefully load-balancing-zone is not the same as game-play-zone, or the content team will really hate you. Can you imagine asking them to redesign and lay out a zone because there was a server overload?
  • Modules. You will decompose your system design into modules, systems, or functional units. Making the computation of each of these relocatable requires little extra work. Although there are usually a limited number of systems (functional parallelism), and there is almost always a "hog" (see Amdahl's law). Extracting a Module and moving it requires quite a bit more unwiring than an Entity. Not my first choice. But you might rely on your fault tolerance system and just shut something down in one place, and have it restart elsewhere.
  • Processes. You may be in a position where your system cannot easily have chunks broken off and compiled into another process. In this case, only whole processes can be migrated (assuming they do not share memory or files). Process migration is pretty complicated and slow, given how much memory is involved. Again, your fault tolerance mechanism might help you. If you have enough processes that you can load balance by moving them around, you may also have a lot of overhead from things like messages crossing process boundaries (usually via system calls).
  • Virtual Machines. Modern data centers provide (for a price) the ability to re-host a virtual machine, even on the fly. Has anyone tested what the latency of this is? Seems like a lot of data to transmit. The benefit of this kind of thinking is that you can configure your shard in the lab without knowing how many machines you are going to have, and run multiple VMs on a single box. But you can't run a single VM on multiple boxes. So you have that tradeoff: too many VMs gives high overhead, and too few gives poor balancing options.
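With a very large number of small Entities, computing a good target balance can be as simple as a greedy bin-packing pass (first-fit-decreasing). A minimal sketch, assuming per-Entity cost estimates come from your own profiling; the function and data shapes here are invented for illustration, not any particular engine's API:

```python
def balance(entities, hosts):
    """Greedy bin-packing: assign each Entity to the least-loaded host.

    entities: list of (entity_id, estimated_cost) pairs, costs from profiling.
    hosts: list of host names.
    Returns {host: [entity_id, ...]}.
    """
    assignment = {h: [] for h in hosts}
    load = {h: 0.0 for h in hosts}
    # Placing the heaviest Entities first is the classic
    # first-fit-decreasing heuristic; it avoids stacking all the
    # expensive Entities on one machine.
    for entity_id, cost in sorted(entities, key=lambda e: -e[1]):
        target = min(hosts, key=lambda h: load[h])
        assignment[target].append(entity_id)
        load[target] += cost
    return assignment
```

In a live system you would run something like this incrementally, migrating only dormant Entities toward the target assignment, rather than recomputing and moving everything at once.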
Remember, these things are different from one another:
  • Decomposition for good software engineering.
  • Decomposition for good parallel performance.
  • Initial static load balancing.
  • Creating new work on the currently least loaded machine.
  • Dynamically migrating work.
  • Dynamically migrating work from busy to less loaded machines.
  • Doing it efficiently and quickly (without lags).
  • And having it be effective.
I think balancing load is a hard enough problem that it can't really be predicted and "solved" ahead of time. So I like to give myself as much flexibility as possible ahead of time, and good tools. Even if you don't realize full dynamic migration at first, at least don't box yourself into a corner that requires rearchitecting.

Sunday, December 20, 2009

(Real) Science IS political, sorry to disillusion you

There is the big "climate gate" hoo-haw in the media right now. Reporters are acting surprised that some leading scientists were caught manipulating scientific literature to silence skeptics and dissenters. They convinced peers to pan skeptics' articles, pushed journals to reject skeptics' papers, removed peers from paper review committees if they passed skeptics' papers, and even shut down journals that published dissenting views. Maybe they even fudged their data.

Hey! Science is supposed to be awesome. It always eventually gets it right. Real science is about reproducible experiments and validated results. Actually, no, it is political. Like most other human endeavors.

  • Galileo and other historical scientists were shut down by their community. Granted, their scientific peers were not basing their views on empirical data. But many modern scientists still base their views on what they've been taught, not what they've measured. This is understandable, you have to stand on the shoulders of giants to have time to make advances.
  • Views are often validated by currently understood standards. If you plot the published speed of light against the year of publication, you will see sequences of flat spots where a value is almost identical to what was previously "measured". And then there is a jump, followed by another flat period of many years. Is this because they were using the same equipment and experimental procedure? Or because anyone that tried to publish a different "answer" was considered a skeptic and shut down? Again, this is understandable. Humans tend to try to be consistent, and not be antisocial and go against the crowd. It requires extra diligence to disprove a well respected master in their field. Like the great Einstein when quantum physics popped up. (Wait! Maybe God changes the real speed of light periodically as a joke!)
  • Scientists are funded. Based on whether they get published or referenced. Or if they agree with the "sponsoring" corporation. So the system manipulates them into agreeing with the crowd. But the underbelly of the scientific community is less pretty than what we might have thought of as this kind of indirect pressure.
  • Journals are funded. If the larger community of scientists doesn't subscribe to what that journal offers, or schools or corporations pull their funding (or advertising!), a venue of dissent/questioning dies.
  • I've seen public "attacks" during a presentation of research. In the form of a "question". Being honest, these questions are self-aggrandizing. E.g. "what makes you think that you are right when I have already published the opposite". It can embarrass a young scientist and discourage them from disagreeing in the future. Only the thick-skinned "crazies" keep at it. Like Tesla.
  • I've been on paper review committees where papers are summarily discarded. There are so many that only one or two of the reviewers are assigned to read a given paper before the meeting. If they didn't understand it, or disagreed with the findings (based on their own experience/bias), it can get tossed very quickly. There are a lot to get through. Even when it is "marginal", the shepherding process can be taxing, discouraging a reviewer from volunteering. After all, they are contributing their expertise, but don't get their name on the paper. (Suggestion: maybe they should. If they pass something that proves incorrect, they lose points. As an incentive to get it right. Or would that discourage participation?). For "workshops" (not full Journals) an author's name is sometimes on the paper being reviewed, so their reputation is considered.
  • As science gets more "fine", some experiments cannot be reproduced except on the original equipment or by the original experts. Think CERN and supercolliders. (Or cold fusion?) Who has another billion dollars just to *reproduce* a result? Unless the larger community thinks a result is hogwash and feels motivated to pool their resources and dump on the results. So who is going to disagree? It almost sounds like a religion at that point.

I bet you've done the same things. Maybe at work. Tried to shut down "the competition". Competing for attention, or a raise, or recognition... You know what would be better? Listen to those that disagree and make sure they know you have heard them. They are trying to make the best decisions they can given their background. No one tries to make dumb decisions. If they are wrong, I'm sure they would appreciate learning something they don't know. Or, maybe they have something to teach you.

Imagine! Learning something from someone that disagrees with you and you find irritating. A dissenter. A skeptic. Seems like those that shut down dissent are not just closed minded, but unwilling to learn. Such a scientist should be embarrassed for themselves. Isn't IDEAL science supposed to be about discovery? Too bad that in reality there is so little ideal science, and so much science influenced by the politics of "real-life science".

Tuesday, December 8, 2009

Data Driven Entities and Rapid Iteration

It is clearly more difficult to develop content for an online game than a single player game. (For one, sometimes the entities you want to interact with aren't even in your process.) So starting with the right techniques and philosophies is critical. Then you need to add tools, a little magic, and shake.

There are several hard problems you hit when developing and debugging an online game:
  • Getting the game to fail at all. A lot of times bugs are timing related. Of course, once you ship it, to players it will seem like it happens all the time.
  • Getting the same failure to happen twice is really hard. E.g. if the problem is caused by multiplayer interaction, how are you going to get all players or testers to redo exactly the same thing? And in the spirit of Heisenbugs, if you attach a debugger or add logging, good luck getting it to fail under those new conditions.
  • Testing a fix is really hard, because you want to get the big distributed system back into the original state and test your fix. Did you happen to snapshot that state?
  • Starting up a game can take a long time (content loading). Starting an online game takes even longer because it also includes deployment to multiple machines, remote restarting, loading some entities from a DB, logging in, ...
  • If you are a novice content developer plagued by such a bug or a guy in QA trying to create repro steps to go along with the bug report, it will probably end badly.
Consequently, what do you need to do to make things palatable?
  • Don't recompile your application after a "change". Doing that leads (on multiple machines) to shutdown, deploy, restart, "rewind" to the failure point. You'd like to have edit and continue of some sort. To do that, almost certainly you'd need a scripted language (or at least one that does just in time compilation, and understands edit and continue).
  • Don't even restart your application. Even if you can avoid recompilation, it can take a loooong time to load up all the game assets. Especially early in production, your pipeline may not support optimized assets (e.g. packed files). For a persistent world, there can be an awful lot of entities stored in the database to load and recreate. Especially if you are working against a live or a snapshot of a live shard. At the very least, only load the assets you need.
  • Yes, I'm talking about rapid iteration and hot loading. When you can limit most changes to data, there is no reason you can't leave everything running, and just load the assets that changed. In some cases when things change enough you might have to "reset" by clearing all assets from memory and reloading, but at least you didn't have to restart the server.
  • Rapid iteration is particularly fun on consoles, which often have painfully long deployment steps. Bear in mind that you don't have an in-game editor on a console; it is too clumsy. So the content you see in your editor on your workstation is just a "preview". You would swivel-chair to the console to see what it really looks like on-target.
  • Try to make "everything" data driven. For example, you can specify the properties and behaviors of your entities in a tool and use a "bag of properties" kind of Entity in most cases. After things have settled down, you can optimize the most used ones, but during content development, there is a huge win to making things data-driven. Of course, there is nothing stopping you from doing both at once.
  • Another benefit of a data-driven definition of an Entity is that it is so much easier to extract information needed to automatically create a database schema. Wouldn't you rather build a tool to do this than to teach your content developers how to write SQL?
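The "bag of properties" Entity and the schema-extraction tool above can be sketched together, since both read the same data-driven definition. This is a minimal illustration, with invented property names and a deliberately simple definition format (name mapped to SQL type and default); a real pipeline would load definitions from your content tools:

```python
class Entity:
    """Data-driven entity: its properties come from an editable definition."""

    def __init__(self, definition):
        # definition: {property_name: (sql_type, default_value)}
        self.definition = definition
        self.properties = {name: default
                           for name, (_, default) in definition.items()}

    def reload(self, new_definition):
        """Hot-load a changed definition without restarting the server:
        keep values for properties that survived, add new defaults,
        drop properties the content developer removed."""
        for name, (_, default) in new_definition.items():
            self.properties.setdefault(name, default)
        for name in list(self.properties):
            if name not in new_definition:
                del self.properties[name]
        self.definition = new_definition


def schema_for(table, definition):
    """Emit a CREATE TABLE from the same data the content tools edit,
    so nobody has to teach content developers SQL."""
    cols = ", ".join(f"{name} {sql_type}"
                     for name, (sql_type, _) in definition.items())
    return f"CREATE TABLE {table} ({cols})"
```

The point of the sketch: one definition drives both the runtime property bag and the DB schema, so a tool change propagates everywhere without a recompile.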
Don't forget that most of the time and money in game development pours through the content editor GUIs. The more efficient you can make that, the more great content will appear. If you want to hire the best content developers, make your content development environment better than anyone else's.

Thursday, November 19, 2009

That's not an MMO, it's a...

Some people call almost anything "big" an MMO. Is Facebook an MMO? Is a chess ladder an MMO? Is Pogo? Not to me.

What about iMob and all the art-swapped stuff by The Godfather on the iPhone? You have an account with one character. Your level/score, your money, and which buttons you've pressed on the "quest" screen persist. As much as they want to call that an MMO, it is something else. Is it an RPG? Well, there is a character. But you don't even get to see him. Or are you a her?

These super-light-weight online games are not technically challenging. You can build one out of web-tech or some other transaction server. If you are all into optimizing a problem that scalable web services solved years ago, cool. Your company probably even makes money. But it doesn't push the envelope. Someone is going to eat your lunch.

Maybe I should have called this blog "Interesting Online Game Technologies".

Me? I want to build systems that let studios build the "hard" MMOs, like large seamless worlds. I don't want a landscape that only has WoW. If that tech were already available, we'd be seeing much lower budgets, more experimentation, games that are more fun, lower subscription fees, more diversity, and better content. All at the same time. I certainly don't want to build the underlying infrastructure over and over.

Of course, I'd love it if the tech solved all the hard problems so my team could build something truly advanced while still landing the funding. Unfortunately, today, people have to compromise. But maybe not for long.


Friday, August 21, 2009

What is software architecture (in 15 pages or less)?

Kruchten's "4+1 views of software architecture" is one of my favorite papers of all time. It shows four views of software architecture (logical/functional, process, implementation/development, and physical). The plus one is use-cases/scenarios.

Being 15 pages long, it is an incredibly efficient use of your time if you want to disentangle the different aspects of designing complex systems. And it gives you terminology for explaining what view of the system you mean, and which you have temporarily abstracted away.

Don't get hung up on the UML terminology:
  • Logical is what is commonly referred to as a bubble diagram. What components are in your system, and what are they responsible for?
  • Process is which OS processes/threads will be running and how they communicate.
  • Implementation is the files and libraries you used to build the system.
  • Physical is your hardware and where you decided to map your processes.
http://en.wikipedia.org/wiki/4%2B1_Architectural_View_Model
I prefer the original paper: http://www.cs.ubc.ca/~gregor/teaching/papers/4+1view-architecture.pdf

Monday, August 10, 2009

Fail over is actually kind of hard (and expensive)

You are building a peer to peer or peered-server small scale online game. There are players that purposely disconnect as soon as someone starts beating them. They think their meta score will be better, or whatever. Or maybe they have a crappy internet connection. In any case, the master simulator/session host goes down abruptly; now what?

At least you would want to keep playing with the guys you've matched with; the rest of your clan or whatever. At best, you'd like to keep playing from exactly the point of the "failure" without noticing any hiccup (good luck). The steps to get session host migration/fail over to a new simulator:
  • Re-establish the network interconnection
  • Pick a new master simulator
  • Coordinate with the Live Service about who is the master (or do this first)
  • Have the entity data already there, or somehow gather it
  • Convert replicated entities into real owned/master entities
  • Reestablish interest subscriptions, or whatever distributed configuration settings are needed
  • "Unpause"
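The steps above can be sketched as the recovery sequence the pre-designated backup simulator runs once the master is declared dead. Everything here is an assumption standing in for a real Live Service and interest-management API, so the collaborators are passed in as callables:

```python
class ReplicatedEntity:
    """Stand-in for an entity whose state was mirrored to the backup."""

    def __init__(self, entity_id, state):
        self.id, self.state, self.authoritative = entity_id, state, False

    def promote(self):
        # Replicated -> owned/master. In-flight behaviors can't be resumed,
        # so an OnRecreate hook must reschedule them from the restored state.
        self.authoritative = True


def fail_over(replicas, claim_master, reconnect_peers, resubscribe):
    """Session-host migration, run on the pre-designated backup simulator.

    claim_master: tell the Live Service who the new master is (do this first).
    reconnect_peers: re-establish the interconnection; returns the peer list.
    resubscribe: redo interest subscriptions for the reconnected peers.
    """
    claim_master()
    peers = reconnect_peers()
    for entity in replicas:
        entity.promote()          # convert replicated entities into masters
    resubscribe(peers)
    return "unpaused"
```

Note the ordering matches the list: coordinate with the Live Service before anything else, and only "unpause" once every replicated entity is authoritative and subscriptions are rebuilt.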
Some challenges:
  • To reconnect, someone needs the complete list of IPs/ports used by the players. But that is considered a security issue. E.g. someone could use that info to DoS attack an opponent. Let's assume the Live Service renegotiates and handshakes your way back into business.
  • If you aren't connected, how do you elect a new master? If you don't have a master yet, how would all the clients (in a strict client/server network topology) know who to connect to? So the answer has to be precomputed. E.g. designate a backup simulator before there is a fault (maybe the earliest joiner, lowest IP address...)
  • If your game session service supports this, it can solve both of the previous issues by exposing IP addresses only to the master simulator, and since it has a fixed network address, each client can always make a connection to it and be told who is the new master.
  • If the authoritative data is lost on the fault, you may as well restart the level, go back to the lobby or whatever. So instead, you have to send the entity state data to the backup simulator(s) as you go. This is actually more data than is necessary to exchange for an online game that is not fault tolerant, since you'd have to send hidden and possibly temporary data for master Entities. Otherwise you couldn't completely reconstruct the dead Entities. There may be data that only existed on the master, so gathering it from the remaining clients isn't going to be a great solution. Spreading the responsibility is that much more complicated.
  • So the backup master starts converting replicated Entities into authoritative Entities. Any Entities it didn't know about couldn't get recreated, so the backup master has to have a full set of Entities. Think about the bandwidth of that. You should really want this feature before just building it. Now we hit a hard problem. If the Entities being recreated had in-flight Behaviors (e.g. you were using coroutines to model behavior), they can't be reconstructed. It is prohibitively expensive to continuously replicate the Behavior execution context. So you wind up "resetting" the Entities, and hoping their OnRecreate behavior can get it running again. You may have a self-driven Behavior that reschedules itself periodically. Something has to restart that sequence. Another thing to worry about: did the backup simulator have a truly-consistent image of the entity states, or was anything missing or out of order? At best this is an approximation of the state on the original session host.
  • Unless you are broadcasting all state everywhere, you are going to have to redo interest management subscriptions to keep bandwidth limited. This is like a whole bunch of late-joining clients coming in. They would each get a new copy of each entity state. Big flurry of messages, especially if you do this naively.
  • Now you are ready to go. Notify the players, give them a count-down...FIRE!
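Pre-designating the backup deterministically means every surviving peer can agree on the new master without exchanging a single election message. A sketch of the "earliest joiner, lowest address" rule suggested above (the tuple layout is an invented convention):

```python
def elect_backup(peers, dead_master):
    """Pick the new session host with zero election traffic.

    peers: list of (ip, port, join_order) tuples that every client
    already knows from session setup. Because every survivor runs the
    same rule on the same data, they all pick the same host: earliest
    joiner wins, with lowest IP/port breaking ties.
    """
    survivors = [p for p in peers if p != dead_master]
    return min(survivors, key=lambda p: (p[2], p[0], p[1]))
```

This only works if the peer list is consistent everywhere before the fault, which is exactly why the game session service is a natural place to hand it out.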
What did we forget? What defines "dropped out": a maximum unresponsiveness time? What if the "dead" simulator comes back right then? What if the network "partitioned"? Would you restart two replacement sessions? Do you deal with simultaneous dropouts (have you ever logged out just as the server went down? I have)?

Note that the problem gets a lot easier if all you support is clean handoff from a running master to the new master. Would that be good enough for your game?

So is it worth the complexity, the continuous extra bandwidth and load on the backup simulator? Just to get an approximate recreation? With enough work, and game design tweaking, you could probably get something acceptable. Maybe give everyone a flash-bang to mask any error.

Or maybe you just reset the level, or go back to the lobby to vote on the next map. And put the bad player on your ban list.

Me? I'd probably instead invest the time of my network and simulator guys in something else, like smoothness, fun gameplay, voice, performance. Or ship earlier.

Thursday, July 30, 2009

Incremental release vs. sequel

(Warning: plenty of irony below, as usual)

There was a time when the MMO developer community thought that the ideal was to stand up your world, and then start feeding the dragon. As quickly as possible, get new content into the players' hands. The more new content, the more fascinated they would be, the stickier their subscriptions would be, and the more money you would make.

So we put a lot of effort into techniques to manage continuous development of content, test it, and roll it out with the minimum possible maintenance window. Some got good enough they could release content or patches every week. Didn't we have automated client patchers? Why not use those to continuously deliver content? Not just streaming content as you move around in virtual space, but as you move forward in real time.

Then someone noticed that Walmart took their game box off the shelf because it had been there for a year, and new titles were showing up. Surely consumers want the new stuff more? Besides, you don't need the latest release, you get all the new stuff when you patch. Then new subscriptions drop because of that lack of visibility at retail. Why would Gamestop sell prepaid cards if it is so easy to pay online?

So the light goes bing, and it is suddenly obvious that sequels would be a much better approach, since you'll get shelf space if it is a new SKU. Clearly this is the best approach, since it works so well for Blizzard. (So clearly, I cannot choose the glass in front of you. Where was I?) All you have to do is patch a few bugs, and set up a parallel team to work on the sequel.

But then everyone piles onto Steam. Definitely the end of brick and mortar. Maybe we go back to the low-latency content pipeline so our game is fresher than the sequel-only guys. But wait, Steam sales and free trials increase traffic at Gamestop.

Clearly, I cannot choose the glass in front of me... Wait til I get started.

Not really. As you can see, the point is that technologists are very unlikely to see the future of sales and distribution mechanisms. And if we did, it would take a year to adapt our development practices and product design to take optimal advantage of it.

The answer? Be flexible. Don't assume you've got the one and only magic bullet. Requirements change. And for an MMO the development timespan is large enough that a lot of things will change before you are finished. Don't implement your company into a corner with a big complex optimal single point solution, and keep your mind open.