Online Game Techniques: 2010

Wednesday, December 22, 2010

The Real Priorities of Online Game Engineering

I was trying to communicate to management that server developers have different priorities than game developers. As a means to show the importance of laying in administrative infrastructure, and other software engineering "overhead", I put this list together. Hope it helps you to think about making the right investment in making the system sustainable, and make those points to the powers that be.

This is a list of priorities in absolute order of importance. While it is good to address all of them, if we don’t have one of the higher priority requirements solved to a reasonable degree, there is not much point in having the lower ones.

I made this to help us focus on what is important, what order to do things, and what we might cut initially. I’d love to debate this over lunch with anyone. I’m hoping others think of more of these kind of driving requirements.

Don’t get sued. In particular, always deal with child safety. We also need to abide by our IP contract obligations (sometimes including the shipping date). Better to not ship than get sued into oblivion or go to jail.
Protect the game’s reputation. Even if the game is awesome, if the public thinks it isn’t or the service is poor, then we lose. This is especially important early in the lifecycle. This implies not shipping too early.
Be able to collect money. Even if there is no game.
Be able to roll out a new version periodically. Even if the game is broken or not finished, this means we can fix it. This implies:

You can make a build
You can QA it
You can deploy it without destroying what is already there, or at least roll back

Support effort is sustainable. If the game and system are perfect, but it needs so much handholding that our staff burns out, or we don’t have the resources to extend it, we still fail. This implies lots of stuff:

It is stable enough that staff is not working night and day to hold its hand.
There is enough automated maintenance to limit busy work
There is enough automated logging, metrics and alarms to limit time spent hovering

The cost of operating is not excessive. I.e. it sits lightly enough on the hardware that we don’t need massive amounts, or exotic types. (Special warning to engineers: it is all the way down here before we start to care about performance. And the only reason to care about performance is operating cost.)
Enough players can connect. This implies lots of stuff:

The cluster hardware exists at all, the network is set up, etc
There is a web site
Key platform and login features exist
There are enough related server and game features

The server is sufficiently stable that players can remain connected long enough. This implies lots of stuff:

It stays up.
There are no experience-ruining bugs or tuning problems.
Not too much lost progress when things crash.
The load is not allowed to get too high (population caps)

This is probably about where we need to get before Closed Beta.

Revenues exceed cost of operation. And eventually, cost of development. This implies not shipping too late. Note that you don't *have* get to this point immediately. And that this is more important than having a fun game.
The game is fun. This implies so much stuff, I won’t write it all down. Note that the requirement to not ruin the game's reputation can move some of this stuff earlier. But don't fool yourself. If you are making money on a game that is not fun, is that bad? I'm sure you can think of some examples of this. Here are some server-specific implications:

You aren’t put in a login queue for too long. You don’t have trouble finding a good time to play.
You aren’t dropped out of the game too often.
The feeling of lag is not that bad.
You can find people to play with. It is an online game, after all.

Players cannot ruin one another’s fun. Note that making the game cheat proof is not the requirement here. The only reason you care about cheating is if other players perceive it badly enough (reputation), or if the players are keeping you from making money.

They cannot grief one another, especially newbs.
They cannot bring down the server
They cannot ruin the gameplay or economy, making swatches of gameplay pointless or boring.

The server can scale to very large numbers of players. This is the profit multiplier.

Be honest with yourself. Are you over engineering? Solving fun technical problems that don't actually address any of these Real Priorities? Doing things in the right order? Remember, as the online engineer, you represent these priorities to management. They may not (yet) understand why this order is important.

Tuesday, December 14, 2010

"Restraint in what you ask for is the key to success" - Arnold's Aphorism

A good friend of mine sent me an email about our current sprint (we use a relatively formal Scrum process), encouraging us to define our stories such that the team could experience a success. He is also a great student of military history. I assumed the quote below was from some past general or philosopher. He claims not. So I've named it after him and am spreading the word.

"Restraint in what you ask for is the key to success" - Arnold's Aphorism

I can picture Napoleon fighting with himself over asking his men to achieve a little too much and fail. Or ask less and wind up with better morale. Applying this to software engineering teams makes a lot of sense. If you always ask for more than can be successfully completed, the team may feel unsuccessful/underachieving, and you may feel disappointed. When in fact, they are doing almost the same work whether you expected more or not.

Here's the thing, though. It's been said before. In all things, moderation. Time management, and interpersonal relationship books talk about expectation management. "Under promise and over deliver". The pessimist is never disappointed and often surprised. Live beneath your means. "Let go". Thou shalt not covet.

This is probably more true of things we ask of ourselves: get this, do that, convince them...

What struck me in the phrasing of this old-new aphorism is how clearly it shows that you have the *choice* in setting the expectations. And yet, that goal is the very thing that defines success. And for many goal-oriented folks like us, that success defines our happiness. Ergo, we choose to be happy. Or not.

But it requires self discipline. Not so much in the exercise of effort, but in the restraining of wanting.

You may have heard analyses of the great American marketing machine. It generates unrequited desires, while offering to fulfill them (for a price). The promise of happiness. Oddly, it is more commonly the restraint of those desires that leads to happiness, not the fulfillment.

I think you are going to see many more instances of this wisdom over then next while. See if you can't identify the aspect that you control.

Wednesday, November 10, 2010

Big load testing

I am a real fan of using the real client to do load testing. Your QA engineers will spend a lot of time building regression tests that verify the behavior of the game is still the same and no new bugs have been introduced. That entails adding scripting or player behavior "simulation" to the game code, but also includes creating the scripts that test the functionality of the game. Those test cases are really important and ideally cover almost all of the game functionality. And they have to be kept up to date as the code in the game changes.

Why not reuse all that work to help load test the server? The scaffolding, client hooks, and test cases?

One of my favorite ways of doing this is to have a test driver that picks random test cases and throws them at the server as fast as possible. Even if the test case involves sleeping or waiting for something like the character walking across some area of the game, if you run enough of them at the same time, you can generate significant load. And it is going to be more realistic than any other kind of test prior to having zillions of real players. It also saves you from having to reproduce the protocol and behavior of the real client and maintain it as the game team evolves everything.

Why not? Even if you take the time to make a headless version of the client, it is probably going to be so resource heavy that you will have trouble finding enough machinery to really ramp up. Most games are designed to tick as fast as possible to give the best framerate, but a headless client doesn't draw, so that is a waste of CPU. Some games rely on the timing intrinsic in animations to control walk speed or action/reaction times for interactions. But you want to strip out as much content as possible to save memory. Clearly there is a bunch of work needed to reduce the footprint of even a headless client. But they really are useful.

One thing you can do to make them more useful is construct a mini server cluster and see how it stands up to as many clients as you can scavenge.

You can get hold of more hardware than you might think by "borrowing" it at night from the corporate pool of workstations. You will need permission, and you will want a fool proof packaging so your clients can be installed (and auto-updated) without manual intervention or a sophisticated user. There is nothing like a robot army to bring your server to its knees. IT doesn't like this idea very much because they like to use night time network bandwidth for doing backup and stuff.

Another important trick is to observe the *slope* of performance changes relative to the change in load you throw at the server. If the marginal effect (incremental server load divided by incremental client load) is > 1 you have a problem. Some people call this non-linear or non-scalable performance. Although, to be technical, it is non-unitary. Non-linear means it is even worse that y = a*x + b. E.g. polynomial (x^2), or exponential (y = a^x). Generally you can find the low hanging fruit pretty easily. If the first 500 connected clients caused a memory increase of 100 MB, but the second 500 caused consumed 200 MB you have a problem. Obviously this applies to CPU, bandwidth and latency. And don't forget to observe DB latency as you crank up both the number of clients and the amount of data already in the DB. You may have forgotten an index.

But you may still not have enough insight, even given all this. The next step could be what I call a light-weight replay client or a wedge-client. The idea is to instrument the headless client, or graphical client and record the parameters being passed into key functions like message send, or web service calls. You are inserting a wedge between the game code and the message passing code. The real client can then be used to create a log of all the interesting data that is needed to stress the server. You would then create a replay client that uses only the lower level libraries. It would read the logs, passing the recorded parameters into a generic function that reproduces the message traffic and web requests. It doesn't have to understand what it is doing. The next step is to replace the values of key parameters to simulate a variety of players. You could use random player ids, or spend some more time having the replay client understand the sequences of logs and server responses. E.g. it could copy a session ID from a server response into all further requests.

Since you are wedging into existing source code, this approach is way easier than doing a network level recording and playback. That would require writing packet parsing code, and creating a state machine to try to simulate what the real client was doing. Very messy.

You might still not be able to replay enough load. Perhaps you don't have enough front end boxes purchased yet, but you want to stress your core server. The DB or event processing system. We use a JMS bus (it is great for publish/subscribe semantics that allows for loose coupling between components) to tie most things together on the back end. We built a record/replay system that pulls apart the JMS messages and does parameters replacement much like the wedge client described above. It is pretty simple to simulate thousands of players banging away. Not every client event results in a back end event that affects the DB.

So what we are planning on doing is:
a) build a mini-cluster with just a few front end boxes
b) use QA's regression test cases to drive them to their knees looking for bad marginal resource usage
c) use wedge recordings and replay if needed for even more load on the front end boxes
d) use the JMS message replay system to drive the event system and DB to its knees, also looking for bad marginal usage.
e) do some shady arithmetic to convince ourselves that the simulated client count that resulted in X% utilization of our test cluster will allow us to get to our target client count in the remaining 100-X% utilization available and the new hardware we plan to have in production.

Thursday, October 28, 2010

Use of public/private keys to avoid a race condition

I inherited an interesting technique when I took over the server of SHSO. The use of signatures to avoid a race condition that can occur when a player is handed off between hosts. The use case is: the client requests that the Matchmaker connect them to a game "room" server that has a new game session starting. That room needs correct information about the player, what level they are, what they own, etc. How do you get it to the room before the client connects to that room and the data is needed? A couple of approaches:

You don't. You just leave the client connected but in limbo until their data arrives through the back end asynchronously. This is a pretty normal approach. Sometimes the data arrives before the client makes the connection. So you have to cover both of those cases.
You don't. You leave the client in limbo while the room fetches the data through the back end synchronously. This is also pretty normal, but blocks the thread the room is running on, which can suck, especially if other rooms on the same host and process also get blocked. Yes, you could multithread everything, but that is not easy (see my manifesto of multithreading!). Or you could create a little state machine that tries to remember what kind of limbo the client is in: just-connected, connected-with-data-fetch-in-progress, etc. Personally, I don't allow the processes that run the game logic to open a connection directly to the DB and do blocking queries. DB performance is pretty erratic in practice, and that makes for uneven user experience.
Or have the data arrive *with* the connection. From the client. Interesting idea. But can you trust the client? That is where signed data comes in.

A quick review of cryptography. Yes, I am totally oversimplifying it, and skipping a lot of the interesting bits, and optimization and stuff. However, it is fun to talk about...

A public/private key works by taking advantage of mathematics that is easy to compute in one direction, but really hard to compute in the other direction. The most common is factoring very large integers that are the product of two very large prime numbers. There is only one way to factor the product, but you have to compute and try pretty much every prime number up to the square root of the product, and that can take a looong time.

A public key can be used to encrypt plain text, and the private key is the only thing that can be used to unencrypt it. That means only the owner of the private key can read the plain text (including any else that had access to the public key).

On the other hand, a signature is created by *unencrypting* the plain text using the private key. The public key can then be used to *encrypt* the signature and test if the result equals the plain text again, thereby verifying the signature, and proving that the signature and plain text came from the owner of the private key exactly as they sent it.

Back to the story...the player data is signed using the private key by the Matchmaker and delivered to the client when the Matchmaker directs the player to the correct room. The client cannot tamper with their data without getting caught. The client then sends the data to the game room with the signature when it connects. The room server checks the signature using the public key, and can tell that the data is unmodified and came indirectly from the Matchmaker.

Why not just encrypt the data? The room server could have been set up to be able to unencrypt. Answer: The client wants to read and use its player's data. It wouldn't be able to do that if it were encrypted. And sending it twice (plain, and encrypted) is a waste.

One interesting thing to note is that the client never saw the public, nor the private key.

OK. I know it seems pretty wasteful to send all this data down to the client, just to have it sent back up to a different host in the server. After all, we are talking about bandwidth in and out of the data center, and more importantly, bandwidth in and out of the client's home ISP connection. Only half bad. The client needed the data anyway. It is not the greatest approach, but it is what we have. As Patton is quoted: A good solution applied with vigor now is better than a perfect solution applied ten minutes later.

BTW, another reason this was built this way was that originally the system couldn't predict which room the player would wind up in, so the client needed to bring their data with them.

And that is a segue to the topic of scaling across multiple data centers. It might start to make sense of that extra bandwidth question.

Monday, October 25, 2010

Architect it for Horizontal DB Scalability

The performance of the database in an MMO tends to be the most common limiting factor in determining how many simultaneous players an individual Cluster can support. Beyond that number, the common approach is to start creating exact copies of the Cluster, and call them Shards or Realms. The big complaint about sharding is that two friends may not be able to play together if their stuff happens to be on different shards. Behind the scenes, what is going on is that their stuff is in separate database instances. Each shard only accesses one DB instance because the DB engine can only handle so much load.

There are a couple of approaches that can almost completely get rid of these limits, both of which depend on creating many DB instances, and routing requests to the right instance

The first is an "automated" version of a character transfer. When a player logs in, they are assigned to an arbitrary cluster of machines, and all their stuff is transferred to the DB for that cluster. The player has no idea which cluster they were connected to and they don't have names. There are a couple of problems with this:
1) you can't do this if you want the state of your world around the player to be dynamic and persist; the stuff you throw on the ground; whether that bridge or city has been burned. A player would be confused if each time they logged in, the dynamic state of the world was different. This isn't all that common these days, however. Interesting problem, though.
2) the player might not be able to find their friends. "Hey I'm beside the fountain that looks like a banana-slug. Yeah, me too. Well, I can't see you!"
You might be able to deal with this by automatically reconnecting and transferring the friends to the same cluster and DB, but that gets tricky. In the worst case, you might wind up transferring *everyone* to the same cluster if they are all friendly.

Another approach provides horizontal scalability and is one that doesn't assume anything about how you shard your world, do dungeon instancing, what DB engine you use, or many other complications. That is a nice property, and makes the DB system loosely coupled, and useful across a broad spectrum of large scale persistent online games.

What dynamic data are you persisting anyway? The stuff you want to come back after a power-failure. Most games these days only care about the state of the player, and their stuff. They may have multiple characters, lots of data properties, inventory of items, skills learned, quests in progress, friend relationships, ... If you sort through all your data, you'll find that you have "tuning" or design data that is pretty much static between releases. And you have data that is "owned" by a player.

To a DB programmer, that fact means that the bulk of the data can be indexed by the player_id for at least part of its primary key. So here is the obvious trick:
Put all data that belongs to a given player into a single DB. It doesn't matter which DB or how many there are. You have thousands or millions of players. Spreading them horizontally across a large number of DB instances is trivial. You could use modulus (even player_ids to the left, odd to the right). Better would be to put the first 100,000 players in the first DB, then the next 100,000 into a second. As your game gets more successful, you add new DB instances.

A couple of simple problems to solve:
1) you have to find the right DB. Given that any interaction a player has is with a game server (not the DB directly), your server code can compute which DB to consult based on the player_id it is currently processing. Once it decides, it will use an existing connection to the correct DB. (It is hard to imagine a situation where you would need to maintain connections to 100 DB instances, but that isn't really a very large number of file descriptors in any case.)
2) If players interact and exchange items, or perform some sort of "transaction", you have to co-persist both sides of the transaction, and the two players might be on different DB instances. It is easy to solve the transactional exchange of items using an escrow system. A third party "manager" takes ownership of the articles in the trade from both parties. Only when that step is complete, will the escrow object give the articles back to the other parties. The escrow object is persisted as necessary, and can pick up the transaction after a failure. The performance of this system is not great. But this kind of interaction should be rare. You could do a lot of this sort of trade through an auction house, or in-game email where ownership of an item is removed from a player and their DB and transferred to a whole different system.
3) High-speed exchange of stuff like hit points, or buffs, doesn't seem like the kind of thing players would care about if there was a catastrophic server failure. They care about whether they still have the sword-of-uberness, but not whether they are at full health after a server restart.

Some people might consider functional-decomposition to get better DB performance. E.g. split the DB's by their function: eCommerce, inventory, player state, quest state, ... But that only gets you maybe 10 or 12 instances. And the inventory DB will have half the load, making 90% of the rest of the hardware a waste of money. On the other hand, splitting the DB with data-decomposition (across player_id), you get great parallelism, and scale up the cost of your hardware linearly to the number of players that are playing. And paying.

Another cool thing about this approach is that you don't have to use expensive DB tech, nor expensive DB hardware. You don't have to do fancy master-master replication. Make use of the application knowledge that player to player interaction is relatively rare, so you don't need transactionality on every request. Avoid that hard problem. It costs money and time to solve.

There is a phrase I heard from a great mentor thirty years ago: "embarrassingly parallel". You have an incredible number of players attached, and most actions that need to be persisted are entirely independent. Embarrassing, isn't it?

Now your only problem is how big to make the rest of the cluster around this monster DB. Where is the next bottleneck? I'll venture to say it is the game design. How many players do you really want all stuffed into one back alley or ale house? And how much content can your team produce? If you admit that you have a fixed volume of content, and a maximum playable density of players, what then?

Thursday, October 21, 2010

A lightweight MMO using web tech

I've been really head's down on Super Hero Squad Online. It has been getting really great reviews, and is very fun to play. Beta is coming up, and we are expecting to go live early next year.

So I thought I'd try to cut some time free to record my thoughts about its architecture. But not today. We are getting ready for an internal release tomorrow.

But here are some topics I'd like to pursue:

scaling across multiple data centers
horizontal DB scalability
an approach that avoids back end race conditions when handing off a client between front end boxes
big load testing

Sunday, February 14, 2010

Not seamless and not fun

A lot of the techniques I've discussed have been intended to address the difficult challenges in creating seamless worlds: interest management, load balancing, coordinate systems, consistency, authoritative servers,... But I want to take a moment to say: just because you can, and perhaps have solved these hard technical problems doesn't make you successful. You can avoid the problems by making other game design choices (compromises?) like instancing.

In fact, something like instancing may be inevitable in your game, if you want it to be fun. For example, a fun quest might require no one else be able to spoil things by stealing kills or your hard won loot.

And even if you have a nice seamless world, and you've squeezed out difficult latency issues in the seamless areas, you can still screw up the game so badly people drop out. I played DDO the other day. Turbine is famous for having a pretty good architecture. But boy, is it a pain to play that game. Line up on the crate; hit the crate; walk forward; click on the coins; repeat; repeat; repeat. Line up on a monster, press the mouse and hold it, and hold it, and hold it. Wander around in circles in a dungeon and get lost. Run from one corner of the dungeon to the other pulling levers. Do a quest, report back to the same dude. Gah! Didn't they ask a newb if all that was fun? No wonder people are micro-trans-ing their way up levels.

The newb experience requires repeated entering of instances. It happens two or three times in the first 15 minutes. And it takes 20 seconds or so to load (I'm trying not to exaggerate). But 20 seconds in an "action" game is forever. Ok. Your server architecture and your game design require instances. Fine. But what did the newbs say about how fun the experience was. I, for one didn't like it. Did the developers sit around and say: well, tough, that's the way it has to work. Maybe. But you *can* fix the experience.

Have auto-loot turned on by default. Don't hide the loot in 50 crates; give out more on a kill or a quest completion. Pre-load the instances in the background whenever the player gets near the entrance. It doesn't have to be seamless, but at least *try* to make it fun. Or less teeth grating-ly aggravating. Sorry, I'm not patient enough for it to "start getting fun after level 20". Even if it is free.

--end first rant--

I heard someone the other day say "hey we can't give away the good stuff, the players should have to earn it (or pay for it)." Why? Give the newbs the good stuff right away. Let them get hooked. I think it was Warhammer 40k where the first mission let you play some of the best vehicles and blow a lot of stuff to scrap. Pretty fun. Then the story line took it away. But at least you knew it was going to be a fun game. After only 5 minutes.

Let's get creative. Just because there appear to be insurmountable limitations (instancing, giving away the good content...) doesn't mean they really *are* insurmountable.

Good luck out there.

Monday, January 18, 2010

Dynamic Load Balancing a Large Scale Online Game

Why bother designing your server to support dynamic load balancing? You can load test, measure and come up with a static load balance at some point before going live, or periodically when live. But...

Your measurements and estimates will be wrong. Be honest, the load you simulated was at best an educated guess. Even a live Beta is not going to fully represent a real live situation.
Hardware specs change. It takes time to finish development, and who knows what hardware is going to be the most cost effective by the time you are done. You definitely don't want to have to change code to decompose your system a different way just because of that.
Your operations or data center may impose something unexpected, or may not have everything available that you asked for. You might think "throw more hardware at the problem". But if they are doing their jobs, they won't let you. And if you are being honest with yourself, you know that probably wouldn't have worked anyway.
Hardware fails. You may lose a couple of machines and not be able to replace them immediately. Even if you shut down, reconfigure, and restart a shard, the change to the load balance must be trivial and quick. The easiest way is to have the system itself adjust.
Your players are going to do many unexpected things. Like all rush toward one interesting location in the game. Maybe the designers choose to do this on purpose using a holiday event. Maybe they would really appreciate if your system could stand up to such a thing so they *could* please the players with such a thing.
The load changes in interesting sine waves. Late at night and during weekdays, the load will be substantially less than at peak times. That is a lot of hardware just idling. If your system can automatically migrate load to fewer machines, and give back leased machines (e.g. emergency overload hardware) you might be able to cut a deal with your hosting service to save some money. Anybody know whether the "cloud" services support this? What if you are supporting multiple titles whose load profiles are offset. You could reallocate machines from one to another dynamically.
Early shards tend to have quite a lot higher populations than newly opened ones. As incentives to transfer to new shards start having effect, hardware could be transferred so that responsiveness can remain constant while population and density changes.
The design is going to change. Both during development and as a result of tuning, patches and expansions. If you want to let the designers make the game as fun (and successful) as possible, you don't want to give them too many restrictions.
It may be painful to think about but your game is going to eventually wind down. There is a long tail of committed players that will stay, but the population will drop. If you can jettison unneeded hardware, you can save money and make that time more profitable. (And you should encourage your designers to support merging of shards.) I am convinced that there is a lot of money left on the table by games that prematurely close their doors.

So what can you you actually "balance"? You can't decompose a process, reallocate objects, computation and data structures between processes. Not without reprogramming. Or programming it in from the beginning. So load balancing entire processes is not likely to cut it.

The best way to do that is design for parallelism. Functional parallelism only goes so far. E.g. if you have only one process that deals with all banking, you can't split it when there is a run on the bank.

So what kinds of things are heavy users of resources? What things are relatively easy to decompose into lots of small bits (then recombine into sensibly sized chunks using load balancing)?

Here are some ideas:

Entities. Using interest management (discussed in other blog entries), an Entity can be located in any simulator process on any host. There are communication overheads to consider, but those are within the data center. If you are creative, many of the features that compose a server can be represented as an Entity, even though we often limit our thinking to them as game Entities. E.g. a quest, a quest party/group, a guild, an email, a zone, the weather, game and system metrics, ... And of course, characters, monsters, and loot. The benefit of making more things an Entity is that you can use the same development tools, DB representation, execution environment, scripting/behavior system and SDK, ... And of course, load balancing. There are usually a very large number of Entities in a game, making it pretty easy to find a good balance (e.g. bin-packing). Picking up an Entity is often a simple matter of grabbing a copy of its Properties (assuming you've designed your Entities like this to begin with; with load balancing in mind). This can be fast because Entities tend to be small. Another thing I like about Entity migration is that there are lots of times when an Entity goes idle, making it easy to migrate without a player being affected at all. Larger "units" of decomposition are likely to never be dormant, so when a migration occurs, players feel a lag.
Zones. This is a pretty common approach, often with a number of zones allocated to a single process on a host. As load passes a threshold, the zone is restarted on another simulator on another machine. This is a bigger chunk of migration than an Entity, and doesn't allow for an overload within one zone. The designers have to add game play mechanisms to discourage too much crowding together. The zone size has to be chosen appropriately ahead of time. Hopefully load-balancing-zone is not the same as game-play-zone, or the content team will really hate you. Can you imagine asking them to redesign and lay out a zone because there was a server overload?
Modules. You will decompose your system design into modules, systems, or functional units. Making the computation of each of these be able to be mapped requires little extra work. Although there are usually a limited number of systems (functional parallelism), and there is almost always a "hog" (See Amdahl's law). Extracting a Module and moving it requires quite a bit more unwiring than an Entity. Not my first choice. But you might rely on your fault tolerance system and just shut something down in one place, and have it restart elsewhere.
Processes. You may be in a position where your system cannot easily have chunks broken off and compiled into another process. In this case, only whole processes can be migrated (assuming they do not share memory or files). Process migration is pretty complicated and slow, given how much memory is involved. Again, your fault tolerance mechanism might help you. If you have enough processes that you can load balance by moving them around, you may also have a lot of overhead from things like messages crossing process boundaries (usually via system calls).
Virtual Machines. Modern data centers provide (for a price) the ability to re-host a virtual machine, even on the fly. Has anyone tested what the latency of this is? Seems like a lot of data to transmit. The benefit of this kind of thinking is that you can configure your shard in the lab without knowing how many machines you are going to have, and run multiple VM on a single box. But you can't run a single VM on multiple boxes. So you have that tradeoff of too many giving high overhead, and too few giving poor balancing options.

Remember, these things are different from one another:

Decomposition for good software engineering.
Decomposition for good parallel performance.
Initial static load balancing.
Creating new work on the currently least loaded machine.
Dynamically migrating work.
Dynamically migrating work from busy to less loaded machines.
Doing it efficiently, and quickly (without lags)
And having it be effective.

I think balancing load is a hard enough problem that it can't really be predicted and "solved" ahead of time. So I like to give myself as much flexibility ahead of time, and good tools. Even if you don't realize full dynamic migration at first, at least don't box yourself into a corner that requires rearchitecting.

Pages