Friday, August 21, 2009

What is software architecture (in 15 pages or less)?

Kruchten's "4+1 views of software architecture" is one of my favorite papers of all time. It shows four views of software architecture (logical/functional, process, implementation/development, and physical). The plus one is use-cases/scenarios.

Being 15 pages long, it is an incredibly efficient use of your time if you want to disentangle the different aspects of designing complex systems. And it gives you terminology for explaining what view of the system you mean, and which you have temporarily abstracted away.

Don't get hung up on the UML terminology:
  • Logical is what is commonly referred to as a bubble diagram. What components are in your system and what are they responsible for?
  • Process is which OS processes/threads will be running and how they communicate.
  • Implementation is the files and libraries you used to build the system.
  • Physical is your hardware and where you decided to map your processes.
I prefer the original paper.

Monday, August 10, 2009

Failover is actually kind of hard (and expensive)

You are building a peer-to-peer or peered-server, small-scale online game. There are players that purposely disconnect as soon as someone starts beating them. They think their meta score will be better, or whatever. Or maybe they have a crappy internet connection. In any case, the master simulator/session host goes down abruptly; now what?

At least you would want to keep playing with the guys you've matched with; the rest of your clan or whatever. At best, you'd like to keep playing from exactly the point of the "failure" without noticing any hiccup (good luck). The steps to get session-host migration/failover to a new simulator:
  • Reestablish network interconnection
  • Pick a new master simulator
  • Coordinate with the Live Service about who is the master (or do this first)
  • Have the entity data already there, or somehow gather it
  • Convert replicated entities into real owned/master entities
  • Reestablish interest subscriptions, or whatever distributed configuration settings are needed
  • "Unpause"
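The steps above can be sketched as a simple ordered pipeline. This is a minimal sketch, not a real implementation: `FAILOVER_STEPS`, `run_failover`, and the handler names are all hypothetical, standing in for whatever your networking and replication layers actually expose.

```python
# Hypothetical migration pipeline. Each step name maps to a handler
# that returns True on success; the sequence aborts at the first failure
# so the caller can fall back (e.g. to a level restart or the lobby).

FAILOVER_STEPS = [
    "reestablish_connections",
    "pick_new_master",
    "notify_live_service",
    "gather_entity_data",
    "promote_replicated_entities",
    "resubscribe_interest",
    "unpause",
]

def run_failover(session, handlers):
    """Run each migration step in order; return the failing step, or None on success."""
    for step in FAILOVER_STEPS:
        if not handlers[step](session):
            return step
    return None
```

Keeping the sequence explicit like this makes it obvious where a partial failure leaves you, which matters because several of these steps can themselves fail mid-migration.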
Some challenges:
  • To reconnect, someone needs the complete list of IP/ports used by the players. But that is considered a security issue. E.g. someone could use that info to DoS attack an opponent. Let's assume the Live Service renegotiates and handshakes your way back into business.
  • If you aren't connected, how do you elect a new master? If you don't have a master yet, how would all the clients (in a strict client/server network topology) know who to connect to? So the answer has to be precomputed. E.g. designate a backup simulator before there is a fault (maybe the earliest joiner, lowest IP address...)
  • If your game session service supports this, it can solve both of the previous issues by exposing IP addresses only to the master simulator, and since it has a fixed network address, each client can always make a connection to it and be told who is the new master.
  • If the authoritative data is lost on the fault, you may as well restart the level, go back to the lobby or whatever. So instead, you have to send the entity state data to the backup simulator(s) as you go. This is more data than an online game that is not fault tolerant would ever exchange, since you'd have to send hidden and possibly temporary data for master Entities; otherwise you couldn't completely reconstruct the dead Entities. There may be data that only existed on the master, so gathering it from the remaining clients isn't going to be a great solution. Spreading the responsibility is that much more complicated.
  • So the backup master starts converting replicated Entities into authoritative Entities. Any Entities it didn't know about couldn't get recreated, so the backup master has to have a full set of Entities. Think about the bandwidth of that. You should really want this feature before just building it. Now we hit a hard problem. If the Entities being recreated had in-flight Behaviors (e.g. you were using coroutines to model behavior), they can't be reconstructed. It is prohibitively expensive to continuously replicate the Behavior execution context. So you wind up "resetting" the Entities, and hoping their OnRecreate behavior can get it running again. You may have a self-driven Behavior that reschedules itself periodically. Something has to restart that sequence. Another thing to worry about: did the backup simulator have a truly-consistent image of the entity states, or was anything missing or out of order? At best this is an approximation of the state on the original session host.
  • Unless you are broadcasting all state everywhere, you are going to have to redo interest management subscriptions to keep bandwidth under control. This is like a whole bunch of late-joining clients coming in at once. They would each get a new copy of each entity state. Big flurry of messages, especially if you do this naively.
  • Now you are ready to go. Notify the players, give them a count-down...FIRE!
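The precomputed-backup idea from the election challenge above can be made concrete: if every client runs the same pure function over the same roster, they all agree on the backup master without exchanging a single message. A sketch, assuming each peer is known by a hypothetical (join_order, id) pair:

```python
# Deterministic backup-master election. Every client evaluates this
# locally against the shared roster, so no coordination is needed:
# earliest joiner wins, ties broken by lowest id.

def elect_backup(peers, dead_master):
    """Pick the surviving peer with the smallest (join_order, id); None if nobody is left."""
    candidates = [p for p in peers if p["id"] != dead_master]
    if not candidates:
        return None
    return min(candidates, key=lambda p: (p["join_order"], p["id"]))["id"]
```

The key property is determinism: as long as everyone's view of the roster agrees, everyone elects the same backup. If the views can disagree (the partition case below), you still need the Live Service as a tiebreaker.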
What did we forget? What defines "dropped out"; a maximum unresponsiveness time? What if the "dead" simulator comes back right then? What if the network "partitioned"? Would you restart two replacement sessions? Do you deal with simultaneous dropouts (ever logged out right as the server went down? I have)?
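The usual answer to "what defines dropped out" is a heartbeat timeout. A minimal sketch (the class and its API are hypothetical; a real game would also want a reconnect grace period so the "dead" simulator coming back doesn't fight the new master):

```python
class DropoutDetector:
    """Declares a peer dropped after `timeout` seconds without a heartbeat."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.last_seen = {}  # peer_id -> timestamp of last heartbeat

    def heartbeat(self, peer_id, now):
        self.last_seen[peer_id] = now

    def dropped(self, now):
        """Return the peers whose silence has exceeded the timeout."""
        return [p for p, t in self.last_seen.items() if now - t > self.timeout]
```

Note the timeout is a tunable tradeoff: too short and a laggy-but-alive master triggers a spurious (and expensive) migration; too long and everyone sits frozen waiting to declare the fault.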

Note that the problem gets a lot easier if all you support is clean handoff from a running master to the new master. Would that be good enough for your game?

So is it worth the complexity, the continuous extra bandwidth and load on the backup simulator? Just to get an approximate recreation? With enough work, and game design tweaking, you could probably get something acceptable. Maybe give everyone a flash-bang to mask any error.

Or maybe you just reset the level, or go back to the lobby to vote on the next map. And put the bad player on your ban list.

Me? I'd probably instead invest the time of my network and simulator guys in something else, like smoothness, fun gameplay, voice, performance. Or ship earlier.