Thursday, April 30, 2009

Why would you want gateway boxes?

An important aspect of an MMO shard is where and how clients connect. My preference is to have a set of Gateway boxes that are responsible for managing those connections. Their responsibilities include:
  • Authentication handshakes with the client and account system
  • Rapid filtering of malicious traffic (IP filtering or more): denial-of-service attacks, smurfing (billions of identical, apparently innocuous/legal requests), and so on
  • Separation of responsibility. This is valuable in a memory and cache efficiency sense, in that a single machine is focusing on just one thing. Simulators don't have to keep track of clients. They only connect to GW boxes.
  • Instead of N clients each connecting to up to K simulators (N*K connections), you have only N client connections, each to one of J gateways, and the K simulators only connect to the J gateways (N+J*K connections). Since N is normally quite a bit bigger than J (say 4 times or more), this is a huge advantage in connection count, connection management processing overhead, and memory on both client and server hosts.
  • Message exploding (fan-out) happens in the data center, so the majority of connection traffic runs over high-speed switches, in a secure environment on the backend switch.
  • By focusing only on GW functions, and not running game logic, they can be made much more reliable than simulator boxes/processes. This can help a lot with player experience during fault recovery.
  • The Gateway boxes are the only ones with public IP addresses, so it allows a large fraction of your shard to be secure by having no direct route from the WAN. The idea here is GW boxes have (at least) two NICs, one on the switch with the main firewall, the other on the backend network.
  • This also has physical network topology benefits, since the back end hosts can be on their own switch.
  • Message header overhead is reduced when sending to a client, since all data to one client comes from a single shard host, which can also bundle messages headed over the WAN (most important).
  • Gateways can also "be" the lobby or character selector prior to entering the game world.
  • Non-simulation messages like chat or game services (auctions, guild management, email, patching/streaming of content, ...) don't bother the simulators.
  • Sizing and configuring for your peak player connections is now independent of simulator sizing and load balancing.
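The connection-count arithmetic from the list above is easy to check. A minimal sketch (the shard sizes below are hypothetical, just to make N, J, and K concrete):

```python
def connections_without_gateways(n_clients: int, k_simulators: int) -> int:
    # Every client may hold a connection to every simulator: N*K.
    return n_clients * k_simulators


def connections_with_gateways(n_clients: int, j_gateways: int,
                              k_simulators: int) -> int:
    # Each client holds one gateway connection, and every simulator
    # connects to every gateway on the backend network: N + J*K.
    return n_clients + j_gateways * k_simulators


# Hypothetical shard: 10,000 clients, 10 gateways, 40 simulators.
print(connections_without_gateways(10_000, 40))    # 400000
print(connections_with_gateways(10_000, 10, 40))   # 10400
```

Forty thousand fewer connections to manage, and the client side drops from K connections to one.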
I also subscribe to the philosophy of persistent client connections (or with connectionless protocols, staying with one assigned gateway). The major benefit of this is that a client does not have to reauthenticate and renegotiate their connection with another host in the shard when their character moves around in the world, or some other load balancing activity changes the simulators they need to interact with.

To do this, the GW is also responsible for routing messages between the client and the simulators that are "of interest". This gets back to category based routing and channel managers discussed earlier. Data from multiple simulators is sent to the GW box and forwarded to each interested client.

Data from the client tends to be routed to the one Simulator that client is currently using. I.e. the one that owns its "controller" entity where client requests are validated, and (normally) their player character is owned/simulated.

You want multiple gateway processes (not multiple threads) per gateway box, so that a crash loses fewer player connections (and triggers fewer reauthentications, etc.). This also helps with per-process file descriptor limits on the connections, if your OS configuration imposes them.

There are downsides, but not overwhelming:
  • An extra hop for most messages. This hop is on a datacenter switch, and will be very fast.
  • There are extra machines to buy. Well, not really: the same message handling work is being done, just not directly by the simulators, so each simulator can get more done (and that has other, more subtle communication benefits). We just partition the work, and use the same number of machines.
  • An extra switch and extra NICs. You can use two VLANs on any decent switch, if you have to.
In summary, you are just moving some work from one place to another in the same sized shard, but getting a lot of system simplicity, security, and communication benefits.

Friday, April 24, 2009

Peer to What?

Ever notice how people use exactly the same word to make an important differentiation? Shorthand, laziness, different backgrounds, or maliciousness?

So in our industry, what do people really mean, or think they mean, when they ask "Hey. Does that support peer to peer?" Here are a couple of definitions and some clearer terminology that I prefer. Maybe I can find some sources to back me up. BTW, I'm thinking of small-scale online today, although many/most of my architectural biases apply here as well (other than whether the server is hosted or not).

Academically speaking, peer-to-peer is a communication topology that, in general, indicates that clients or players communicate directly with each other. Like peer-to-peer file sharing. It is used in contrast with the client-server topology where each client has a single connection to the server, and any interaction between clients is performed by means of the server. One benefit of the peer to peer topology is that data can move between clients with one hop of latency instead of two. One benefit of client-server is that each client can be presented the same data in the same order (deterministically).

However, many people are less concerned with whether two clients communicate directly, but instead use the term peer-to-peer to indicate they desire other features:
  1. There is no part of the game hosted in a central location like a data center. Note that central-server or authoritative-server is not the same thing as hosted-server.
  2. There is no stand-alone server that needs to be stood up ahead of time, possibly occupying another piece of hardware. When the player starts their client, the multiplayer game is automatically ready to go. There is no separate server process consuming extra resources on one of the users' machines.
  3. There is no single point of failure in the distributed system. If the master (usually the first player to start) were to drop out of the game, there is no reason for the remaining players to go through matchmaking again.
It turns out that a client-server topology can be used and still achieve all three of the above features. Obviously, "hosted" or not in #1 is easy. Just run the server on someone's machine. The long-running server in #2 is easy. Either start one up and shut it down when the client is started or stopped, or you join someone else's game session. Or if you are resource constrained (like on a console), and can't abide the second process, make a dual-purpose client. The first player in would have their client become the master/authoritative server. #3 requires a little bit of technology. When the master drops out of the game, another of the clients must become the master, and the other clients connect directly to them. This gets a little tricky in a non-uniform network, but it is a problem that is solvable with automatable algorithms, so the players don't know that it is happening.
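The #3 handoff described above hinges on every surviving client agreeing on the new master without another matchmaking pass. One deterministic way to do that is a sketch like the following (the peer-id scheme is my illustration, not something from the original matchmaker):

```python
def pick_new_master(peer_ids, failed_master_id):
    # Every client runs the same rule over the same peer list, so they
    # all promote the same survivor -- here, the lowest peer id -- and
    # reconnect to it without going through matchmaking again.
    survivors = sorted(p for p in peer_ids if p != failed_master_id)
    if not survivors:
        raise RuntimeError("no peers left to promote")
    return survivors[0]
```

For example, if peers 3, 1, and 7 are in a session and master 1 drops, everyone independently promotes peer 3. The non-uniform-network trickiness is in the reconnect step, not the election rule.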

So even if you have some of the three requirements above, you don't have to jump to the conclusion that client-server is therefore not appropriate. And given how much easier it is to get a system running that assumes there is a single authoritative server, I'd recommend you look into it, and prove (i.e. measure) whether you have an intractable problem with client-server.

In the meantime, I'm going to continue to be skeptical when folks say "I need peer-to-peer". My question will be "what feature of peer-to-peer do you think you need?". My "favorite" topology for small-scale non-hosted multiplayer is what I call "peered-server", to contrast it with hosted-server: a client-server communication topology, a single authoritative simulator, and no central persistent hosted server process (other than a matchmaker, which is not the same thing as a simulator).

Really, the thing to keep in mind is that the process communication topology does not have to equate to the network communication topology. It could be a little more efficient if they match, but could be a lot harder to get your project finished.

Monday, April 20, 2009

How to ignore most emails (safely)

At work, I use Outlook. It has "organization" tools. One of the best is found here: View/ArrangeBy/Custom/AutomaticFormatting. It gives you a set of conditions and formatting for each line in the list view. I use the default of unread mail formatting with a bold font.

1) I use my favorite font color for emails that are sent only to me: select "where I am:" "the only person on the To line". Even if others are CC'd on this email, it was sent directly to you. You'd better read it. No one else will.

2) My second favorite font color for emails sent specifically to me (and others): select "where I am:" "on the To line with other people". Someone took the time to direct this email directly to you and some of your peers. You should probably read it. Maybe one of your peers will respond, but you never know.

3) My third favorite font color for emails that I was specifically CC'd on: select "where I am:" "on the CC line with other people." (also covers where you were the only person on the cc line). Someone took the time to include you on the thread, but you know it is FYI (for your information) only, so you can almost certainly ignore it without getting in trouble, or until you have some time. This allows you to decide based on the subject line, for example. Many #3's turn into someone else's thread, and I can get by with skimming only a couple of those.

4) My least favorite color is "all the rest", and will be the default formatting. These will almost always be delivered to you because you are on an email group. These can almost always be ignored for several days. If you happen to be on one or two where that is not the case, make a custom rule to highlight those. If you are on a ton of groups that each need immediate attention, you are probably doing something else wrong that I can't give you empirical advice about. I know I wouldn't survive long under that kind of stress. Don't borrow that kind of trouble. Seriously consider whether your response in those forums is honestly needed so urgently. I'll bet that if you ignore it and you are the only one who can respond, you are eventually going to get a #1 or #2 email on it.

I also use a custom font color for "followup" emails that I have specifically marked as needing my attention later. Not only do they show up as red, but they sort to the "bottom" of my list view (I read my list the way a console window scrolls).

I don't empty my inbox. I think that is "busy work", and just puts important items out of view in other mail boxes where I wind up forgetting about them. I only open half or 2/3rds of my emails, and only when I decide I have time. I'll flip the window open (left on list view) and see if there are any #1's and then just close it. That means almost half my emails are left in the "bold" state, and I don't feel guilty about that. Those are emails that were broadcast, not sent "to" me expecting me to solve a problem. They were FYI to me. If it was an important broadcast, it probably had the "high priority" flag set.

I also use the preview window and set auto-read on (marks the item as read if you "preview" it for more than a few seconds). If I get the sense I need to read this more carefully or later, I'll manually flip it to unread so it stands out and I'll reevaluate it later.

These tricks work well in meetings when you think you need to glance at your queued emails, or any other time when you are very pressed for time. A simple glance at the font colors gets you focused on the one or two emails you should look inside. Often there is little or nothing to do with even those, so mark them unread for when you have a block of time to "clear" the backlog.

Thursday, April 16, 2009

What does a "Flexible" architecture look like?

We've all heard the maxim that the only constant is change. This is true on a lot of levels. During the development of a title, the designer is going to come up with some doozies..."Hey, what if the player controls the character *and* his minions?" Maybe it is an important change, but if you have a big complicated (and maybe purpose-built) MMO system, things like that can give you nightmares.

So even if you know (or think that you know) what the game is going to look like, 3-5 years down the road, it won't, and you'll have a lot of difficult refactoring if you don't plan for change. Put in insulation layers so that big changes don't propagate very far. For example, don't assume you are going to be using SQL, and start coding Entity behaviors with embedded queries. Instead, define a persistence interface that could be implemented by any number of technologies. Maybe it starts as a flat file, or an XML DB. But by the time you go live, Microsoft bought your company and they want the DB on SQL Server. And while we are talking about the DB, don't *ever* talk to the DB with a synchronous query that blocks gameplay. You might be surprised by the variance in response times even a good Oracle DB will give. My horror story: chances were 50:50, when we deployed a new version on TSO that had a schema change, that the DBAs would migrate the data in some hand-cobbled way. And (get this) an index file would disappear. Things looked ok until you got a few hundred live customers connected.
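A persistence interface of the kind described above might look like this sketch (the class and method names are illustrative, not from any real engine):

```python
import json


class PersistenceStore:
    """The interface Entity behaviors code against; backends are swappable."""

    def save(self, entity_id: str, state: dict) -> None:
        raise NotImplementedError

    def load(self, entity_id: str) -> dict:
        raise NotImplementedError


class InMemoryStore(PersistenceStore):
    # Could just as well be a flat file, an XML DB, or SQL Server:
    # the game code never embeds a query, so swapping backends
    # doesn't ripple past this module.
    def __init__(self) -> None:
        self._rows: dict[str, str] = {}

    def save(self, entity_id: str, state: dict) -> None:
        self._rows[entity_id] = json.dumps(state)

    def load(self, entity_id: str) -> dict:
        return json.loads(self._rows[entity_id])
```

When Microsoft shows up wanting SQL Server, you write one new subclass instead of auditing every behavior script for embedded queries.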

Where was I? Insulation. An architecture that can withstand or localize pretty radical changes is simultaneously flexible. You can use the parts in different ways, or replace things you don't like. And when you replace them the change doesn't ripple very far. I'm arguing you need to do this even for a purpose-built MMO engine, so doing the same thing for a middleware MMO engine results in no net impact on performance or usability. And since a middleware developer wants to sell their engine to studios doing a variety of titles, that flexibility is required just to make the sale.

I like to think about sitting across from a hard-nosed tech lead in a sales meeting who absolutely *NEEDS* a certain feature that we don't have. I can say: no, we didn't do that, but it's flexible, and you can swap that piece out. They can start with what we have, and do the swap out if they have time; which I cynically think doesn't happen very often. When have you had "extra" time during development? Well, at least we made the sale. It's a lot better than saying no we don't do that, our way is more efficient, and it will be wicked-hard to change. OK, I have to share this too: GDC '07 or '06, I heard Tim Sweeney say: "Modularity is overrated". Makes you wonder why folks love to hate developing with Unreal.

How do you create insulation? Look into PIMPL (pointer to implementation) or Interfaces (pure virtual base classes) to hide all implementation detail from users of a module. Make the modules loosely coupled. Don't assume too much about ordering or synchronous interaction between modules. One of my favorite patterns is the publish/subscribe or producer/consumer pattern. One module sends a message that is categorized, and has no idea who might consume it. Modules that want that data or notification subscribe ahead of time, and run a handler when that message is sent. Now you can add new modules without even recompiling the sender code, much less adding a compile-time dependency. Turns out this approach is good for improved compile times too.
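A minimal sketch of that publish/subscribe pattern (the names are my illustration):

```python
from collections import defaultdict


class MessageBus:
    # Senders publish to a category with no idea who consumes it;
    # new modules subscribe without the sender ever being recompiled.
    def __init__(self) -> None:
        self._handlers = defaultdict(list)

    def subscribe(self, category: str, handler) -> None:
        self._handlers[category].append(handler)

    def publish(self, category: str, payload) -> None:
        for handler in list(self._handlers[category]):
            handler(payload)
```

A combat module could publish to "entity.died", and the quest, loot, and stats modules would each subscribe without knowing about one another, or about combat.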

Loose coupling between software modules is really great. Take it to the next step, and avoid (run screaming from?) synchronous interaction between processes. E.g. I wouldn't use CORBA. Too easy to create deadlocks. Instead, prefer sending an asynchronous message. Don't use critical sections in blocks of code, or locks on all kinds of data structures. Too easy to create deadlocks. Instead prefer sending an asynchronous message.
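Here is a sketch of that asynchronous-message style within one process: a single thread owns the data, and everyone else talks to it through a queue, so there are no locks to deadlock on. (The message shapes are made up for illustration.)

```python
import queue
import threading


def run_mailbox(inbox: queue.Queue, state: dict) -> None:
    # Only this thread ever touches `state`; other threads send
    # messages instead of taking locks.
    while True:
        msg = inbox.get()
        if msg is None:            # shutdown sentinel
            return
        op, key, amount = msg
        if op == "add":
            state[key] = state.get(key, 0) + amount


inbox: queue.Queue = queue.Queue()
state: dict = {}
worker = threading.Thread(target=run_mailbox, args=(inbox, state))
worker.start()
inbox.put(("add", "gold", 10))
inbox.put(("add", "gold", 5))
inbox.put(None)
worker.join()
print(state)    # {'gold': 15}
```

The callers never block waiting on a reply, and the same code barely changes if the mailbox later moves to another process on another host.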

I see a pattern developing here. It leads to the ability to map logical processes to other threads or other remote processes with very little change to your code. The event-oriented, message-based approach to distributed systems is very successful. I guess I've already talked about CSP (Communicating Sequential Processes) in an earlier post, so I'll just drift off.

Tuesday, April 7, 2009

Never Too Many Asserts

My friend Keisuke says an assert is like a circuit breaker. It does no harm to be in the circuit when things are operating normally. But when the voltage goes out of bounds, your program stops.

We all accept that a defect costs more to fix the later it is discovered. If you have already checked in, it affects your coworkers. If you have already shipped, you will have to go through a new release or public patch. The tenets of extreme programming say that if something is good, taking it to an extreme is probably better. So if we catch a defect in an assert while we are actually developing, nothing could be better. If our coworkers "protect" their modules with asserts, we can be more confident that our use of their stuff is legitimate. And we can go faster, and have a more reliable system.

I like to use a number of different kinds of asserts:
  • External Interface (ASSERT_EI): validates that the parameters to a method that is expected to be used by other major modules or by the customer are legal and in range. But they should not be used to validate user input. This is a super-critical assert. The more of these the better.
  • Internal Interface (ASSERT_II): validates the parameters to methods that exist for good code structure and are intended for use only within the current module. Private methods would use these. They are for defensive programming, and for reminding yourself what your assumptions were when you designed the method. They can provide a kind of documentation. These are less important than ASSERT_EI, but are great for debugging old code that you don't remember very well or that you didn't write. Again, the hope is that you can go faster. What if you took a shortcut and haven't finished a method for one use case? You can leave an assert such that if you accidentally use the method that way, you will obviously be reminded of the missing work, as opposed to having the method silently do nothing, or fail with some bizarre side effect.
  • Internal Consistency (ASSERT_IC): validates that the state of a class remains consistent as it is manipulated. It is an invariant check. The design of your module makes assumptions about its data structures (e.g. a data item is in a list only once). Assert this periodically. I like to add a SanityCheck method to almost every class I build. It executes all the invariant checks I can think of (and yes, it can be pretty slow). It is especially useful to sprinkle around if you are currently tracking down a bug. It makes sense to verify your invariants going into and out of a method, and centralizing those invariants in a SanityCheck function makes that easy.
  • External Consistency (ASSERT_EC): I don't use this very much, since nicely modular systems should not have tight interdependencies. For example, your module may be dependent on the configuration file being parsed before it is initialized. An ASSERT_EC can check (and document) that assumption.
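Here is a sketch of how those assert levels might be wired up, with a runtime switch per level so that a disabled level skips even evaluating the predicate. The level names come from the list above; the Python shape (and the withdraw example) is my own illustration of the idea, where a C++ version would use macros and a global boolean check:

```python
# Runtime switches: flip a level off and its checks cost almost nothing.
ASSERT_LEVELS = {"EI": True, "II": True, "IC": True, "EC": True}


def shard_assert(level: str, predicate, message: str = "") -> None:
    # Cheap dictionary check first: a disabled level never even calls
    # the predicate, which is what keeps heavily executed code fast.
    if ASSERT_LEVELS.get(level) and not predicate():
        raise AssertionError(f"ASSERT_{level}: {message}")


def withdraw(balance: int, amount: int) -> int:
    # External interface check on the caller's parameters.
    shard_assert("EI", lambda: amount >= 0, "amount must be non-negative")
    new_balance = balance - amount
    # Internal consistency check on our own invariant.
    shard_assert("IC", lambda: new_balance <= balance, "balance must not grow")
    return new_balance
```

Per-module switches or a level-of-detail argument would slot into the same gate, and the predicates-as-callables mirror how a disabled C++ assert macro compiles the check away entirely.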
I've seen game programmers take a couple of positions against asserts:
  • They slow down the execution of debug builds so much that the app becomes unusable. OK. It really is too slow. There should be easy ways to disable the asserts especially in heavily executed bits of code. In fact, one should think twice about putting asserts inside tight loops. Also, your assert implementation should be able to runtime skip the predicate based on runtime configuration files, and do it fairly efficiently (e.g. check a global boolean). Disabling per module would be a good start. Providing a level-of-detail argument to the asserts should be possible.
  • I keep having to skip some assert in someone else's code that I don't understand, and that interferes with my workflow. Everyone should be careful not to use asserts to remind us of work that needs doing. Some other mechanism should be used. Or wrap those checks in a per-developer ifdef. On the other hand, if a developer is hitting asserts they don't understand, maybe they are breaking code they don't understand. Using either comments or change-control software, a developer should find the author of the assert and learn about that bit of the system, or get them to change the assert if it is now obviated.
  • I changed one little thing and all these asserts started going off. This one is pretty funny. Given that the asserts are there for a reason, there is a pretty good chance that the one little change had much further-reaching implications than expected. For example: someone now wants to handle a member being NULL. But ASSERT_IC's start going off saying that the member shouldn't be NULL. The thing is, if the rest of the class was built assuming that member can never be NULL, it could easily dereference the pointer without checking. An argument that "it should have checked for NULL" doesn't fly. The assumption was built in.
Adopting these "extreme" uses of asserts might even take the place of unit tests for that class.

Wednesday, April 1, 2009

In Game "Transactions"

There are times when two players (or even other kinds of Entities) need to trade an object or perform some other action that is critical to occur transactionally. Generally, this means that no matter what kind of error occurs, the transaction either occurred completely or did not occur at all. E.g. I paid you 10K, and got a gold bar.

Support for transactions needs to work across server hosts. It is a very disruptive experience to be able to interact in one spot, but not 2 feet to the left. You have to assume that the players can purposely crash one or the other host at the worst possible point in the interaction (like the Byzantine Generals problem). Assuming you can trust your database, the idea is simply to get both sides of the interaction saved to the DB simultaneously (i.e. in one DB transaction).

What if the two Entities are on different hosts? You have to simultaneously change remote variables and get the persistence request from two places to the DB, blah, blah, distributed handshake, Byzantine...brain fry. They do this in COBOL for banking systems. How hard can it be? Well, it is too hard to be worth it (if you include failure/backout/retry, etc.), even if you think it would be a fun challenge.

So here is a fantastic consequence of the non-geometric load balancing mechanism we've been talking about. Each simulator is single-threaded (if you use multiple threads, you have multiple simulators). An Entity Behavior will execute without preemption. So straight-line code will run to completion, a DB save request can occur, and all is good. All you need to do is get both Entities onto the same simulator, run that straight-line code, and co-persist the two local Entities (i.e. send the DB save request with both Entities' states). Boom. Done. Either the transaction makes it to disk or it doesn't. Not just half. Half-persisted transactions are the origin of a lot of the dupe bugs you've heard about.
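A sketch of that single-simulator trade (the entity dicts and the FakeDB stub are illustrative; a real persistence layer would batch both states into one DB transaction):

```python
class FakeDB:
    # Stand-in for the persistence layer: each co_persist call models
    # one DB save request carrying both Entities' states at once.
    def __init__(self) -> None:
        self.saves = []

    def co_persist(self, *entities) -> None:
        self.saves.append([dict(e) for e in entities])


def trade(db, buyer, seller, price, item):
    # Both Entities are on this single-threaded simulator, so this
    # straight-line code runs to completion without preemption...
    buyer["gold"] -= price
    seller["gold"] += price
    seller["items"].remove(item)
    buyer["items"].append(item)
    # ...and both sides hit the DB in one request: all or nothing.
    db.co_persist(buyer, seller)
```

I paid you 10K and got a gold bar, and either both state changes make it to disk in that one save, or neither does.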

The fact that migration policy and mechanism are separate means that adding a new policy is easy. E.g. migrate the guy I'm about to interact with over here (or me over there). Once it finishes, the transactional behavior can be scheduled. If things are a little crazy, it may take a while, but it won't fail and start giving away money. And it won't tell the user the transaction succeeded when it didn't.

Clearly, this is not something you want to do all the time. It would be too slow. Just for the stuff that *really* matters to the users.

If you don't like the thought of your Entities migrating around, build an Escrow Entity. Place one side of the transaction into it and co-persist. Migrate to the other side, place the other half in, and co-persist. Move out the one half, persist, migrate, move out the other half, persist, done. If there is a failure at any point, the Escrow object is known to the DB and it can continue, or return the goods. Just like a lawyer, but not as expensive.
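The escrow dance above can be sketched step by step. Here the Escrow Entity is just a dict and the FakeDB stub stands in for the real persistence layer (both are illustrative; the migrations between simulators are elided to comments):

```python
class FakeDB:
    # Each co_persist call models one DB save request recording the
    # escrow's current state along with one party's state.
    def __init__(self) -> None:
        self.saves = []

    def co_persist(self, *entities) -> None:
        self.saves.append([dict(e) for e in entities])


def escrow_trade(db, side_a, side_b, item_a, item_b):
    escrow = {"holding": []}
    # On A's simulator: take A's half, co-persist A with the escrow.
    side_a["items"].remove(item_a)
    escrow["holding"].append(item_a)
    db.co_persist(side_a, escrow)
    # Escrow migrates to B's simulator: take B's half, co-persist.
    side_b["items"].remove(item_b)
    escrow["holding"].append(item_b)
    db.co_persist(side_b, escrow)
    # Hand out the goods, persisting at each step; after a crash the
    # DB knows the escrow's state and can continue or return the goods.
    side_b["items"].append(escrow["holding"].pop(0))   # B gets A's item
    db.co_persist(side_b, escrow)
    side_a["items"].append(escrow["holding"].pop(0))   # A gets B's item
    db.co_persist(side_a, escrow)
```

Four saves instead of one, which is why you reserve this for the stuff that *really* matters, but no step can fail in a way the DB doesn't know how to unwind.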

Either of these approaches can support multi-Entity Transactions.