Friday, August 21, 2009

What is software architecture (in 15 pages or less)?

Kruchten's "4+1 views of software architecture" is one of my favorite papers of all time. It shows four views of software architecture (logical/functional, process, implementation/development, and physical). The plus one is use-cases/scenarios.

Being 15 pages long, it is an incredibly efficient use of your time if you want to disentangle the different aspects of designing complex systems. And it gives you terminology for explaining what view of the system you mean, and which you have temporarily abstracted away.

Don't get hung up on the UML terminology:
  • Logical is what is commonly referred to as a bubble diagram. What components are in your system and what are they responsible for?
  • Process is which OS processes/threads will be running and how they communicate.
  • Implementation is the files and libraries you used to build the system.
  • Physical is your hardware and where you decided to map your processes.
http://en.wikipedia.org/wiki/4%2B1_Architectural_View_Model
I prefer the original paper: http://www.cs.ubc.ca/~gregor/teaching/papers/4+1view-architecture.pdf

Monday, August 10, 2009

Fail over is actually kind of hard (and expensive)

You are building a small-scale online game, either peer-to-peer or with a peered server. Some players purposely disconnect as soon as someone starts beating them. They think their meta score will be better, or whatever. Or maybe they have a crappy internet connection. In any case, the master simulator/session host goes down abruptly; now what?

At least you would want to keep playing with the guys you've matched with; the rest of your clan or whatever. At best, you'd like to keep playing from exactly the point of the "failure" without noticing any hiccup (good luck). The steps to get session host migration/fail over to a new simulator:
  • Reestablish network interconnection
  • Pick a new master simulator
  • Coordinate with the Live Service about who is the master (or do this first)
  • Have the entity data already there, or somehow gather it
  • Convert replicated entities into real owned/master entities
  • Reestablish interest subscriptions, or whatever distributed configuration settings are needed
  • "Unpause"
Some challenges:
  • To reconnect, someone needs the complete list of IP/ports used by the players. But that is considered a security issue. E.g. someone could use that info to DoS attack an opponent. Let's assume the Live Service renegotiates and handshakes your way back into business.
  • If you aren't connected, how do you elect a new master? If you don't have a master yet, how would all the clients (in a strict client/server network topology) know who to connect to? So the answer has to be precomputed. E.g. designate a backup simulator before there is a fault (maybe the earliest joiner, lowest IP address...)
  • If your game session service supports this, it can solve both of the previous issues by exposing IP addresses only to the master simulator, and since it has a fixed network address, each client can always make a connection to it and be told who is the new master.
  • If the authoritative data is lost on the fault, you may as well restart the level, go back to the lobby or whatever. So instead, you have to send the entity state data to the backup simulator(s) as you go. This is actually more data than is necessary to exchange for an online game that is not fault tolerant, since you'd have to send hidden and possibly temporary data for master Entities. Otherwise you couldn't completely reconstruct the dead Entities. There may be data that only existed on the master, so gathering it from the remaining clients isn't going to be a great solution. Spreading the responsibility is that much more complicated.
  • So the backup master starts converting replicated Entities into authoritative Entities. Any Entities it didn't know about couldn't get recreated, so the backup master has to have a full set of Entities. Think about the bandwidth of that. You should really want this feature before just building it. Now we hit a hard problem. If the Entities being recreated had in-flight Behaviors (e.g. you were using coroutines to model behavior), they can't be reconstructed. It is prohibitively expensive to continuously replicate the Behavior execution context. So you wind up "resetting" the Entities, and hoping their OnRecreate behavior can get it running again. You may have a self-driven Behavior that reschedules itself periodically. Something has to restart that sequence. Another thing to worry about: did the backup simulator have a truly-consistent image of the entity states, or was anything missing or out of order? At best this is an approximation of the state on the original session host.
  • Unless you are broadcasting all state everywhere, you are going to have to redo interest management subscriptions to keep bandwidth within its limits. This is like a whole bunch of late-joining clients coming in. They would each get a new copy of each entity state. Big flurry of messages, especially if you do this naively.
  • Now you are ready to go. Notify the players, give them a count-down...FIRE!
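The "precomputed answer" idea from the challenges above can be sketched in a few lines. This is only an illustration: the Peer record, its fields, and both function names are made up for the example, and it assumes every machine already holds the same peer list (which, as noted, the Live Service may not want to hand out). The point is that if everyone sorts the same list the same way, nobody needs to talk to anybody to agree on the successor.

```python
from dataclasses import dataclass
from ipaddress import ip_address
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Peer:
    """Hypothetical record for one participant in the session."""
    name: str
    join_order: int   # 0 = first to join
    ip: str


def succession_order(peers: Iterable[Peer]) -> List[Peer]:
    """Rank peers identically on every machine: earliest joiner, then lowest IP."""
    # Compare addresses numerically; as strings, "10.0.0.10" sorts before "10.0.0.9".
    return sorted(peers, key=lambda p: (p.join_order, int(ip_address(p.ip))))


def pick_master(peers: Iterable[Peer], dead: Set[str]) -> Optional[Peer]:
    """Return the first surviving peer in the precomputed order, if any."""
    for peer in succession_order(peers):
        if peer.name not in dead:
            return peer
    return None


peers = [
    Peer("carol", 2, "10.0.0.7"),
    Peer("alice", 0, "10.0.0.12"),
    Peer("bob", 1, "10.0.0.3"),
]

backup = pick_master(peers, dead={"alice"})   # alice was the session host
```

Here bob takes over because he joined second. None of this addresses partitions or a "dead" host that comes back.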
What did we forget? What defines "dropped out"; a maximum unresponsiveness time? What if the "dead" simulator comes back right then? What if the network "partitioned"? Would you restart two replacement sessions? Do you deal with simultaneous dropouts (have you ever logged out when the server went down? I have.)?

Note that the problem gets a lot easier if all you support is clean handoff from a running master to the new master. Would that be good enough for your game?

So is it worth the complexity, the continuous extra bandwidth and load on the backup simulator? Just to get an approximate recreation? With enough work, and game design tweaking, you could probably get something acceptable. Maybe give everyone a flash-bang to mask any error.

Or maybe you just reset the level, or go back to the lobby to vote on the next map. And put the bad player on your ban list.

Me? I'd probably instead invest the time of my network and simulator guys in something else, like smoothness, fun gameplay, voice, performance. Or ship earlier.

Thursday, July 30, 2009

Incremental release vs. sequel

(Warning: plenty of irony below, as usual)

There was a time when the MMO developer community thought that the ideal was to stand up your world, and then start feeding the dragon. As quickly as possible, get new content into the players' hands. The more new content, the more fascinated they would be, the stickier their subscriptions would be and the more money you would make.

So we put a lot of effort into techniques to manage continuous development of content, test it, and roll it out with the minimum possible maintenance window. Some got good enough they could release content or patches every week. Didn't we have automated client patchers? Why not use those to continuously deliver content? Not just streaming content as you move around in virtual space, but as you move forward in real time.

Then someone noticed that Walmart took their game box off the shelf because it had been there for a year, and new titles were showing up. Surely consumers want the new stuff more? Besides, you don't need the latest release, you get all the new stuff when you patch. Then new subscriptions drop because of that lack of visibility at retail. Why would Gamestop sell prepaid cards if it is so easy to pay online?

So the light goes bing, and it is suddenly obvious that sequels would be a much better approach, since you'll get shelf space if it is a new SKU. Clearly this is the best approach, since it works so well for Blizzard. (So clearly, I cannot choose the glass in front of you. Where was I?) All you have to do is patch a few bugs, and set up a parallel team to work on the sequel.

But then everyone piles onto Steam. Definitely the end of brick and mortar. Maybe we go back to the low-latency content pipeline so our game is fresher than the sequel-only guys. But wait, Steam sales and free trials increase traffic at Gamestop.

Clearly, I cannot choose the glass in front of me... Wait til I get started.

Not really. As you can see, the point is that technologists are very unlikely to see the future of sales and distribution mechanisms. And if we did, it would take a year to adapt our development practices and product design to take optimal advantage of it.

The answer? Be flexible. Don't assume you've got the one and only magic bullet. Requirements change. And for an MMO the development timespan is large enough that a lot of things will change before you are finished. Don't implement your company into a corner with a big complex optimal single point solution, and keep your mind open.

Monday, July 13, 2009

Web tech for "game services"

I hold the opinion that every disagreement is a matter of different axioms, values or definitions.

I believe definitions is what is going on with this post by "Kressilac" (Derek Licciardi?):
http://blogs.elysianonline.com/blogs/derek/archive/2009/05/29/6400.aspx I'd guess we do hold the same values.

Derek argues that portions of an MMO server are suited to using and best implemented using web technology. I absolutely agree. I call these parts of the system "Game Services". Most would be accessed directly from the client. Examples:
  • profanity filtering,
  • shard status, open, full, down, locked, capped
  • in game search/player online,
  • clan/guild management,
  • item trade,
  • auction,
  • voting/elections,
  • chat,
  • match making/lobby,
  • leaderboards,
  • persistent messages/email,
  • reputation management/community services,
  • in-game advertising
  • Search,
  • authentication,
  • CSR account locking
  • patching, streaming patching
  • microtransactions
  • petitions
  • custom content
  • character annotation, friend lists,
  • knowledge base
  • voice chat
  • Maybe: inventory, quests, crafting (touches in-game entities)
Anyone got more for this list?

Most of these systems are "decorative" and are for the community aspects of the game.

The complication arises where the data managed by these services is affected by or used by the simulator (i.e. in-game logic). E.g. the number of members of your clan changes Mana recharge rate. I'd suggest that most of those kinds of communications don't need to be transactional or low-latency, or the game design can be bent to accommodate that restriction.

There are a couple of those game services (especially those interacting with items) that are entangled. The easiest way to deal with those is to transfer ownership of the Entities in question to one system or the other such that there is no synchronization needed other than at the transfer. I'm betting that is how WoW does auctions and mailing of items.
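A minimal sketch of that ownership-transfer idea, with made-up names throughout (Item, Owner and transfer are not from any real engine, and I have no idea what WoW actually does). Exactly one system holds an item at any moment, so the handoff is the only place the simulator and the web-tech service ever have to agree.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Item:
    item_id: int
    name: str


@dataclass
class Owner:
    """One system that can hold authoritative Entities (hypothetical)."""
    label: str
    items: Dict[int, Item] = field(default_factory=dict)


def transfer(item_id: int, src: Owner, dst: Owner) -> None:
    """Move authority for one item; the only point where the two systems sync."""
    item = src.items.pop(item_id)   # raises KeyError if src is not the owner
    dst.items[item_id] = item


sim = Owner("simulator", {1: Item(1, "Rusty Sword")})
auction = Owner("auction service")

transfer(1, sim, auction)    # player lists the sword; the simulator no longer sees it
transfer(1, auction, sim)    # auction ends; a simulator owns it again
```

In a real system the transfer itself would need to be durable and atomic across two processes, which this in-memory sketch conveniently ignores.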

My "run screaming; it sucks" article is my thinking about the core gameplay/simulator manipulated Entities. What Derek calls Real Time Data. To me that is the "hard problem". All the rest of the stuff can be handled by web-tech, and that is a solved problem (waves hand dismissively), and not so interesting.

Well. There are a few interesting issues, like coordinating authentication. But the coolest payoff (as Derek states) is that these things automatically become available offline via browsers, mobile devices, etc.

BTW, I'm working on another contentious article that more fully details the issues that drive my opinion about DB-centric game state management.

Monday, June 29, 2009

Google's Protocol Buffers (for messages and files?)

This is an interesting package:

http://code.google.com/apis/protocolbuffers/docs/overview.html



It can be used for on-the-wire and for file formats. It is much more efficient than XML, and has APIs for multiple languages (e.g. easier to send a message between apps written in different languages). It deals with versioning, and has automated class and serialization code generation.
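To get a feel for where the size savings come from, here is a hand-rolled sketch of how the wire format packs one integer field: a small key (field number plus wire type) followed by the value as a base-128 varint. This is not the library's API; the function names and the XML string used for comparison are invented for the illustration, and in practice the generated classes do all of this for you.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)   # high bit set: more bytes follow
        else:
            out.append(low7)          # high bit clear: last byte
            return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Encode one integer field: a key varint, then the value varint."""
    wire_type_varint = 0
    key = (field_number << 3) | wire_type_varint
    return encode_varint(key) + encode_varint(value)


binary = encode_int_field(1, 150)
xml = b"<health>150</health>"   # hypothetical XML equivalent, for comparison only
print(binary.hex(), len(binary), "bytes vs", len(xml), "bytes")
```

Field names never go on the wire, only field numbers, which is also a big part of how the versioning story works.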


I haven't used it yet. Any thoughts? How good is it for archive files/pack files? How good is its historical version handling wrt up-converting or semantic changes (like feet to meters)?

Tuesday, June 23, 2009

Entity Concurrency

Ever since Simula and Demos in the 60's, object-oriented (or process-oriented) simulation has been considered the most natural and intuitive approach to representing a system. Certainly it is the way we think when we write programs using imperative languages (like C++), even when taking advantage of multithreading.

Oversimplifying things, process-oriented entities have a "main loop" which continues to be in scope, on the stack but possibly suspended, even as the entity is idle or blocked waiting for something. An example would be a vehicle that stops at a red light. After the light goes green, the program continues with the next line of code (perhaps navigating to the nearest gas station). Conversely, in an event-oriented simulation, the car would be unscheduled and the light would have to signal the car to change its state from waiting to driving the next leg of the route. In the one case, the programmer can write all the code from the perspective of the entities (the vehicle, the light...). In the other, they write a soup of events and state changes that is very hard to visualize, and it is hard to see whether the logic is correct.

The easiest way to realize process-oriented entities is to take advantage of coroutines built into the scripting language of your choice. Using a thread per object can get horribly expensive, even if they were cooperative threads. Using a coroutine allows a program to choose to block between two statements and go idle, context switching to another entity. When the resource being blocked on is available, the system can switch back to the suspended entity.
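Here is the red-light example as a rough sketch using Python generators as the coroutines. Everything in it (Light, vehicle, light_controller, run) is invented for illustration, and the scheduler is the dumbest possible one: it polls every coroutine each round instead of waking an entity only when the thing it is blocked on becomes available.

```python
from collections import deque


class Light:
    """Hypothetical resource an entity can block on."""
    def __init__(self):
        self.green = False


def vehicle(name, light, log):
    """Process-oriented entity: the whole story reads top to bottom."""
    log.append(f"{name} arrives at the light")
    while not light.green:
        yield                      # suspend here; the scheduler runs others
    log.append(f"{name} drives to the gas station")


def light_controller(light, red_ticks, log):
    for _ in range(red_ticks):
        yield                      # stay red for this tick
    light.green = True
    log.append("light turns green")


def run(coroutines):
    """Round-robin scheduler: resume each coroutine until all have finished."""
    ready = deque(coroutines)
    while ready:
        co = ready.popleft()
        try:
            next(co)
            ready.append(co)       # still alive, schedule it again
        except StopIteration:
            pass                   # coroutine ran to completion


log = []
light = Light()
run([vehicle("taxi", light, log), light_controller(light, 2, log)])
```

Note that the vehicle code never mentions events or state machines; the suspended stack frame is the state.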

So the challenge is coordinating the objects. You can look at a previous article on bin/res for some ideas.

You can even have multiple concurrent activities running on an entity. E.g. monitoring your fuel level, driving a route, and listening for new orders from a taxi-dispatcher. Each would consume another coroutine. A complication here is that when a coroutine blocks and goes idle, the others might run and change the value of states in the entity. Fortunately this does not result in race conditions because coroutines are cooperative and only switch at a point that the programmer chooses. They can then be very aware that other processing might change things before they wake up again.
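A small sketch of two activities sharing one entity, again with Python generators and made-up names (Taxi, drive_route, monitor_fuel). The thing to notice is that the driver re-checks shared state right after each suspend point, because that is the only place another activity could have changed it.

```python
class Taxi:
    """Hypothetical entity whose state is shared by several activities."""
    def __init__(self, fuel):
        self.fuel = fuel
        self.needs_refuel = False
        self.log = []


def drive_route(taxi, legs):
    for leg in range(legs):
        taxi.fuel -= 1
        taxi.log.append(f"leg {leg} done, fuel {taxi.fuel}")
        yield                          # other activities may run here
        # State may have changed while suspended, so re-check it.
        if taxi.needs_refuel:
            taxi.log.append("route abandoned to refuel")
            return


def monitor_fuel(taxi, threshold):
    while not taxi.needs_refuel:
        if taxi.fuel <= threshold:
            taxi.needs_refuel = True   # no lock needed: nothing else runs now
        yield


taxi = Taxi(fuel=3)
activities = [drive_route(taxi, legs=5), monitor_fuel(taxi, threshold=1)]
while activities:
    for activity in list(activities):  # iterate a copy so removal is safe
        try:
            next(activity)
        except StopIteration:
            activities.remove(activity)
```

No locks, no atomics. The price is that you have to remember which lines can suspend.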

Coroutines can also be used to spread computation across multiple ticks without having to refactor the algorithm you are using. If you have a long AI or path planning algorithm, you could suspend at any point and resume during the next tick. The exact context is restored by the scripting language coroutine system. It might also make it easier to cancel such a computation at those suspend points.
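As a sketch of spreading work across ticks: the summing loop below is just a stand-in for a long planner, and find_path_cost, run_over_ticks and budget_per_tick are names I invented for the example. The loop body is the original algorithm; the only change is a yield every so many steps.

```python
def find_path_cost(step_costs, budget_per_tick):
    """Sum costs along a path, yielding whenever this tick's budget is spent."""
    total = 0
    for steps_done, cost in enumerate(step_costs, start=1):
        total += cost
        if steps_done % budget_per_tick == 0:
            yield None             # out of budget: resume on the next tick
    yield total                    # final result


def run_over_ticks(job):
    """Advance the job once per tick; return (result, ticks_used)."""
    ticks = 0
    for value in job:
        ticks += 1
        if value is not None:
            return value, ticks
    raise RuntimeError("job ended without a result")


result, ticks = run_over_ticks(find_path_cost(range(10), budget_per_tick=4))
```

Cancelling is then a matter of dropping the generator (or calling its close() method) at one of those suspend points.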

When thinking about online games, load balancing is critical. We do it by migrating entities. But if an entity has a suspended context in a coroutine at the point you want to do the migration, things get harder. How do you pick up the stack context and reconstruct it on the target machine?

One way is to refuse to migrate until the entire context tears down (e.g. it returns to the "main loop"). Better would be to make use of the scripting language facilities to serialize and restore a coroutine. Various folks have shown that Stackless Python is capable of pickling an entity that has an outstanding context, and reconstructing it (either after a save/load, or after migration).
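A sketch of the first option, refusing to migrate until the context tears down. As far as I know, plain CPython generators cannot be pickled the way Stackless tasklets can, so this version only ever serializes plain data. Entity, patrol and try_migrate are hypothetical names, and the "migration" is just a pickle round trip in one process.

```python
import pickle


class Entity:
    """Hypothetical entity with plain-data state and an optional running behavior."""
    def __init__(self, name, position):
        self.name = name
        self.position = position
        self.behavior = None           # a suspended generator, or None when idle

    def tick(self):
        if self.behavior is not None:
            try:
                next(self.behavior)
            except StopIteration:
                self.behavior = None   # context has torn down

    def can_migrate(self):
        return self.behavior is None


def patrol(entity, steps):
    for _ in range(steps):
        entity.position += 1
        yield


def try_migrate(entity):
    """Return serialized state, or None if a coroutine is still outstanding."""
    if not entity.can_migrate():
        return None
    return pickle.dumps({"name": entity.name, "position": entity.position})


guard = Entity("guard", position=0)
guard.behavior = patrol(guard, steps=2)

deferred = 0
payload = try_migrate(guard)
while payload is None:
    deferred += 1                      # load balancer has to wait a tick
    guard.tick()
    payload = try_migrate(guard)

restored = pickle.loads(payload)       # what the target machine would see
```

The obvious downside is that a long-running behavior pins the entity to its current host, which is exactly when you'd most want to move it.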

Has anyone tried "pickling" a coroutine in Lua? It's been on my todo list for a while. The question is whether references to (global) variables outside the coroutine can be "reattached" when it is unserialized on the other side. And in what format is the coroutine "printed" and transmitted?

Tuesday, June 16, 2009

Off Topic: Most common solid waste?

OK. Weird thought... If you went to a landfill, what would be the most common solid waste you saw? A number of years back, a greeny asked that question and challenged people to go see for themselves. It's been bugging me ever since. Here are some quick Googled results.

The most common waste product is paper (about 40 percent of the total). Other common components are: yard waste (green waste), plastics, metals, wood, glass and food waste. The composition of the municipal wastes can vary from region to region and from season to season. (U of Cal)

Paper, Organics (in Canada, from an Amazon "search inside" book)

Malaysia: Plastic waste is the most common solid waste that we generate in the country accounting for 7-12 percent by weight and 18-30 percent by volume of the total residential waste generated.

Throwaways (diapers) comprise 2 percent of the nation's solid waste by weight, making them the third most common solid waste item after newspapers and beverage and food containers. (diaper "activist" site, NY)

So "paper" is #1 and is quite recyclable, and energy rich. Plastic and organics too (possibly #2 and #3, if you include diapers; heh!).

Still doesn't answer what the "source" of the paper and plastic is. Fast food bags and wrappers? Retail boxes/packaging, grocery packaging, industrial/wholesale packaging? Books, junk mail, printouts?

I'd love to have a reference to a more detailed discussion of the source data. Not the reinterpreted-for-an-agenda summaries. And then make my own conclusions about reducing. (Maybe I should just look in *my* trash can at the end of the week).