Saturday, February 28, 2009

The Manifesto of Multithreading (for High Performance)

This document is a statement of principles that Emergent software developers adhere to concerning issues of concurrency, parallelism, multiprocessing and multithreading.

  • Multi-threading on a single processor is beneficial only in rare circumstances, such as when we expect a program to block repeatedly on I/O.
    • When concerned about performance, this is unlikely to be the case, so we should see only one or two threads created per core.
  • Operating system context switching between threads can be very expensive. It involves saving and loading processor state, tends to result in complete cache invalidation, and most expensively entails updating of OS process management data structures. For systems that provide virtual memory, reprogramming the MMU and swapping out kernel resources such as file pointers can make switching even more expensive. Some system calls can force a scheduling quantum to end.
    • Avoid context switching by keeping the number of threads low. Avoid system calls.
    • Decompose the problem into large segments to minimize the number of context switches and total overhead. This is most easily done by decomposing into as many pieces as possible, then using a policy to agglomerate them back into large sequential computations.
    • Reducing the number of threads will make the job of an SMP OS scheduler trivial and inexpensive. Ideally it could map a thread to a processor and leave it there indefinitely. Any load balancing needed can be accomplished by the application using app-specific knowledge without affecting the number of threads and their scheduling.
  • The measure of efficiency of using available multiprocessors is determined by how much processing time is wasted or spent idle.
    • For ease of measurement and tuning, an application should not try to consume as much processing as possible; it should consume as little as possible and then idle, making inefficiencies much more visible.
    • Never use a spin lock.
    • Idle time of this sort can be filled by work scheduled to be done on processors that are already fully utilized or overloaded. This is the reason for load balancing and is one component of thread scheduling.
  • Processors interfere with each other’s efficiency when accessing shared data. This happens directly if the other processor hits a lock that is already acquired. If the blocked thread suspends, another thread on that processor may be able to take over. But that adds a context switch on top of the lock check. Mutual exclusion (mutex) mechanisms and other atomic instructions cause cache flushes on remote processors, or lock the entire memory bus. These costs are hidden when doing performance analysis since they don’t affect the requestor, and are not attributable to any single instruction on the remote processor (other than as apparently “spontaneous” cache misses). Even storage barriers used in out of order memory access modes can add overhead from cache effects.
    • Avoid the need for shared data. Either partition the data so it is exclusively used by one processor, or hand off entire blocks of data to a processor so the data has a single-writer at any one time.
    • Minimize the number of interactions with shared objects, as each interaction bears an overhead. This relates to the (TBD) agglomeration policy mechanism discussed above.
  • Even computations that only read data owned and written by another thread must be guarded if the access is concurrent.
    • In some cases managing a replica of the data for remote readers will simplify programming (removing the need for the guards), and have other benefits similar to double buffering such as removing a serial sequencing.
  • Guarding disjoint blocks of code with a critical section mutex mechanism is error prone because the coordination code is not collocated. Overly conservative exclusion will impact performance. Surprising interactions (side-effects or reading shared objects) can lead to errors/races.
    • Avoid code-based mutual exclusion (critical sections) for commonly/widely accessed systems. If it is the “only” approach, consider centralizing it (e.g. single reader and writer), as opposed to requiring almost every method of an object to grab the mutex.
    • Consider whether the object being guarded must be a singleton or can be replicated per thread (in which case, no exclusion is required).
    • Don’t confuse thread exclusion with reentrancy.
  • Locks on data or code blocks that require more than one resource to be locked can lead to deadlocks, or priority inversion or self-blocking. If the deadlock is of low probability, it may not be observed in testing. Even though some consider it easy to detect and fix a deadlock, they are very risky since it is so hard to guarantee they don’t exist.
    • If you find yourself resolving conflicts from multiple locks, it is time to redesign the system.
  • Lock contention tends not to degrade gracefully in terms of performance and fairness/starvation. More sophisticated locking, such as a ticket mutex, is required for fairness. Simpler locks consume even more resources, mask their caller’s logical idleness, and cause memory performance side effects as threads busily contend for the lock.
    • Avoid high contention resources. Duplicate them, or rethink whether they are as shared as you think.
    • In the spirit of CSP (Communicating Sequential Processes), assign the ownership of the resource to a single thread and use message communication to access it.
    • Before devising a more sophisticated and special purpose mechanism or data structure to address the high contention, reconsider the larger problem. Often a more coarse approach using CSP will fit better with the rest of the system. The task to be parallelized may not turn out to significantly contribute to overall performance. Optimize globally.
  • Scheduling preemption should not be relied on to provide performance. Each forced context switch is an unwanted overhead. Near 100% of the CPU can be used by a single thread that has enough load assigned to it. Preemption should only be used for rarely invoked latency sensitive functions like I/O or for long running low priority background processing where the preemption rate can be very low.
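
The decompose-then-agglomerate policy above can be sketched as follows. This is a minimal illustration in Python; the batch-per-core policy and the helper names are assumptions for this example, not a prescribed design:

```python
import os
import threading
from queue import Queue

def agglomerate(tasks, worker_count):
    """Group many small tasks into one large contiguous batch per worker,
    so each thread runs a long sequential computation with no mid-stream
    context switches or shared-state coordination."""
    if not tasks:
        return []
    size = -(-len(tasks) // worker_count)  # ceiling division
    return [tasks[i:i + size] for i in range(0, len(tasks), size)]

def run_batches(tasks):
    workers = max(1, os.cpu_count() or 1)  # roughly one thread per core
    results = Queue()  # the only shared object; touched once per batch

    def worker(batch):
        # Long sequential computation: no locks taken while computing.
        results.put(sum(t * t for t in batch))

    threads = [threading.Thread(target=worker, args=(b,))
               for b in agglomerate(list(tasks), workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total = 0
    while not results.empty():
        total += results.get()
    return total
```

Note that each thread touches the shared queue exactly once, after its entire batch is done, so contention does not grow with the number of tasks.
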
Hoare’s Communicating Sequential Processes (CSP) can be used to solve any concurrency problem (any concurrent computation can be realized/reimplemented as CSP) and has a very easily understood mental model. A logical process (LP) contains its own state variables and performs computation concurrently to other LPs. LPs interact only via messages whose content and ownership is atomically transferred as it is sent.

By leaving the computation sequential, and avoiding all concurrency within each sequential process, algorithm development returns to the familiar Von Neumann Architecture. This approach is better for our customers as we do not require them to be concurrency experts when coding. It is better for Emergent since not all the developers need to be aware of the concurrency consequences in various systems they are less familiar with. Performance analysis becomes trivial, since algorithms are sequential, and message queuing can be analyzed to inspect workload. Critical paths and idealized parallelism analysis of the communication graph can be used to determine if a better load balance is possible, or if the problem itself needs to be further decomposed to realize a performance gain, or if more processors would improve performance.
  • A thread is treated as a CSP logical process. While it may share a single address space with other threads, by policy it does not share any writeable (and ideally, any readable) data.
  • The only point of concurrency is the message queue. A message with its content belongs to one logical process xor another.
  • A large amount of data that needs to be “shared” among threads is handed off in a message. Copying can be avoided by using a pointer in the message to the data being transferred and adopting the policy that the data “belongs” to the message once it is attached, and belongs to the target logical process once it arrives. This effectively uses SMP shared memory as a communication medium.
  • The principles of CSP apply equally well in a multi-threaded shared-memory environment, in a multi-process SMP environment, on NUMA hardware, or in a distributed processing environment. This future-proofs our software and allows reconfiguration and performance tuning by changing policies without rewriting code. This accommodates changes in application design and load as the application evolves.
  • Minimizing the number of LPs reduces context switch overhead but requires better load balancing algorithms. A good static balance based on the expected application behavior can extract quite a lot of CPU capability, especially when the workload does not fluctuate very much over time. This should be a design goal even without considering concurrency to avoid frame rate jitter. Dynamic load balancing can use recent history to predict a better near term balance. However, to do this there is work-migration overhead. This takes the form of per-task “global” scheduling onto available processors (less desirable as this is a continuing cost), or periodic analysis, extracting and transferring workload between LPs. Note that in most cases it is not worth the effort of eking out the last amount of performance with sophisticated load balancing techniques.
  • Mapping similar work to an LP increases the likelihood of instruction cache hits. So when agglomerating small tasks, create large lists of similar tasks as opposed to doling them out round-robin.
  • Always optimize at the end. Getting perfect parallelism out of a system that is only 10% of the application computation is probably a waste of effort.
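
The LP pattern above can be sketched as a thread with a private mailbox. This is a minimal illustration; the class name and handler signature are invented for this example:

```python
import threading
from queue import Queue

class LogicalProcess:
    """A CSP-style logical process: private state, a mailbox as the
    only point of concurrency, and messages whose ownership transfers
    atomically when enqueued (the sender must not touch them again)."""

    def __init__(self, handler):
        self.state = {}            # owned exclusively by this LP's thread
        self.mailbox = Queue()     # the only shared object
        self._handler = handler
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            msg = self.mailbox.get()   # blocks idle: no spinning
            if msg is None:            # sentinel: shut down
                break
            self._handler(self.state, msg)

    def send(self, msg):
        # By policy, the caller relinquishes ownership of msg here.
        self.mailbox.put(msg)

    def stop(self):
        self.mailbox.put(None)
        self._thread.join()
```

Because all state mutation happens on the LP's own thread, the handler is ordinary sequential code; `state` is only safe to read from outside after `stop()` joins the thread.
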
Other Concerns
Note that strict CSP suffers from communication overhead. For systems which are extremely fine-grained and latency sensitive, a custom approach might be considered if the extra work and risk are justified. But it should be sufficiently encapsulated to avoid having its tradeoffs unknowingly cause performance side effects elsewhere.

Concurrency for utility reasons like background loading can also benefit from techniques used for high performance concurrency, but being less performance sensitive, the benefit is primarily in reuse of common facilities and ease of development.

The CSP-focused approach and infrastructure can be reused trivially to implement most kinds of parallelism: functional parallelism, data parallelism, master/worker thread parallelism, pipelined parallelism, etc. It is not appropriate for very fine-grained parallelism such as parallel loops or SIMD computation, but that should rarely be needed when considering the application as a whole.
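As one illustration, pipelined parallelism falls out of the same mailbox primitives. A minimal sketch, where the stage functions and names are invented for this example:

```python
import threading
from queue import Queue

def pipeline_stage(inbox, outbox, fn):
    """One pipeline stage as a logical process: read a message,
    transform it sequentially, hand ownership downstream."""
    def run():
        while True:
            msg = inbox.get()
            if msg is None:          # propagate shutdown downstream
                outbox.put(None)
                break
            outbox.put(fn(msg))
    t = threading.Thread(target=run)
    t.start()
    return t

def run_pipeline(items, stages):
    # Chain queues so each value flows through every stage in order.
    queues = [Queue() for _ in range(len(stages) + 1)]
    threads = [pipeline_stage(queues[i], queues[i + 1], fn)
               for i, fn in enumerate(stages)]
    for item in items:
        queues[0].put(item)
    queues[0].put(None)
    out, results = queues[-1], []
    while True:
        msg = out.get()
        if msg is None:
            break
        results.append(msg)
    for t in threads:
        t.join()
    return results
```

The FIFO queues preserve ordering, and each stage remains purely sequential; all stages run concurrently on different items.
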

Sunday, February 1, 2009

The Simulator is Authoritative: DB-Centric Sucks

I have seen several persistent online games that chose to rely on the game state persistence database to manage the concurrency that results from multiple simulators which are needed for scale. Their idea is to use DB abort-retry semantics to serialize all access to the Entity states. This sort of works and is pretty easy to implement, but introduces too many problems to justify it.

The first problem to overcome is that doing a read-modify-write through the DB for every state change would be prohibitively slow. Each such change would need to be a single transaction; otherwise you'd have race conditions when multiple simulators access the same Entity concurrently. (See our comments on single-writer.) I suspect DB-centric implementations wind up adopting a single-writer policy to avoid this problem, which leads to the question of why be DB-centric in the first place.
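To make the race concrete, here is what a correct read-modify-write looks like when it is forced into a single transaction. This is a sketch using SQLite; the table and column names are invented for illustration, and a real shard DB would differ:

```python
import sqlite3

def setup(path=":memory:"):
    db = sqlite3.connect(path, isolation_level=None)  # manage txns manually
    db.execute("CREATE TABLE entity (id INTEGER PRIMARY KEY, hp INTEGER)")
    db.execute("INSERT INTO entity VALUES (1, 100)")
    return db

def apply_damage(db, entity_id, amount):
    """Read-modify-write as ONE transaction. BEGIN IMMEDIATE takes the
    write lock up front, so a concurrent simulator blocks (or aborts and
    retries) instead of reading a stale hp value."""
    db.execute("BEGIN IMMEDIATE")
    try:
        (hp,) = db.execute(
            "SELECT hp FROM entity WHERE id = ?", (entity_id,)).fetchone()
        db.execute("UPDATE entity SET hp = ? WHERE id = ?",
                   (hp - amount, entity_id))
        db.execute("COMMIT")
    except Exception:
        db.execute("ROLLBACK")
        raise
```

Paying this transaction cost on *every* Property change is exactly the overhead the DB-centric design signs up for.
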

There are a couple of tantalizing benefits to DB-centric that might make it seem attractive. The one I've seen most closely is that it makes Entity migration very simple: a simulator unloads the Entity; the DB becomes the only copy of the Entity; the new simulator locks the ownership of that Entity and loads it. Using the DB in this way makes race conditions on this hand-off impossible.

There are hidden costs to a DB-centric approach to migration. Supporting Entity migration through the DB means that *every* Entity state property must be persisted to the DB; otherwise the restoration onto the destination simulator will effectively "reset" the Entity. This can have significant undesirable performance implications, since data that is not needed for longer-term persistence (e.g. when the player logs back in next weekend) must still be written through to the DB in case the Entity migrates. It also means that an NPC or other Entity that is not normally persisted across a shard shutdown must be persisted anyway if it needs to be migratable. These things result in wasted space and wasted DB throughput. In my experience, DB throughput is the limiting factor in scaling a shard. Please, please, run screaming from DB-centric!

Another justification is that if a simulator crashes (and don't fool yourself, they *all* do!), very little work is lost because every change is written through. But consider the complication of cross-simulator transactions: for Entities on different simulators to interact, every DB interaction in the cluster must be serialized, and that can get super-slow. I judge that for almost everything, players won't quit over losing a few minutes of game play.

At this point the DB-centric guys object and say: well, actually we don't pay the write-to-DB round trip, we use a distributed database with local caching in-memory for high-performance and more immediate local access. The problem there is that if the local in-memory DB crashes, you lose data anyway. And worse, the data that was persisted to disk may not be a shard-wide consistent snapshot of the Entities' states.

I have observed many DB failures (mainly user error or hardware failure), even with high-cost Oracle installations. A DB admin does a query, or a developer writes some "custom" query, and it has unexpected performance implications, or an unsuspected table deadlock that only shows up at full load. The DB engine detects this deadlock (after a while), but there is a significant hiccup. A DB-centric game shard locks up completely, since the simulators are not authoritative. Conversely, a simulator-centric architecture allows the shard to keep running even with the DB shut down! This is an MMO operator's dream come true and can be used for things like defragmenting table space or other maintenance like backups.

Note that DB latency would live on the critical path of responsiveness in many cases. We've also put unnecessary extra load on the DB. Given that DB response times can be quite variable when there is a mixture of different query types, this imposes difficult constraints on the rest of the system.

Bottom line: make the Simulator authoritative even over the game state database. The DB is just a backing-store for use when the shard restarts or a player logs back in. The DB holds just the part of the world that needs to persist. This obeys a rule of thumb for good scaling, particularly in distributed systems: make performance controlled by configuration, not by the application data. In this case, the rate of persistence and the resulting amount of lost game play on a simulator crash is controlled by a configuration file and is tunable no matter what the load on the shard. A DB-centric approach is at the mercy of the number of players, number of Entities, the rate of change of Properties determined by Entity Behavior scripts and all kinds of other things that are definitely not independently tunable.
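
As a closing illustration, the "performance by configuration" rule might look like the following write-behind sketch, where the flush interval comes from configuration. All names here are assumptions for this example:

```python
import threading

class WriteBehindPersister:
    """Simulator-authoritative persistence: the simulator owns the live
    state; a background loop writes dirty entities to the backing store
    at a configured interval. Work lost on a crash is bounded by config,
    not by player load."""

    def __init__(self, store, flush_interval_sec):
        self.store = store                            # stand-in for the DB
        self.flush_interval_sec = flush_interval_sec  # from a config file
        self._dirty = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def mark_dirty(self, entity_id, state):
        with self._lock:
            self._dirty[entity_id] = state            # coalesces repeated changes

    def flush(self):
        with self._lock:
            batch, self._dirty = self._dirty, {}
        self.store.update(batch)                      # one batched DB write

    def run(self):
        # Flush at the configured rate until stopped, then once more.
        while not self._stop.wait(self.flush_interval_sec):
            self.flush()
        self.flush()

    def stop(self):
        self._stop.set()
```

Because repeated changes to the same Entity coalesce in the dirty set, DB throughput is governed by the configured interval rather than by how fast Behavior scripts mutate Properties.
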