Saturday, February 28, 2009

The Manifesto of Multithreading (for High Performance)

This document is a statement of principles that Emergent software developers adhere to concerning issues of concurrency, parallelism, multiprocessing and multithreading.
Related Docs:

  • Multi-threading on a single processor is beneficial only in rare circumstances such as when we expect a program to block repeatedly, such as doing I/O.
    • When concerned about performance, this is unlikely to be the case, so we should see only one or two threads created per core.
  • Operating system context switching between threads can be very expensive. It involves saving and loading processor state, tends to result in complete cache invalidation, and most expensively entails updating of OS process management data structures. For systems that provide virtual memory, reprogramming the MMU and swapping out kernel resources such as file pointers can make switching even more expensive. Some system calls can force a scheduling quantum to end.
    • Avoid context switching by keeping the number of threads low. Avoid system calls.
    • Decompose the problem into large segments to minimize the number of context switches and total overhead. This is most easily done by decomposing into as many pieces as possible, then using a policy to agglomerate them back into large sequential computations.
    • Reducing the number of threads will make the job of an SMP OS scheduler trivial and inexpensive. Ideally it could map a thread to a processor and leave it there indefinitely. Any load balancing needed can be accomplished by the application using app-specific knowledge without affecting the number of threads and their scheduling.
  • The measure of efficiency of using available multiprocessors is determined by how much processing time is wasted or spent idle.
    • For ease of measurement and tuning, an application should avoid consuming as much processing as possible, but should try to consume as little as possible, then idle, making inefficiencies much more visible.
    • Never use a spin lock.
    • Idle time of this sort can be filled by work scheduled to be done on processors that are already fully utilized or overloaded. This is the reason for load balancing and is one component of thread scheduling.
  • Processors interfere with each other’s efficiency when accessing shared data. This happens directly if the other processor hits a lock that is already acquired. If the blocked thread suspends, another thread on that processor may be able to take over. But that adds a context switch on top of the lock check. Mutual exclusion (mutex) mechanisms and other atomic instructions cause cache flushes on remote processors, or lock the entire memory bus. These costs are hidden when doing performance analysis since they don’t affect the requestor, and are not attributable to any single instruction on the remote processor (other than as apparently “spontaneous” cache misses). Even storage barriers used in out of order memory access modes can add overhead from cache effects.
    • Avoid the need for shared data. Either partition the data so it is exclusively used by one processor, or hand off entire blocks of data to a processor so the data has a single-writer at any one time.
    • Minimize the number of interactions with shared objects, as each interaction bears an overhead. This relates to the (TBD) agglomeration policy mechanism discussed above.
  • Even computations that only read data owned/written to by another thread concurrently must be guarded.
    • In some cases managing a replica of the data for remote readers will simplify programming (removing the need for the guards), and have other benefits similar to double buffering such as removing a serial sequencing.
  • Guarding disjoint blocks of code with a critical section mutex mechanism is error prone because the coordination code is not collocated. Overly conservative exclusion will impact performance. Surprising interactions (side-effects or reading shared objects) can lead to errors/races.
    • Avoid code-based mutual exclusion (critical sections) for commonly/widely accessed systems. If it is the “only” approach, consider centralizing it (e.g. single reader and writer), as opposed to requiring almost every method of an object to grab the mutex.
    • Consider whether the object being guarded must be a singleton or can be replicated per thread (in which case, no exclusion is required).
    • Don’t confuse thread exclusion with reentrancy.
  • Locks on data or code blocks that require more than one resource to be locked can lead to deadlocks, or priority inversion or self-blocking. If the deadlock is of low probability, it may not be observed in testing. Even though some consider it easy to detect and fix a deadlock, they are very risky since it is so hard to guarantee they don’t exist.
    • If you find yourself resolving conflicts from multiple locks, it is time to redesign the system.
  • Lock contention tends to not gracefully degrade in terms of performance and fairness/starvation. More sophisticated locking is required such as a ticket mutex for fairness. Simpler locks consume even more resources, mask their caller’s logical idleness, and cause memory performance side effects as threads busily contend for the lock.
    • Avoid high contention resources. Duplicate them, or rethink whether they are as shared as you think.
    • In the spirit of CSP (Communicating Sequential Processes), assign the ownership of the resource to a single thread and use message communication to access it.
    • Before devising a more sophisticated and special purpose mechanism or data structure to address the high contention, reconsider the larger problem. Often a more coarse approach using CSP will fit better with the rest of the system. The task to be parallelized may not turn out to significantly contribute to overall performance. Optimize globally.
  • Scheduling preemption should not be relied on to provide performance. Each forced context switch is an unwanted overhead. Near 100% of the CPU can be used by a single thread that has enough load assigned to it. Preemption should only be used for rarely invoked latency sensitive functions like I/O or for long running low priority background processing where the preemption rate can be very low.
Hoare’s Communicating Sequential Processes (CSP) can be used to solve any concurrency problem (any concurrent computation can be realized/reimplemented as CSP) and has a very easily understood mental model. A logical process (LP) contains its own state variables and performs computation concurrently to other LPs. LPs interact only via messages whose content and ownership is atomically transferred as it is sent.

By leaving the computation sequential, and avoiding all concurrency within each sequential process, algorithm development returns to the familiar Von Neumann Architecture. This approach is better for our customers as we do not require them to be concurrency experts when coding. It is better for Emergent since not all the developers need to be aware of the concurrency consequences in various systems they are less familiar with. Performance analysis becomes trivial, since algorithms are sequential, and message queuing can be analyzed to inspect workload. Critical paths and idealized parallelism analysis of the communication graph can be used to determine if a better load balance is possible, or if the problem itself needs to be further decomposed to realize a performance gain, or if more processors would improve performance.
  • A thread is treated as a CSP logic process. While it may share a single address space with other threads, by policy it does not share any writeable (and ideally, any readable) data.
  • The only point of concurrency is the message queue. A message with its content belongs to one logical process xor another.
  • A large amount of data that needs to be “shared” among threads is handed off in a message. Copying can be avoided by using a pointer in the message to the transferring data and adopting the policy that the data “belongs” to the message once it is attached, and belongs to the target logical process once it arrives. This effectively uses SMP shared memory as a communication medium.
  • The principles of CSP apply equally well in a multi-threaded shared-memory environment as in a multi-process SMP, in a NUMA or in a distributed processing environment. This future-proofs our software and allows reconfiguration and performance tuning by changing policies without rewriting code. This addresses application design and load changes as it evolves.
  • Minimizing the number of LPs reduces context switch overhead but requires better load balancing algorithms. A good static balance based on the expected application behavior can extract quite a lot of CPU capability, especially when the workload does not fluctuate very much over time. This should be a design goal even without considering concurrency to avoid frame rate jitter. Dynamic load balancing can use recent history to predict a better near term balance. However, to do this there is work-migration overhead. This takes the form of per-task “global” scheduling onto available processors (less desirable as this is a continuing cost), or periodic analysis, extracting and transferring workload between LPs. Note that in most cases it is not worth the effort of eking out the last amount of performance with sophisticated load balancing techniques.
  • Mapping similar work to an LP increases the likelihood of instruction cache hits. So when agglomerating small tasks, create large lists of similar tasks as opposed to doling them out round-robin.
  • Always optimize at the end. Getting perfect parallelism out of a system that is only 10% of the application computation is probably a waste of effort.
Other Concerns
Note that strict CSP suffers from communication overhead. For systems which are extremely fine-grained and latency sensitive, a custom approach might be considered if the extra work and risk are justified. But it should be sufficiently encapsulated to avoid having its tradeoffs unknowingly cause performance side effects elsewhere.

Concurrency for utility reasons like background loading can also benefit from techniques used for high performance concurrency, but being less performance sensitive, the benefit is primarily in reuse of common facilities and ease of development.

The CSP focused approach and infrastructure can be reused trivially to implement most kinds of parallelism: functional parallelism, data parallelism, master/worker thread parallelism, pipelined parallelism, etc. It is not appropriate for very fine grained parallelism such as parallel loops, or SIMD computation but that should be rarely needed when considering the application as a whole.

1 comment:

  1. I like your observations on a CSP approach, it is something I've been thinking about for some time. I think this is going to become more and more important, especially given the trend of increasing numbers of heterogeneous cores (i.e. CELL and Larrabeee).

    You mentioned that it is beneficial only rarely on a single processor machine. I'm wondering if the increased locality of LP state would increase performance WRT cache hits, etc.

    For example - say we limit each LP's private heap to 256k (or smaller) to mimic the SPU local store. On PC this has the added benefit that most of the LP work is done within the L2 cache, although this is probably worth benchmarking to see if it provides any benefit.

    Have you considered fibers in a Win32 based implementation rather than relying on threads? They're pretty quick to explicitly switch, although this does necessitate a custom scheduler being written (probably non-trivial).