Wednesday, December 3, 2008

Where are memory buses going?

In the '80s and '90s almost all supercomputers were distributed memory systems: hypercubes, meshes, and a very few SIMD machines. Any concept of a shared address space was simulated. Where remote memory access was supported at all, it was limited and went through slow, complex transport systems (the BBN Butterfly is one example).

Recently we see desktop machines with "many" cores. For ease of use, these are symmetric multiprocessors: each processor is equally able to access any address. The interconnect is referred to as a bus, even when it is not physically implemented that way. There are sophisticated cache coherency mechanisms and inter-processor synchronization instructions which "lock the bus" or invalidate remote cache lines, making it possible to access a line of memory atomically for at least one operation (e.g. an atomic increment or swap).
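To make that concrete, here is a minimal sketch (my own illustration in Go, not anything from the original post) of several cores hammering one shared counter with atomic increments; every increment forces a core to take exclusive ownership of the counter's cache line before it can proceed:

```go
// Sketch: several workers atomically incrementing one shared counter.
// Under the hood the CPU takes exclusive ownership of the cache line
// (or, historically, locks the bus) for the duration of each operation.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var counter int64
	var wg sync.WaitGroup

	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 100000; i++ {
				atomic.AddInt64(&counter, 1) // atomic increment; every core contends for this one line
			}
		}()
	}
	wg.Wait()
	fmt.Println(counter) // 800000
}
```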

But these approaches don't scale (in the computer science sense). Even a "bus" that is really a token ring or other network-like transport can only scale so far, maybe to 32 processors. I've seen SGI Origin 2000 and Sun Dragon machines (admittedly in the late '90s) that scaled this large and were still (mostly) symmetric. They used what amounted to a packet-switched network and distributed systems techniques to provide atomicity and coherency.

Regardless, the most efficient use of these machines, as determined by empirical study (and common sense), was to segregate the memory disjointly among the processors. This made the caches more effective, since each processor touched a smaller portion of the address space (easing pressure on the Translation Lookaside Buffers), and it avoided synchronization issues. Keep in mind that an atomic operation, even when there is no current contention, can dramatically affect the other N-1 processors, because the operation stalls the bus or invalidates remote cache lines. In the end, we tended not to make use of the symmetry at all.

So people now talk a lot about Non-Uniform Memory Access (NUMA). Blocks of RAM are tightly associated with a processor or a small set of cores, but there is also a "global" bus that allows access to the entire address space of the machine. You get the appearance of a symmetric multiprocessor, but some addresses are a lot slower to access. The right way to use these machines is identical to what we used to do: have disjoint blocks of memory per processor (or per tightly coupled set of cores).
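As a sketch of the practice described in the last two paragraphs (again my own Go illustration, not from the original post), here each worker owns a disjoint, cache-line-padded counter; nothing is locked, and only the final combine step reads another worker's memory:

```go
// Sketch: the same total count as above, but each worker owns a disjoint
// counter, padded so counters never share a cache line. All updates stay
// local; there is no bus locking and no cache-line ping-pong.
package main

import (
	"fmt"
	"sync"
)

type paddedCounter struct {
	n int64
	_ [56]byte // pad to a full 64-byte cache line to avoid false sharing
}

func main() {
	const workers = 8
	counters := make([]paddedCounter, workers)
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < 100000; i++ {
				counters[id].n++ // purely local memory traffic
			}
		}(w)
	}
	wg.Wait()

	var total int64
	for i := range counters {
		total += counters[i].n
	}
	fmt.Println(total) // 800000
}
```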

What is interesting about this evolution is that it is moving hardware architecture toward a distributed computing model. The memory "buses" themselves are networks that can have multiple in-flight packets containing memory access requests. Some machines will have routing or bridging between disjoint buses/networks within the one machine. But to use this architecture effectively, it must be programmed as a distributed system.

Fortunately, we know how to do that. You use a collection of processes (distinct address spaces) and pass messages (optimized to use the high-speed memory bus/network): Communicating Sequential Processes. The beauty here is that such a software system can be tuned and reconfigured much more easily than a "monolithic" multithreaded application as the hardware specs change (more processors, different local/remote memory access speeds...).
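Go's channels are one concrete realization of the CSP style; the sketch below is my own illustration (the "owner" process and the request type are hypothetical). One process is the sole holder of some data, and the only way anyone else sees it is by sending a message:

```go
// Sketch: Communicating Sequential Processes. Each worker owns its data
// outright; all access from outside goes through messages.
package main

import "fmt"

type request struct {
	key   string
	reply chan int
}

// owner is the sole holder of the map: no locks, no shared mutable state.
func owner(requests <-chan request) {
	data := map[string]int{"gold": 100, "wood": 40}
	for req := range requests {
		req.reply <- data[req.key]
	}
}

func main() {
	requests := make(chan request)
	go owner(requests)

	reply := make(chan int)
	requests <- request{key: "gold", reply: reply}
	fmt.Println(<-reply) // 100
}
```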

If you step back another step and think about physics, you can also convince yourself that this evolution is permanent. How much compute power can fit into a cubic block of space? It is limited by distance, heat, complexity, density... Eventually, the only way to grow that computing power will be to increase the amount of space consumed. In terms of computer science scalability (i.e. taking it to the extreme), the available space grows as the cube of the machine's radius, so computing power can at best grow as the cube, while the communication distance (the radius to the furthest part of the one computer), and therefore the delay at the speed of light, grows only linearly. Relative to local work, the cost of reaching across the machine keeps rising, so scaling is eventually dominated by communication. Direct, local communication and *no* synchronization gives the best performance. So we can conclude that distributed memory systems connected by a network (even if it acts like a memory bus) will provide optimal performance.
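Written out as a back-of-the-envelope formula (my own sketch of the argument, with r the machine's physical radius, c the speed of light, and P its aggregate compute power):

```latex
% Compute power can grow with the volume available for hardware,
% while the worst-case one-way signal delay grows with the radius.
P(r) \propto r^{3}, \qquad
t_{\mathrm{comm}}(r) \ge \frac{r}{c} \propto P(r)^{1/3}
```

So any step that must reach across the whole machine costs time on the order of at least P^(1/3), no matter how fast the individual cores become.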

That is where we are going, and I say we should develop our software with that in mind. Threading seems like a good idea, but it is really a cheap hack that allows two separate processes to share a bunch of data structures. Eventually those shared data structures will have to be synchronized over a longer distance, so let's start doing that now (e.g. create duplicates, watch for edits, and send out updates). Using application-specific knowledge, this can be done *much* more efficiently than a symmetric memory system does it, since that sends every changed byte and more.
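Here is a hedged sketch of what "create duplicates, watch for edits and send out updates" might look like; the PlayerState type and its fields are hypothetical, purely for illustration:

```go
// Sketch: application-level replication. The owner tracks which fields of
// an object were edited and ships only those fields, instead of relying on
// the memory system to propagate every changed cache line.
package main

import "fmt"

// PlayerState is a hypothetical replicated object.
type PlayerState struct {
	X, Y   float64
	Health int
	dirty  map[string]interface{} // fields edited since the last update
}

func NewPlayerState() *PlayerState {
	return &PlayerState{dirty: make(map[string]interface{})}
}

// SetHealth edits a field and records the edit for later replication.
func (p *PlayerState) SetHealth(h int) {
	p.Health = h
	p.dirty["Health"] = h
}

// FlushUpdates returns the pending delta and clears the dirty set.
// In a real system this delta would be sent to every replica holder.
func (p *PlayerState) FlushUpdates() map[string]interface{} {
	delta := p.dirty
	p.dirty = make(map[string]interface{})
	return delta
}

func main() {
	p := NewPlayerState()
	p.SetHealth(90)
	fmt.Println(p.FlushUpdates()) // map[Health:90] -- only the edited field travels
}
```

Only the edited field is shipped; the application knows that nothing else changed, which a cache coherency protocol can never know.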

The connection to online games? I've just described a distributed object replication system. And that is the basis of the scalable online game architecture I've been outlining the whole time.
