We can now see how an Entity can have a local cache populated with all the remote Entity proxies it needs by using Interest Management and data distribution over a publish and subscribe message system.
One tough problem remains. What do you do about race conditions when two simulators modify the same Property on an Entity at around the same time? A little digging reveals a decent solution: resolve such writes at a central location and distribute them from there, so all consumers see the same ordering. However, allowing multiple writers to a single Property can still lead to inconsistencies that require sophisticated transaction management.
The easier approach is to disallow multiple writers. Ensure that all properties of an Entity are modified only by that Entity. Any other Entity that wants to make a change must send a request. This boils down to "Communicating Sequential Processes", a well-understood computer science paradigm. Normally the Entity stays on one host for a good period of time, and the Entity is said to be Owned by the associated simulator.
The owning Simulator is said to be the authoritative simulator. All computation that affects that Entity is performed on the owning simulator. The values it computes are pushed out to other interested simulators where they become proxies/reflections/replicas, and are read-only.
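To make the single-writer rule concrete, here is a minimal sketch of an authoritative Entity pushing property changes to read-only proxies. The names (Entity, Proxy, publish_update) and the shape of the API are my own illustration, not the actual system being described.

```python
# Single-writer sketch: only the owning simulator mutates an Entity;
# other simulators hold read-only Proxy replicas updated by pushed messages.
# All names here are illustrative assumptions, not a real API.

class Entity:
    """Lives on exactly one (authoritative) simulator."""
    def __init__(self, entity_id, publish_update):
        self.entity_id = entity_id
        self.properties = {}
        self._publish_update = publish_update  # e.g. send on this Entity's Category

    def set_property(self, name, value):
        # The only place a property is ever written.
        self.properties[name] = value
        self._publish_update(self.entity_id, name, value)


class Proxy:
    """Read-only replica of a remote Entity on an interested simulator."""
    def __init__(self, entity_id):
        self.entity_id = entity_id
        self.properties = {}

    def apply_update(self, name, value):
        # Called only by the message system when a pushed update arrives.
        self.properties[name] = value
```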
The single writer paradigm allows a junior game content developer to remain blissfully unaware of concurrency. They think about one Entity at a time. An interaction with another Entity is not trying to read or write to a concurrently evolving set of variables. Instead it is sending a request to the other Entity, which will eventually get around to handling the request sequentially. The developer can think in single-threaded terms. Yay! In fact, the simulator is also made single-threaded so there can be no mistakes (note this still leaves ways to make good use of multiple cores).
The behavior that is running on an Entity is able to immediately read any of the Properties of Entities in which it has already expressed an interest. Since the simulator is single threaded, this can be done without locks. The properties of the proxies are only updated when the message system is ticked, and since the simulator is single-threaded, that is done after the Entities are done executing. Note that because we use state push, the property values of the proxies have the lowest latency *possible*. We can also apply latency hiding techniques to further improve the proxy's estimate of the value on the authoritative simulator.
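A rough sketch of that single-threaded loop: behaviors run to completion against stable proxy values, and only afterwards is the message system ticked to apply incoming updates and deliver queued requests. The method names (run_behavior, drain_updates, drain_requests, handle_request) are assumptions for illustration.

```python
def simulator_tick(owned_entities, proxies, message_system):
    # Phase 1: run every owned Entity's behavior sequentially. Proxies are
    # not mutated during this phase, so behaviors can read them without locks.
    for entity in owned_entities.values():
        entity.run_behavior()

    # Phase 2: tick the message system. Pushed property updates are applied
    # to local proxies, and queued requests are delivered to owned Entities.
    for update in message_system.drain_updates():      # assumed API
        proxies[update.entity_id].apply_update(update.name, update.value)
    for request in message_system.drain_requests():    # assumed API
        owned_entities[request.target_id].handle_request(request)
```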
All this results in an accurate and familiar representation of a computing environment that appears to have all Entities on the same machine. But since it is actually distributed, its performance will scale. The distributed nature is abstracted away without burdening the developer.
If you are thinking about multi-entity transactions, you'll have to wait for it...
Monday, December 8, 2008
Publish/Subscribe Message Delivery
In a previous post, I argued that publish/subscribe was the only tricky thing needed for a totally flexible interest management based online game system.
Publish/subscribe (producer/consumer) message systems give semantics similar to multicast. A producer sends a message to a channel. All current consumers on that channel receive a copy of that message. To avoid degenerating into broadcast (where every consumer receives every message sent), the messages are divided among channels using a Category, one per channel (so you can think of it as a channel id). A Category is an integer, so it is trivial to deal with at the lower level (as opposed to strings or something). For simplicity, each message is sent to only one Category.
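As a concrete illustration, a data message only needs to carry its integer Category plus an opaque payload; the Category is the entire addressing scheme. This is a hedged sketch, not a wire format from the post.

```python
from dataclasses import dataclass

@dataclass
class Message:
    category: int   # the integer channel id; the only routing information
    payload: bytes  # opaque application data

# A producer publishes each Message to exactly one Category; every consumer
# currently subscribed to that Category receives a copy.
```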
This system is very loosely coupled, giving it a lot of flexibility and extensibility. A producer does not need to know about the existence of any of the consumers. The set of consumers and their implementations can change without touching the producer. For example, a logging system could be attached to a channel without affecting the rest of the system, and would give good data for debugging.
To implement the publish/subscribe system efficiently, we must be careful about how producer and consumer subscription requests are handled. Broadcasting each consumer's interest in some Category to every producer is too inefficient. So we introduce the notion of a channel manager that keeps track of the interests of all producers and consumers.
The channel manager is responsible for redistributing each data message. A producer sends a message to the channel manager. The channel manager maintains the list of interested consumers, and forwards a copy of the producer's message to each consumer. We have exchanged the non-scalable broadcast of subscription messages for an extra hop of latency for each data message.
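A minimal channel manager sketch along these lines: it records which consumers subscribed to each Category and forwards every data message to them. The send_to callback stands in for whatever transport delivers to a consumer's host; it is an assumed helper, not part of the original design.

```python
from collections import defaultdict

class ChannelManager:
    def __init__(self, send_to):
        self.subscribers = defaultdict(set)  # category -> set of consumer ids
        self.send_to = send_to               # transport callback (assumed)

    def subscribe(self, consumer_id, category):
        self.subscribers[category].add(consumer_id)

    def unsubscribe(self, consumer_id, category):
        self.subscribers[category].discard(consumer_id)

    def publish(self, category, payload):
        # One extra hop: producer -> channel manager -> each interested consumer.
        for consumer_id in self.subscribers[category]:
            self.send_to(consumer_id, category, payload)
```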
The channel manager can easily be made scalable. The simplest approach is to use the integer Category value and a simple modulus operation to load balance across any number of channel manager processes. Both producers and consumers use the same computation, so all subscription messages and all data messages for one Category travel through a single channel manager.
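The load-balancing rule is just the Category modulo the number of channel manager processes; producers and consumers compute it independently and always agree. A small sketch, with made-up manager addresses:

```python
def manager_for_category(category: int, manager_addresses):
    # Producers and consumers run this same computation independently, so all
    # subscription and data messages for one Category meet at one manager.
    return manager_addresses[category % len(manager_addresses)]

# Example: with 4 managers, Categories 1, 5, 9, ... all map to index 1.
managers = ["mgr-0", "mgr-1", "mgr-2", "mgr-3"]
assert manager_for_category(9, managers) == "mgr-1"
```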
This architecture is the obvious one. There are more sophisticated approaches that can reduce the two-hop latency by using direct connections between producers and consumers. The subscription messages still need to route through a channel manager, but the producers now need to maintain the list of interested consumers. This adds the requirement that producers subscribe to produce, and adds more subscription messages and more latency on each subscription. There are also subtle data message ordering problems.
If you want to go nuts, you could use real multicast. The challenge there is that there is a limited number of multicast groups, so you have to solve the problem of multiple channels sharing one multicast group.
So you get to choose. Easy implementation or optimized but tricky implementation. Like most code. In this case I argue that the simple approach has good enough performance for the needs of online games. The producers and channel manager live in a data center on hosts attached to a high speed switch, so network latency is minuscule.
The design philosophy of this system is to minimize unnecessary computation due to unwanted messages arriving on a host that are just thrown away. Hosts cost money. Bandwidth inside the data center is free. So good interest management is key.
So. We have sliced off the publish/subscribe problem. All we have left is how to approach interest management policies which are application specific.
Labels: Interest Management, Load balancing, Scalability
Wednesday, December 3, 2008
Where are memory buses going?
In the '80s and '90s almost all supercomputers were distributed memory systems: hypercubes, meshes, and a very few SIMD machines. Any concept of a shared address space was simulated. Where there was limited support for remote memory access, it went through slow and complex transport systems (the BBN Butterfly is one example).
Recently we see desktop machines with "many" cores. For ease of use, these are symmetric multiprocessors. Each processor is equally able to access any address. The interconnect is referred to as a bus, even when not physically implemented that way. There are sophisticated cache coherency mechanisms and inter-processor synchronization instructions which "lock the bus", or invalidate remote cache lines to make it possible to have atomic access to a line of memory for at least one operation (e.g. atomic increment or swap).
But these approaches don't scale (in the computer science sense). Even a "bus" that is a token ring or other network-like transport can only scale so far. Maybe 32 processors. I've seen SGI Origin 2000 and Sun Dragon machines (admittedly in the late 90's) that scaled this large and were still (mostly) symmetric. They used what amounted to a packet switched network and distributed systems techniques to provide atomicity and coherency.
Regardless, the most efficient use of these machines, as determined by empirical study (and common sense), was to segregate the memory disjointly among the processors. This made the caches more effective since each processor touched less of the address space (avoiding issues with Translation Lookaside Buffers), and it avoided synchronization issues. One must keep in mind that doing an atomic operation, even when there is no current contention, can dramatically affect N-1 other processors because the operation flushes the bus or remote cache lines, etc. In the end, we tended not to make use of the symmetry aspects.
So people now talk a lot about Non-Uniform Memory Access. For example, blocks of RAM are tightly associated with a processor or a small number of cores, but there is also a "global" bus that allows access to the entire address space of the machine. So you have the appearance of a multiprocessor machine, but some addresses are a lot slower to access. The right way to use these machines is identical to what we used to do: have disjoint blocks of memory per processor (or per tightly coupled set of cores).
What is interesting about this evolution is that you can see it is moving hardware architecture toward a distributed computing model. The memory "buses" themselves are networks that can have multiple in-flight packets containing memory access requests. Some will have routing or bridging between different disjoint buses/networks within the one machine. But to effectively use this architecture it must be programmed as a distributed system.
Fortunately, we know how to do that. You use a collection of processes (distinct address spaces), and pass messages (optimized to use the high speed memory access bus/net). Communicating Sequential Processes. The beauty here is that such a software system can much more easily be tuned and reconfigured than a "monolithic" multithreaded application as hardware specs change (more processors, different local/remote memory access speeds...).
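A toy illustration of that style in Python: separate processes (distinct address spaces) exchanging messages over queues rather than sharing data structures. The worker logic is made up purely for the example.

```python
from multiprocessing import Process, Queue

def worker(inbox: Queue, outbox: Queue):
    # A sequential process: it owns its own state and communicates only
    # by receiving and sending messages.
    total = 0
    while True:
        msg = inbox.get()
        if msg is None:          # sentinel: shut down and report the result
            outbox.put(total)
            return
        total += msg

if __name__ == "__main__":
    inbox, outbox = Queue(), Queue()
    p = Process(target=worker, args=(inbox, outbox))
    p.start()
    for value in (1, 2, 3):
        inbox.put(value)
    inbox.put(None)
    print(outbox.get())          # prints 6
    p.join()
```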
If you step back another step and think about physics, you can also easily convince yourself that this evolution is permanent. How much compute power can fit into a cubic block of space? It is limited by distance, heat, complexity density... The only way to "grow" that computing power will eventually be to increase the amount of space consumed. In terms of computer science scalability (i.e. taking it to the extreme), that space grows as the cube of its radius. So we can see that, at best, communication distance (the radius to the furthest part of the one computer) and delay, factoring in the speed of light, will grow linearly while computing power grows as the cube. Thus the scaling is eventually dominated by communication. Direct communication and *no* synchronization would give the best performance. So we can conclude that distributed memory systems connected by a network (even if it acts like a memory bus) will provide optimal performance.
That is where we are going, and I say we should develop our software with that in mind. Threading seems like a good idea, but it is really a cheap hack that allows two separate processes to share a bunch of data structures. Eventually those shared data structures will have to be synchronized over a longer distance, so let's start doing that now (e.g. create duplicates, watch for edits, and send out updates). Using application-specific knowledge, this can be done *much* more efficiently than a symmetric memory system manages, since it ships every changed byte and more.
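A sketch of the "watch for edits and send out updates" idea: track which fields changed since the last flush and send only those, instead of mirroring every byte the way a shared-memory system effectively must. The names and the send_update callback are illustrative assumptions.

```python
class ReplicatedObject:
    def __init__(self, object_id, send_update):
        self.object_id = object_id
        self.state = {}
        self.dirty = set()
        self.send_update = send_update   # assumed transport callback

    def set(self, field, value):
        if self.state.get(field) != value:
            self.state[field] = value
            self.dirty.add(field)        # remember the edit

    def flush(self):
        # Send only the fields that actually changed since the last flush.
        if self.dirty:
            delta = {f: self.state[f] for f in self.dirty}
            self.send_update(self.object_id, delta)
            self.dirty.clear()
```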
The connection to online games? I've just described a distributed object replication system. And that is the basis of the scalable online game architecture I've been outlining the whole time.