Saturday, May 19, 2012

Why I don't like threads

People say I'm crazy because I don't like threads in this day and age of multicore processors. But I have good reason based on years of experience, so hear me out.

Doing threading well is harder than most people think. The decision to multithread your application imposes that complexity on every developer who touches it, including those less experienced with such things than the ones normally making that decision. I like to think that framework and server developers are much more experienced with such things than the game developers who might be plugging stuff into the server. And framework developers are building their framework to be used by those gameplay programmers. Play to your audience. Wouldn't it be better if they never had to know about threading issues? I believe it is possible to build a framework that insulates regular developers from such concerns (the communicating sequential processes model).
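
To make that concrete, here is a minimal sketch of my own (not from the post, and every name in it is made up) of the kind of interface such a framework might expose to gameplay programmers. In this toy version send() dispatches immediately on the calling thread; the point is only that gameplay code registers plain sequential handlers and never sees a lock or a thread.

```cpp
#include <cstdio>
#include <functional>
#include <map>
#include <string>

struct Message { std::string type; std::string payload; };
using EntityId = unsigned long;

// Toy framework: a real one would queue the message and deliver it on
// whatever thread (or machine) owns the target entity, one message at a
// time per entity. The gameplay-facing API would look the same.
class Framework {
public:
    void on(const std::string& type,
            std::function<void(EntityId, const Message&)> handler) {
        handlers_[type] = std::move(handler);
    }
    void send(EntityId to, Message msg) {
        auto it = handlers_.find(msg.type);
        if (it != handlers_.end()) it->second(to, msg);
    }
private:
    std::map<std::string, std::function<void(EntityId, const Message&)>> handlers_;
};

int main() {
    Framework fw;
    // Gameplay code: no mutexes, no threads in sight.
    fw.on("take_damage", [](EntityId self, const Message& m) {
        // Safe to touch this entity's state; handlers for a given entity
        // are promised to run one at a time.
        std::printf("entity %lu took %s damage\n", self, m.payload.c_str());
    });
    fw.send(42, Message{"take_damage", "12"});
}
```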

Now I'm a big fan of using threading for what it is good for: dealing with interrupts that should be serviced ASAP without polluting your main application with lots of polling and checking. Examples of this would be a background thread for servicing the network (you want to drain the network buffers quickly so they don't overflow, back things up, and cause retransmissions); background file or keyboard I/O, which only rarely needs to wake up to service an incoming or outgoing I/O buffer; and remote requests that block for a long time and would otherwise stall the app (like a DB request or an HTTP request). Note in particular that none of these are high-performance computations. They are dominated by blocking/waiting time. The use of a thread in this case is really all about an easier programming model. The background thread can be written as a simple loop that does blocking reads or writes, and since that blocking happens off the main thread, the main thread doesn't have checks and polling sprinkled through it.
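
As an illustration of that "blocking loop on a background thread" shape, here is a minimal C++11 sketch of my own (not from the post), using keyboard input as the blocking source; a network reader would look the same with a blocking recv() in place of getline().

```cpp
#include <condition_variable>
#include <deque>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>

// A tiny blocking queue for handing lines from the background thread to the
// main loop.
class LineQueue {
public:
    void push(std::string line) {
        { std::lock_guard<std::mutex> lock(m_); q_.push_back(std::move(line)); }
        cv_.notify_one();
    }
    std::string pop() {                        // blocks until something arrives
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        std::string line = std::move(q_.front());
        q_.pop_front();
        return line;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<std::string> q_;
};

int main() {
    LineQueue incoming;

    // Background thread: a plain blocking loop. It blocks on I/O so the main
    // loop never has to poll or check for readiness.
    std::thread reader([&incoming] {
        std::string line;
        while (std::getline(std::cin, line)) incoming.push(line);
    });

    for (;;) {
        std::string line = incoming.pop();     // main loop just handles messages
        std::cout << "handling: " << line << "\n";
    }
}
```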

Most of the time, when you have some heavy computation to do, it will eventually scale up to require more than a single machine anyway. So you are going to have to write your system to be distributed and communicate between machines using messages anyway. Once you have done that work, you can easily use it within a single machine that has many cores. If you build one system that uses threads to exploit the many cores within a machine, and another that solves the distributed case, you've doubled your work and maintenance effort. One of the best ways to decompose a problem for worker threads is to deliver the work into a task queue and have the workers contend for it. As each task is pulled off, it is processed by a handler function. This is exactly the same approach you would use for the distributed case. So the only difference is whether the messages pass between processes (on the same or different machines) or between threads. Yes, I understand the performance difference. But if your app is that performance sensitive, the inter-process message passing can be implemented over shared memory, avoiding kernel transitions when delivering a message within the same machine. The intent here is to save you from the double implementation and to save your framework users from having to deal with thread programming, while the remaining performance difference is pretty small.
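
A rough sketch of that worker/task-queue decomposition, again mine and not from the post; Task and handle() stand in for whatever your real work items and handler are. The thing to notice is that handle() knows nothing about threads, so the same handler could just as well sit behind a socket and process task messages arriving from another process.

```cpp
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// A simple blocking task queue that worker threads contend for.
template <typename T>
class TaskQueue {
public:
    void push(T task) {
        { std::lock_guard<std::mutex> lock(m_); q_.push_back(std::move(task)); }
        cv_.notify_one();
    }
    T pop() {                                    // blocks until a task arrives
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T task = std::move(q_.front());
        q_.pop_front();
        return task;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<T> q_;
};

struct Task { int id; };                         // placeholder work item

void handle(const Task& t) {                     // placeholder handler: the same
    std::printf("handled task %d\n", t.id);      // code would run in a separate
}                                                // process in the distributed case

int main() {
    TaskQueue<Task> queue;
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([&queue] { for (;;) handle(queue.pop()); });

    for (int id = 0; id < 100; ++id) queue.push(Task{id});

    for (auto& w : workers) w.join();            // sketch only: no shutdown message
}                                                // is sent, so this never returns
```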

There is also a big problem with heavily threaded apps in production: the tools for figuring out thread-related performance problems are pretty lousy. When you are only dealing with background interrupt-handling threads, there are not going to be serious performance problems unless one of the background threads starts polling wildly and consuming 100% CPU. But if a highly multithreaded app starts using too much CPU, or starts being unresponsive, how do you tell what is actually happening among the various threads? They don't have names, and the kernel isn't very good about helping you keep track of what work happens on each thread. You wind up having to build your own instrumentation into the application every time there is such a problem, and doing a new build and getting it into production is a lot of effort. On the other hand, if you follow the distributed model, you can easily see which process is spiking CPU. You can easily instrument the message flows between processes to see if there is too much or too little inter-process communication. Often you wind up logging all such traffic for post-mortem analysis anyway. Remember, you are not likely to have the luxury of attaching a debugger to a production process to grab stack traces or whatnot. So you are going to be staring at a monolithic multithreaded app and trying to guess what is going on inside.
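
For what it's worth, the message-flow instrumentation mentioned above can be as cheap as a few counters dumped to the log on a timer. A hypothetical sketch (the message types and the interval are made up, not from the post):

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

// Hypothetical message types; replace with whatever your protocol defines.
enum MsgType { MSG_MOVE, MSG_CHAT, MSG_DB_REPLY, MSG_TYPE_COUNT };

std::atomic<unsigned long> g_counts[MSG_TYPE_COUNT];   // zero-initialized globals

// Call this from wherever messages are dispatched or sent.
inline void count_message(MsgType t) {
    g_counts[t].fetch_add(1, std::memory_order_relaxed);
}

int main() {
    // Background logger: every 10 seconds, write the counts and reset them,
    // so the message-rate history ends up in the logs for post-mortems.
    std::thread logger([] {
        for (;;) {
            std::this_thread::sleep_for(std::chrono::seconds(10));
            for (int t = 0; t < MSG_TYPE_COUNT; ++t) {
                unsigned long n = g_counts[t].exchange(0, std::memory_order_relaxed);
                std::printf("msg_type=%d count=%lu per_10s\n", t, n);
            }
            std::fflush(stdout);
        }
    });

    // Stand-in for the real dispatch loop: generate some fake traffic.
    for (;;) {
        count_message(MSG_MOVE);
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }
}
```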

Problems in threaded apps tend to be subtle, they wind up being hard to debug, and they often only show up after the app has been running at production loads for quite a while. Writing good threaded software is the responsibility of every programmer who touches an app that adopts it, and the least experienced programmer on that app is the one you have to worry about. Operating and debugging the app in production is not easy. These are the reasons I don't like threads. I'm not afraid of them. I understand them all too well. I think there are better ways (CSP) that are actually easier and faster to develop for in the first place. And you are likely to have to adopt those ways in any case as you scale beyond a single machine.

More thoughts on this subject here: http://onlinegametechniques.blogspot.com/2009/02/manifesto-of-multithreading-for-high.html

(Any position statement like this is going to sound a little nuts if you try to apply it to every conceivable situation. What I was thinking about when I wrote this was a server application; in particular, something large scale and event oriented. If, for example, you are writing a graphical client on a 360, using multiple processes would be loony. Multiple processes listening on the same socket can be a problem (with some workarounds). You might not be able to abide even a shared-memory message-passing delay between components, like in a rendering pipeline. Interestingly, these examples are all amenable to the same analysis: what response time is required, what resources are being shared, how much will the computation have to scale up, and is the physical hardware intrinsically distributed anyway? My point is that the default answer should be to encapsulate any threading you *have to* do, so that the bulk of your development doesn't have to pay the daily overhead of always asking: is that line of code accessing anything shared; is this data structure thread safe? It slows down important conversations, and it leaves a shadow of doubt everywhere.)

5 comments:

  1. Completely agree with this. BTW, have you looked at Grand Central Dispatch at all? Their model is similar to what you describe -- there is no direct shared memory. All interthread communication happens via message passing. (Possibly the library is hiding the sharing of the message object from you, but it seems to do a reasonable job of it.)

    The D Programming Language also advocates this model, and they require shared memory to be immutable (!).

  2. One of the first big changes Todd and I made here was to collapse a two-thread server model down to one. Not only was it much simpler to deal with, but it was *also* faster!

  3. I'm coming to be a big fan of job queues. Properly architected, a job-queue system can run with as many threads as you want, and you can then tune your application for optimal performance under real-world conditions.

  4. You want to avoid concurrency within a single Entity so you don't have to lock your own data, and so you can guarantee event ordering. So you want job queues that understand the consumers like that. Essentially providing CSP semantics. Of course, you can tune the number of processes up and down the same way.

  5. Threading can certainly be a challenge to do right. Resource locking, message passing, copying data between caches associated with different cores, and context switches from too many threads can all take a toll on performance. A good single-threaded design could easily do better than a poor multi-threaded design.

    - More threads than processor cores is generally a bad idea.

    - Spinlocks may be a faster option than a lock that causes a context switch. This depends on lock conflicts being the exception.

    - Having the right granularity on locks: coarse enough to minimize the number of locks to be processed, fine enough to make lock contention uncommon.

    - Don't pass messages between threads carelessly. Anything other than persistent state data should probably be kept local. Job queues can be in conflict with this.

    - If your design ever requires multiple locks at the same time by a thread, the Dining Philosophers problem makes things far more complicated. Consider a redesign.
