Tuesday, May 13, 2008

Multi-threading: the minefield is ahead

You think you have mastered multi-threaded programming? CPU makers are pushed to improve CPU efficiency, without paying much attention to the software behind, which has to cope with the new features. To put it clearly, CPU makers are working on your next traps:
  1. memory access swapping
  2. instruction reordering
  3. asymmetric cores and unfair memory access

A read instruction is much faster than a write, and the memory bus cannot transfer in both directions at the same time. Memory access, even a read, is still slower than basic CPU instructions. To keep the CPU busy, it may be necessary to swap some memory accesses: for instance, if the CPU has enough data in its cache to keep running without further memory access, it is a good time to perform an expensive write, even if reads were scheduled first. In addition, some instructions can be reordered to optimize the processing pipeline and take advantage of the co-processors.

These arrangements inside the CPU are managed by a code analyzer which detects the commutativity of instruction sequences and the independence between variables. This analyzer works in a single-thread context, so it is up to the developer to identify multi-thread issues and to forbid the CPU from rearranging some parts of the code, by synchronizing code and protecting data accesses.
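For instance (a minimal sketch in the same symbolic style; the names x, y and update are mine), the two stores below have no data dependency, so the analyzer may legally swap them:

int x = 0;
int y = 0;

void update() {
    x = 1;  // no dependency between these two stores, so...
    y = 2;  // ...the CPU may legally perform this one first
}

Single-thread semantics are preserved either way, but another thread observing y == 2 cannot safely conclude that x == 1.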

Examples are numerous; here is a simple one, in a symbolic language:

boolean ready = false;
int result = 0;

thread1:
while (!ready) {
    Thread.yield();    // busy-wait until the flag is set
}
print(result);         // consume the result

thread2:
result = 11;           // produce the result
ready = true;          // signal that the result is ready

The human reader quickly understands the purpose of this program: thread2 produces a result and toggles a boolean when it is done, while thread1 waits for the result to be ready before consuming it. But without any protection, this program can print 11, print 0, or even print nothing at all, because thread1 loops forever!

That's because, if you picture yourself as a single core executing thread2, you don't bother writing back the values of result and ready, since you don't need them afterward. Only a context switch forces you to flush the cache and update the memory locations of these variables.

The core running thread1 is even worse off. ready and result are not bound by any expression; they are independent, so they can be fetched in any order. In particular, if result is read before ready, the program prints 0. Also, ready, once loaded into the cache, may never be refreshed from main memory, which leads to the infinite loop.

Sometimes it runs just as the human expected, and the program prints 11. But that's a lucky execution, actually...
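In Java, for instance, declaring ready as volatile is enough to defuse this particular mine: since Java 5, a write to a volatile field happens-before every subsequent read of that field that sees it, so the earlier write to result becomes visible as well, and the waiting loop is guaranteed to terminate. A minimal sketch (the class name Handshake is mine):

public class Handshake {
    static volatile boolean ready = false; // volatile: no stale caching, no reordering across it
    static int result = 0;

    public static void main(String[] args) {
        Thread thread1 = new Thread(new Runnable() {
            public void run() {
                while (!ready) {              // eventually sees thread2's write
                    Thread.yield();
                }
                System.out.println(result);   // guaranteed to print 11
            }
        });
        Thread thread2 = new Thread(new Runnable() {
            public void run() {
                result = 11;   // ordinary write...
                ready = true;  // ...published by the volatile write that follows it
            }
        });
        thread1.start();
        thread2.start();
    }
}

Note that result itself does not need to be volatile here: the single volatile flag orders both writes.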

The last mine in our field, for the moment, is the asymmetric architecture: one core has some privileges, runs faster, has more cache, or gets a higher priority on the memory bus. It's a nice idea: an important thread can run faster than the others. But once again the software is far behind. OS kernels are usually SMP (symmetric multi-processing) and dispatch processes fairly across the processors, which is exactly what we don't want. The same goes for the higher levels.

Assuming we get the ability to pin an important thread to the best core, the developer still has to rethink the multi-thread logic to optimize the core assignments. There is still a good deal of development time to go...
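For the record, standard Java currently exposes no core-affinity API at all: the closest portable knob is a priority hint, and actual pinning has to be done at the OS level, for instance by launching the JVM under Linux's taskset or by calling sched_setaffinity from native code. A hedged sketch (the class name ImportantWorker is mine):

public class ImportantWorker {
    public static void main(String[] args) {
        Thread important = new Thread(new Runnable() {
            public void run() {
                // the latency-critical work would go here
            }
        });
        important.setPriority(Thread.MAX_PRIORITY); // a scheduling hint, not core pinning
        important.start();
    }
}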