Om nom NUMA

While tuning BIOS settings on my old workstation PC with two CPU sockets, I noticed a setting I hadn’t touched in a long time, if ever: the “memory interleaving mode” could be set to either SMP or NUMA, and it was set to SMP.

What… what does that mean? SMP is Symmetric Multi-Processing, and I’ve heard that term since ages past in Linux-land for handling multiple CPUs and cores. NUMA is just some server thing, right?

So… NUMA is Non-Uniform Memory Access, and here it refers specifically to the fact that each CPU socket has its own direct connection to its share of system memory, which is slightly faster than reaching that same memory through the other CPU socket.

With the BIOS’s memory interleave mode set to NUMA, that speed differential is exposed to the operating system: the memory attached to each processor gets its own separate region of the physical address space. That means the OS can try to keep each process’s memory and CPU time together on one socket for maximum speed, only slowing down if a given process outgrows what fits on one socket. Cool right?

Meanwhile, with it set to SMP, memory is interleaved across both sockets, so any given piece of memory might be fast, or it might be slow. Lame right?
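
For what it’s worth, you can ask the OS which layout it ended up seeing. Here’s a quick C sketch of mine using libnuma (purely my own illustration, nothing official) that just prints how many memory nodes the kernel knows about and how big they are:

    /* numa_peek.c -- my own quick illustration, not from any BIOS docs.
       Build with: gcc numa_peek.c -o numa_peek -lnuma */
    #include <stdio.h>
    #include <numa.h>

    int main(void) {
        if (numa_available() < 0) {
            printf("Kernel reports no NUMA support here.\n");
            return 0;
        }
        int nodes = numa_max_node() + 1;
        printf("%d memory node(s) visible to the OS\n", nodes);
        for (int n = 0; n < nodes; n++) {
            long long free_bytes;
            long long size = numa_node_size64(n, &free_bytes);
            printf("  node %d: %lld MB total, %lld MB free\n",
                   n, size >> 20, free_bytes >> 20);
        }
        return 0;
    }

With the BIOS in NUMA mode I’d expect to see two nodes here, one per socket; in SMP/interleaved mode, just one big node.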

So.

I tried it for fun, on Linux and Windows both, and at first didn’t see much difference. Valgrind’s “cachegrind” tool confirmed that the things I was testing (PHP in a tight interpreter loop, or PNG compression) were mostly working in-cache, so memory latency wasn’t a big deal, and memory bandwidth was nooooowhere near being saturated.

Then I found a case where NUMA mode fell down badly: a multithreaded app with more threads than there are physical CPU cores on a single socket.

My PNG compression tests at 1, 2, 4, or 16 threads ran about as fast in SMP mode as in NUMA mode. But at 8 threads, NUMA mode took a big dip in speed.

Since I have 2x quad-core processors with Hyper-Threading, the layout is:

  • 2 sockets
    • 4 cores per socket
      • 2 threads per core

SMP mode assigns threads to logical processors like this:

  • 1 thread – runs on either socket
  • 2 threads – one on each socket
  • 4 threads – two on each socket, on separate cores
  • 8 threads – four on each socket, on separate cores
  • 16 threads – eight on each socket, packing all cores’ threads

NUMA mode prefers to group them together, because they’re all in a single process:

  • 1 thread – runs on either socket
  • 2 threads – two on one socket, on separate cores
  • 4 threads – four on one socket, on separate cores
  • 8 threads – eight on one socket, packing all cores’ threads
  • 16 threads – eight on each socket, packing all cores’ threads

Now we see the problem! Because Hyper-Threading shares execution resources between the two threads of each core, you don’t get as much work done with both threads of each core packed as you would spreading the same threads across more physical cores.
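
If I really wanted to stay in NUMA mode, one workaround would be to pin threads myself instead of trusting the scheduler’s grouping. Here’s a rough sketch (my own illustration, not my actual benchmark code, and the CPU numbering is an assumption that varies from machine to machine) using pthread affinity to spread 8 workers one per physical core:

    /* pin_spread.c -- rough illustration only, not my actual benchmark.
       Build with: gcc pin_spread.c -o pin_spread -pthread */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NTHREADS 8

    /* Assumption: logical CPUs 0-7 are the first hardware thread of each of the
       8 physical cores. Check the sysfs topology files above before trusting this. */
    static const int cpu_for_worker[NTHREADS] = {0, 1, 2, 3, 4, 5, 6, 7};

    static void *worker(void *arg) {
        int id = *(int *)arg;
        /* ...real work (e.g. PNG compression) would go here... */
        printf("worker %d on logical CPU %d\n", id, sched_getcpu());
        return NULL;
    }

    int main(void) {
        pthread_t threads[NTHREADS];
        int ids[NTHREADS];

        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;

            /* Pin each worker to one logical CPU before it starts running. */
            pthread_attr_t attr;
            pthread_attr_init(&attr);
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(cpu_for_worker[i], &set);
            pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

            pthread_create(&threads[i], NULL, worker, &ids[i]);
            pthread_attr_destroy(&attr);
        }
        for (int i = 0; i < NTHREADS; i++) {
            pthread_join(threads[i], NULL);
        }
        return 0;
    }

Under that numbering assumption, the 8 workers land on 8 distinct physical cores instead of doubling up on one socket’s hyper-threads.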

If I hadn’t had Hyper-Threading on, 8 threads would’ve scaled better, but then I couldn’t run 16 (for a sllliiiggghhhtt extra speed boost) anymore.

Putting it back in SMP mode for now. For a system that’s always under full load, NUMA is theoretically superior, but it interacts weirdly with Hyper-Threading for my up-and-down desktop/workstation workload.