forkrun's NUMA approach is really largely based on the idea that, as you said, "real workloads mostly wait for memory". The waiting for memory gets worse in NUMA because accessing memory from a different chiplet or a different socket requires accessing data that is physically farther from the CPU and thus has higher latency. forkrun takes a somewhat unique approach in dealing with this: instead of taking data in, putting it somewhere, and reshuffling it around based on demand, forkrun immediately puts it on the correct numa node's memory when it comes in. This creates a NUMA-striped global data memfd. on NUMA forkrun duplicates most of its machinery (indexer+scanner+worker pool) per node, and each node's machinery is only offered chunks from the global data memfd that are already on node-local memory.
This directly aims to solve (or at least reduce the effect from) "CPUs waiting for memory" on NUMA systems, where the wait (if memory has to cross sockets) can be substantial.
forkrun's NUMA approach is really largely based on the idea that, as you said, "real workloads mostly wait for memory". The waiting for memory gets worse in NUMA because accessing memory from a different chiplet or a different socket requires accessing data that is physically farther from the CPU and thus has higher latency. forkrun takes a somewhat unique approach in dealing with this: instead of taking data in, putting it somewhere, and reshuffling it around based on demand, forkrun immediately puts it on the correct numa node's memory when it comes in. This creates a NUMA-striped global data memfd. on NUMA forkrun duplicates most of its machinery (indexer+scanner+worker pool) per node, and each node's machinery is only offered chunks from the global data memfd that are already on node-local memory.
This directly aims to solve (or at least reduce the effect from) "CPUs waiting for memory" on NUMA systems, where the wait (if memory has to cross sockets) can be substantial.