
While the CPU is waiting for data to load from RAM, is the operating system smart enough to give it a different task to execute?


The overhead of task switching is too great for that to be useful, plus the OS would probably need to talk to RAM as part of the whole process anyway.

However, this is part of what hyperthreading accomplishes. The OS gives the CPU two tasks ahead of time, then when one task stalls, the CPU can switch over to the other one and work on it for a while.


This is actually what hyperthreading is all about: cache misses. I missed that in the article. There are more things missing, actually, but I guess it would be too much to explain it all in a single article. Things like caches, coherence protocols, prefetching, memory disambiguation. Registers are also much more complex, because you have things like register renaming, result forwarding, etc. In the end there are simply far fewer registers than memory locations; that's why you can build registers that are faster than memory.


I thought hyperthreading was able to go beyond this, and e.g. execute the two streams in parallel if one is hitting the FPU and the other is doing integer work, even if neither one is stalled.

And you're right, it's missing a lot because I'm writing an article, not a book. It is fun to explore details, but ultimately you have to stop somewhere.


That was the impression I had too, but even so, I can see how "this is actually what hyperthreading is all about" would make sense. Two streams of code are unlikely to have long segments of just-FPU and just-integer work respectively, and it's even more unlikely that those segments will happen to align during execution. It happens, sure, but the gains would be smallish.

On the other hand, long periods of no cache misses followed by long periods of waiting after a cache miss are exactly what you expect from real code (especially optimized code). So I'd think that you'd have much bigger gains from that. The same goes for branch misprediction.


Well, the gains are smallish. Real-world gains from hyperthreading are on the order of 10-20% when you load up a CPU with two threads.


Yeah, but when I said "smallish" I was thinking more on the order of 1%. I would consider 10% actual gains to be quite large given the craziness of what Hyperthreading tries to accomplish.


It may also be a matter of more fully utilizing multiple integer/floating-point units. Say, if the CPU has two integer units but the current code is only using up one of them, then it could run the second hyperthread on the other. I really don't know the details though.


Yes, hyperthreading (aka SMT), as implemented in Intel's processors, can execute instructions from several threads in the same clock cycle. Other processors, like Sun's Niagara, switch threads only on certain events like cache misses (this is known as SoEMT). Workloads with a lot of cache misses are where both really shine.

Of course it's hard to write about a complex topic, choose the right details, and make it all seem simple. Thumbs up for trying!


Glad you asked ;-) It's called Hyper-threading [1] and works best when the scheduler is aware of it [2]. Provided in some of Intel's CPUs (2 threads per core) and in Sun's (later Oracle's) UltraSPARC T1...T5 (8 threads per core).

[1] http://en.wikipedia.org/wiki/Hyper-threading

[2] for example, CONFIG_SCHED_SMT in Linux


In addition to what others said about the overhead of context switching just for a DRAM access stall (at today's DRAM latencies, which are ~200 to 400 cycles), there's an architectural issue with the idea, too. Consider that from software's point of view, missing the cache and going to DRAM is "invisible": it happens as part of executing a single instruction. Software doesn't know the cache miss happened; architecturally, the result of the memory load is the same whether it came from cache or DRAM. So to allow the OS to do something clever, the processor would have to define a way of notifying the software that a cache miss occurred, probably by raising an exception and aborting the instruction, to be resumed later (like a page fault). So it would take a nontrivial amount of effort by CPU architects to enable such an OS feature.

Interestingly, there is at least one academic proposal to do something like this [1], but I'm not aware of any real implementations.

[1] http://dl.acm.org/citation.cfm?id=891494


The OS does not. That's the job of the out-of-order architecture (load prediction and reordering of later, non-dependent instructions).

And when that fails, it's exactly the use case for hyperthreading.



