The "frequency divided by four" misconception

One of the favorite misconceptions about our UltraSPARC T1/T2 systems is the "Each core runs at 1.2 GHz, so divide that by four threads and each thread only gets 300 MHz" meme. This meme was introduced by our beloved competitors in the market, like HP or IBM. At first this sounds quite intuitive, but do you really think that such a processor would be the foundation of the fastest single-socket SAP system? Denis Sheahan explains the misconception behind this FUD in "Lessons learned from T1":

This line of argument doesn't hold because most commercial code chases pointers and is constantly loading data structures. On average a commercial application stalls every 100 instructions for a variety of reasons such as TLB miss, I cache miss, Level 2 cache miss etc. When a thread stalls it is usually delayed for many cycles, an Icache miss for instance is 23 cycles. So even though a thread is running at 1.2GHz it usually spends 70% of its time stalled. This is why major processor manufacturers create ever deeper out-of-order pipelines in an effort to avoid this stall.
All this stalling is perfect for CMT. The hardware automatically switches out a thread when it stalls and shares its cycles amongst the other 3 threads on the pipeline, masking the stall. With this technique we can utilize the pipeline 75% - 80% of the time, provided there are enough threads to absorb the stall.
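To make the argument a bit more tangible, here is a small toy simulation in Python. It is only a sketch under simplifying assumptions, not a model of the actual T1 thread scheduler: one pipeline issues at most one instruction per cycle, picking round-robin among the threads that are not stalled; each issued instruction stalls its thread with probability 1/100 (the quote's "stalls every 100 instructions"); and every stall lasts a fixed 233 cycles. That stall length is my own assumption, chosen so that a lone thread is busy roughly 30% of the time, matching the quoted "70% of its time stalled". The function name `pipeline_utilization` is hypothetical.

```python
import random


def pipeline_utilization(threads, stall_prob, stall_cycles, cycles, seed=1):
    """Toy CMT model: fraction of cycles in which the pipeline issued an instruction.

    Assumptions (illustrative only): one instruction per cycle, round-robin
    selection among ready threads, random stalls of a fixed length.
    """
    rng = random.Random(seed)      # seeded so runs are reproducible
    ready_at = [0] * threads       # first cycle at which each thread may issue again
    busy = 0                       # cycles in which some thread issued
    last = threads - 1             # so the first search starts at thread 0
    for cycle in range(cycles):
        # Round-robin: look for a ready thread, starting after the last one that issued.
        for offset in range(1, threads + 1):
            t = (last + offset) % threads
            if ready_at[t] <= cycle:
                busy += 1
                last = t
                # This instruction may miss (TLB, I-cache, L2, ...) and stall the thread.
                if rng.random() < stall_prob:
                    ready_at[t] = cycle + 1 + stall_cycles
                break
        # If no thread was ready, this cycle is simply lost (the pipeline idles).
    return busy / cycles


if __name__ == "__main__":
    for n in (1, 2, 4):
        u = pipeline_utilization(n, stall_prob=0.01, stall_cycles=233, cycles=200_000)
        print(f"{n} thread(s): pipeline busy {u:.0%} of the cycles")
```

With a single thread the pipeline sits idle most of the time, while with four threads the stalls of one thread are largely filled by the others. The model does not try to reproduce the measured 75-80% exactly, since it ignores variable stall lengths and threads contending for the shared caches; it only illustrates why "1.2 GHz divided by four" is the wrong way to think about it.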

Even this explanation leaves out one aspect: the cryptographic units work largely in parallel to the pipelines. Thus the computational power of one core is even higher than you would assume from the pure frequency. To put it a little simplified: for cryptographic workloads you can see this processor as a system with 64+8 threads and 8+8 cores, albeit the 8 cryptographic cores are specialized ones …