D. J. Bernstein
Computer hardware
x86 speed
Pentium timings
See
Agner Fog's Pentium optimization manual
[text copy of old version]
and
Intel's Intel Architecture Optimization Manual 242816.
Pentium Pro timings
See the manuals listed above.
Note that Pentium Pro optimization is very different from Pentium optimization.
Pentium MMX timings
The Pentium MMX is essentially the same as the Pentium.
Big exceptions:
MMX instructions;
a 16K L1 data cache;
the Pentium-Pro branch-prediction mechanism;
and no first-time-in-cache pairing restrictions.
Pentium II timings
The Pentium II is essentially the same as the Pentium Pro.
Big exceptions:
MMX instructions;
a 16K L1 data cache.
Pentium III timings
The Pentium III is essentially the same as the Pentium II.
Big exceptions:
cache prefetch instructions (welcome to the 1990s, Intel!);
SSE instructions.
See
Intel's Intel Architecture Optimization Reference Manual 245127
for MMX and SSE information.
Beware that SSE uses new registers that need to be saved in context switches;
SSE code will fail sporadically on older operating systems.
Pentium 4 timings
The Pentium 4 has a similar feel to the Pentium III,
plus SSE2 instructions.
However, the internal architecture is different.
Cycle counts are generally much worse than on the Pentium III,
and often even worse than on the original Pentium.
AMD K6-2 timings
See
AMD's Note 21924 (PDF).
AMD Athlon timings
See
AMD's Note 22007 (PDF).
The Athlon L1 data cache is only two-way set-associative but is a gigantic 64K.
(This is one of the reasons that
the Athlon is much faster than the Pentium III.)
In one cycle
it can handle two 64-bit loads,
or one 64-bit load and one 64-bit store,
or two 32-bit stores.
It has a first-level TLB
with 24 entries for 4K pages and 8 entries for large pages,
and a second-level four-way TLB with 256 entries for 4K pages.
The Athlon can do an FADD and an FMUL, along with two loads, every cycle,
if the code is properly scheduled.
(This is another of the reasons that
the Athlon is much faster than the Pentium III.)
Both FADD and FMUL have latency 4.
For example, the code
f = x[1]; f *= y[4]; r5 += f;
f = x[1]; f *= y[5]; r6 += f;
f = x[1]; f *= y[6]; r7 += f;
f = x[1]; f *= y[7]; r8 += f;
f = x[2]; f *= y[3]; r5 += f;
f = x[2]; f *= y[4]; r6 += f;
...
takes 1 cycle per line
if the 8 instruction bytes in each line
(3 for FLD with 8-bit displacement,
3 for FMUL with 8-bit displacement,
2 for FADDP) are aligned to an 8-byte boundary.
The same code takes 1.5 cycles per line if the instructions are not aligned.
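As a rough C sketch of the same access pattern: each product x[i]*y[j] is added to the sum for index i+j, and four separate sums are kept in flight. Presumably this is what lets the latency-4 FADD keep up, since each sum is touched only once every four lines. The function name mac4 and its parameters are hypothetical, chosen for illustration; a C compiler will not necessarily emit the FLD/FMUL/FADDP sequence or the alignment discussed above.

```c
#include <assert.h>

/* Hypothetical helper, not from the original page.
   Computes r[t] = sum over i in [i0, i1] of x[i] * y[k + t - i], t = 0..3.
   With k = 5 this matches the listing: x[1]*y[4] goes to r5, x[1]*y[5] to r6,
   and so on. The caller must ensure that k - i1 >= 0 and that k + 3 - i0
   is a valid index of y. */
void mac4(const double *x, const double *y, int i0, int i1, int k, double r[4])
{
    double r0 = 0.0, r1 = 0.0, r2 = 0.0, r3 = 0.0;

    assert(i0 <= i1);
    assert(k - i1 >= 0);

    for (int i = i0; i <= i1; i++) {
        const double *yy = y + (k - i);
        double f;
        /* Four independent accumulations: no line depends on the line
           immediately before it. */
        f = x[i]; f *= yy[0]; r0 += f;
        f = x[i]; f *= yy[1]; r1 += f;
        f = x[i]; f *= yy[2]; r2 += f;
        f = x[i]; f *= yy[3]; r3 += f;
    }

    r[0] = r0; r[1] = r1; r[2] = r2; r[3] = r3;
}
```

Calling mac4(x, y, 1, 5, 5, r) on 8-element arrays fills r[0..3] with the sums called r5 through r8 in the listing.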
Julian Ruhe suggests padding floating-point instructions with REP
to hit 8-byte boundaries;
an Athlon assembler could easily take care of this.
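The arithmetic such an assembler would need is small. The sketch below assumes that each padding prefix occupies exactly one byte and that the assembler knows the byte offset of the next instruction group; the function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: number of one-byte prefixes to add so that the code
   following byte offset `offset` starts on an 8-byte boundary.
   Returns 0 if the offset is already aligned. */
unsigned pad_to_8(unsigned long offset)
{
    return (unsigned)((8 - (offset & 7)) & 7);
}

/* Hypothetical helper: the offset after padding has been applied. */
unsigned long padded_offset(unsigned long offset)
{
    unsigned long result = offset + pad_to_8(offset);
    assert(result % 8 == 0);
    return result;
}
```

For example, a group that would otherwise start at offset 13 needs 3 bytes of padding spread over the preceding instructions, so that it starts at offset 16.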
The Athlon does an excellent job of reordering operations.
(This is another of the reasons that
the Athlon is much faster than the Pentium III.)
Cycle counters
The Pentium line and the Athlon
have built-in 64-bit cycle counters,
measuring time since boot.
To read the cycle counter, use machine-language bytes 15 and 49
(hexadecimal 0F 31, the RDTSC instruction);
the low 32 bits of the result are put into EAX and the high 32 bits into EDX.
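A minimal sketch of reading the counter from C, assuming GCC or Clang extended inline assembly on an x86 or x86-64 target. The function names are hypothetical. The two bytes are emitted directly, so no assembler support for the instruction mnemonic is needed. The sketch makes no attempt to serialize execution, so on chips that reorder instructions the measured interval is approximate.

```c
#include <assert.h>
#include <stdint.h>

/* Reads the 64-bit cycle counter using the two bytes 15, 49 (hex 0F 31).
   EAX receives the low 32 bits and EDX the high 32 bits.
   Assumes GCC or Clang on x86 or x86-64. */
uint64_t read_cycle_counter(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__(".byte 15, 49" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: cycles elapsed while fn runs, including the
   overhead of the two counter reads themselves. */
uint64_t cycles_for(void (*fn)(void))
{
    uint64_t start, end;

    assert(fn != 0);
    start = read_cycle_counter();
    fn();
    end = read_cycle_counter();
    return end - start;
}
```

When timing short code, it helps to run cycles_for several times and take the minimum or median, since cache misses and interrupts inflate individual samples.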
Code measurement tools
Intel's VTune Analyzer
includes a Pentium simulator and a Pentium II simulator,
but it isn't free.
A usable simulator is a tremendous asset
for programmers trying to identify bottlenecks in speed-critical code.
Every CPU company has simulators for its chips;
it amazes me that these simulators aren't released for free.
Other sources of information
The
Pentium Compiler Group
has a Pentium-optimized version of gcc;
their
documentation page
has some links to x86 chip information.
For more links try
Paul Hsieh's page.
For an introduction to programming using the x86 see
Randall Hyde's Art of Assembly Language Programming.