Benchmarks

I have speed reports for djbfft 0.76 on

a 450MHz Intel Pentium III under egcs 2.91.66,
a 400MHz Intel Pentium II under gcc 2.7.2.3,
a 296MHz Sun UltraSPARC II under egcs 2.91.66,
a 167MHz Sun UltraSPARC I under egcs 2.91.66,
a 133MHz Intel Pentium under gcc 2.7.2.1,
a 33MHz Intel 486 DX/2 under gcc 2.7.2.1, and
a 25MHz Sun 4/40 under egcs 2.91.66.

In each case the compiler options are the default options in the djbfft installation: -O1 -fomit-frame-pointer, with -malign-double added automatically on the x86 processors.

I also have some speed reports for djbfft 0.75 under alternate compilers:

a 240MHz HP PA-8200 under HP-UX B.11.0 cc +O2 -Dinline and
a 240MHz HP PA-8200 under HP-UX B.11.0 cc +O3 +Oall -Dinline.

Contents of the speed reports

Codes used in the reports:

r: Real transform.
c: Complex transform.
4: Single-precision transform.
8: Double-precision transform.
+: Forward DFT.
-: Inverse DFT.
m: Multiplication. Convolution against a precomputed filter takes one forward DFT, one multiplication, and one inverse DFT.
s: Scaling. Precomputation of a filter takes one forward DFT and one scaling.
nothing: No computation. This shows the overhead of the tick-counting mechanism.
RDTSC: Tick counts are obtained from the Pentium cycle counter.
gethrtime: Tick counts are obtained from the Solaris gethrtime() nanosecond counter.
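
As a reading aid, here is a minimal sketch of how a report code could be expanded into words. The helper describe_code is hypothetical and not part of djbfft. It assumes the three-character layout seen in codes such as r8-: transform type, then precision, then operation.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of djbfft: expand a three-character
   report code such as "r8-" into words. The layout (type, precision,
   operation) is an assumption inferred from the sample report line.
   Returns 0 on success, -1 if the code is not recognized. */
int describe_code(const char *code, char *out, size_t outlen)
{
    const char *kind;
    const char *prec;
    const char *op;

    if (strlen(code) != 3) return -1;

    switch (code[0]) {
    case 'r': kind = "real"; break;
    case 'c': kind = "complex"; break;
    default: return -1;
    }

    switch (code[1]) {
    case '4': prec = "single-precision"; break;
    case '8': prec = "double-precision"; break;
    default: return -1;
    }

    switch (code[2]) {
    case '+': op = "forward DFT"; break;
    case '-': op = "inverse DFT"; break;
    case 'm': op = "multiplication"; break;
    case 's': op = "scaling"; break;
    default: return -1;
    }

    snprintf(out, outlen, "%s %s %s", prec, kind, op);
    return 0;
}
```

Under these assumptions, describe_code("r8-", buf, sizeof buf) fills buf with "double-precision real inverse DFT".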

Each line shows the individual tick counts for eight iterations of the routine being benchmarked. The first iteration is normally slower than the rest: instructions may not be in cache (or even in memory), inputs may not be in cache, etc. The first few iterations may wobble a bit because of branch prediction hysteresis. For inputs larger than cache, all the iterations will usually have different speeds. Individual iterations may occasionally be much slower if the operating system happens to perform a context switch.
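
The measurement pattern described above can be sketched roughly as follows. This is not the djbfft benchmarking code. The names tick_source, wall_ticks, nothing, and time_iterations are hypothetical, and the tick source is a stand-in: it uses C11 timespec_get() for wall-clock nanoseconds where the real reports use RDTSC or gethrtime().

```c
#include <time.h>

#define ITERATIONS 8

/* A tick source returns the current tick count. */
typedef long long (*tick_source)(void);

/* Stand-in tick source (an assumption, not what djbfft uses):
   wall-clock nanoseconds from C11 timespec_get(), playing the
   role that gethrtime() plays in the Solaris reports. */
long long wall_ticks(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* The "nothing" routine: no computation, so timing it shows only
   the overhead of the tick-counting mechanism. */
void nothing(void)
{
}

/* Time each of the eight iterations separately, so that a slow
   first iteration or a context switch shows up as an individual
   outlier instead of being averaged away. */
void time_iterations(tick_source now, void (*routine)(void),
                     long long counts[ITERATIONS])
{
    int i;

    for (i = 0; i < ITERATIONS; ++i) {
        long long start = now();
        routine();
        counts[i] = now() - start;
    }
}
```

Calling time_iterations(wall_ticks, nothing, counts) and printing the eight entries of counts on one line would give something shaped like the "nothing" line of a report, in nanoseconds instead of cycles.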

For example, the Pentium-133 lines

     Using RDTSC, pentium/*.c.
        nothing      27      17      17      18      17      17      17      18
        256 r8-   11288    8127    8102    8102    8102    8102    8102    8102

show that a 256-point in-cache double-precision real inverse DFT normally takes 8102 Pentium cycles, including a tiny amount of timing overhead (about 17 cycles, as the "nothing" line shows).
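
To make the arithmetic concrete, here is a small sketch that picks a typical value out of one report line and converts it to time. The functions typical_ticks and cycles_to_microseconds are hypothetical, and taking the median of the eight counts is my choice for this illustration, not something the reports prescribe. At a nominal 133MHz, 8102 cycles come to roughly 61 microseconds.

```c
#include <stdlib.h>
#include <string.h>

#define ITERATIONS 8

/* Comparison function for qsort() on long values. */
static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a;
    long y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Hypothetical helper: median of the eight tick counts on a report
   line. The median ignores the slow first iteration and any single
   context-switch outlier. */
long typical_ticks(const long counts[ITERATIONS])
{
    long sorted[ITERATIONS];

    memcpy(sorted, counts, sizeof sorted);
    qsort(sorted, ITERATIONS, sizeof sorted[0], cmp_long);

    /* Even number of samples: average the two middle values. */
    return (sorted[ITERATIONS / 2 - 1] + sorted[ITERATIONS / 2]) / 2;
}

/* Convert a cycle count to microseconds, given the clock rate in MHz
   (1 MHz is one cycle per microsecond). */
double cycles_to_microseconds(long cycles, double clock_mhz)
{
    return (double)cycles / clock_mhz;
}
```

Applied to the two Pentium-133 lines above, typical_ticks gives 8102 for the r8- line and 17 for the "nothing" line, so the transform itself accounts for about 8085 cycles.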

Notes on previous versions of djbfft

19970916: First version of djbfft. I wrote this code to prove to the FFTW authors that a simple split-radix FFT would run faster than their complicated code on a Pentium. My unscheduled code, 86 lines long, did a size-256 single-precision transform in about 35000 Pentium cycles, faster than FFTW. A few days later, after some casual instruction scheduling, I had the time down to about 24000 Pentium cycles.

19971116: djbfft 0.50. About 23000 Pentium cycles for a size-256 double-precision transform. I was still learning about the Pentium FPU at this point.

19971218: djbfft 0.55. About 20000 Pentium cycles. New in this version: inverse transforms.

19971226: djbfft 0.60. About 20000 Pentium cycles. New in this version: simultaneous support for single precision and double precision.

19980923: djbfft 0.70. About 18000 Pentium cycles, or 15000 UltraSPARC-I cycles. New in this version: multiplication routines to support complex convolution and real convolution.

19990914: djbfft 0.75. About 17000 Pentium cycles, or 6300 UltraSPARC-I cycles, or 13000 Pentium-II cycles. New in this version: real FFTs and UltraSPARC tuning.

19990930: djbfft 0.76. About 17000 Pentium cycles, or 6300 UltraSPARC-I cycles, or 12000 Pentium-II cycles. New in this version: some Pentium-II tuning.