Version | Compressed (28-byte keys) |
Uncompressed (56-byte keys) |
Chip | MHz | Compiler |
---|---|---|---|---|---|
0.75 | 678633 | 595537 | Athlon (642) | 850 | gcc 2.95.3 |
0.70 | 747418 | 644972 | Athlon (642) | 850 | gcc 2.95.3 |
0.75 | 785900 | 668566 | UltraSPARC II | 168 | egcs 2.91.66 |
0.75 | 823862 | 724776 | Pentium III (672) | 449 | egcs 2.91.66 |
0.75 | 826955 | 726922 | Pentium III (686) | 853 | gcc 2.95.3 |
0.75 | 835530 | 734731 | Pentium II (652) | 401 | gcc 2.8.1 |
0.70 | 838988 | 716298 | UltraSPARC II | 296 | egcs 2.91.66 |
0.70 | 854587 | 710991 | UltraSPARC II | 168 | egcs 2.91.66 |
0.70 | 913905 | 798360 | Pentium III (672) | 449 | egcs 2.91.66 |
0.70 | 916561 | 798847 | Pentium III (686) | 853 | gcc 2.95.3 |
0.70 | 926777 | 809653 | Pentium Pro (619) | 180 | gcc 2.95.3 |
0.70 | 928230 | 806682 | Pentium II (652) | 401 | gcc 2.8.1 |
0.75 | 943244 | 827360 | Pentium 4 (f05) | 1406 | gcc 2.95.3 |
0.70 | 1080160 | 933272 | Pentium 4 (f05) | 1406 | gcc 2.95.3 |
0.75 | 1120824 | 985097 | Pentium (525) | 90 | egcs 2.91.66 |
0.75 | 1166080 | 1019027 | PowerPC RS64-III | 668 | gcc 3.0.1 |
0.70 | 1332077 | 1164829 | Pentium (525) | 90 | gcc 2.8.1 |
0.70 | 1334910 | 1146536 | Pentium (525) | 90 | egcs 2.91.66 |
0.75 | 1355344 | 1170368 | PowerPC 7410 | 533 | gcc 2.95.2 |
0.70 | 1355530 | 1187282 | Pentium (52c) | 133 | gcc 2.8.1 |
0.70 | 2453408 | 2133621 | K6-2 (58c) | 501 | gcc 2.95.2 |
0.70 | ~4100000 | ~3600000 | SuperSPARC | 50 | gcc 2.95.2 |
0.65 | ~4300000 | ~3800000 | 486 | 33 | egcs 2.91.66 |
The 31 remaining numbers on each line are times for 31 more function calls in quick succession. Variations in the 56 and 28 lines reflect the number of nibbles equal to 0 in different exponents. Much smaller variations in the 56high and 28high lines reflect branch-prediction penalties, instruction-reordering penalties, etc. Occasional jumps in timing reflect interrupts.
Times are expressed in ticks. There are several possibilities for ticks, from most informative to least informative:
53 + | 53 +* | 53 +,* | 64 + | 64 +* | 64 +,* | |
---|---|---|---|---|---|---|
Athlon | 1 | 1 | 1 | 1 | 1 | 1 |
Pentium 4 | 1 | 1 (SSE2) | 1 (SSE2) | 1 | 2 | 2 |
Pentium Pro/II/III | 1 | 2 | 2 | 1 | 2 | 2 |
Pentium | 1 | 2 | 2 | 1 | 2 | 2 |
Alpha | 1 | 1 | 1 | |||
UltraSPARC | 1 | 1 | 1 | |||
PowerPC 7450 | 1.25 | 1.25 | 2.50 |
Most of the time in my 28high and 56high computations is spent in thousands of calls to the core f2, f2g, fg, fgh, fg8h2, and fghi functions:
f2 | f2g | fg | fgh | fg8h2 | fghi | |
---|---|---|---|---|---|---|
28high | 1015 | 511 | 1031 | 119 | 225 | 59 |
56high | 799 | 511 | 882 | 120 | 225 | 59 |
The core functions each involve hundreds of floating-point additions and multiplications:
ppro 64 | pentium 64 | sparc 53 | powerpc 53 | |
---|---|---|---|---|
f2 +/+*/+,* | 63/0/52 | 63/0/52 | 60/0/100 | 60/80/20 |
f2g +/+*/+,* | 71/0/52 | 63/0/52 | 60/0/100 | 60/80/20 |
fg +/+*/+,* | 56/0/80 | 56/0/80 | 49/0/166 | 49/139/27 |
fgh +/+*/+,* | 64/0/80 | 56/0/80 | 49/0/166 | 49/139/27 |
fg8h2 +/+*/+,* | 64/0/116 | 48/0/144 | 37/0/310 | 37/273/37 |
fghi +/+*/+,* | 48/0/144 | 48/0/144 | 37/0/310 | 37/273/37 |
The core functions involve fewer floating-point operations under ppro and pentium than under sparc and powerpc, because 64-bit operations do more work than 53-bit operations.
pentium, sparc, and powerpc handle f2, fg, and fg8h2 using f2g, fgh, and fghi. I could save about 4% of the floating-point operations by implementing f2, fg, and fg8h2 separately. This would be a bad idea on the original Pentium, which has an 8K instruction cache, and it would probably be a bad idea on the UltraSPARC, which has a 16K instruction cache and a verbose instruction encoding. It would be a good idea on the Pentium MMX and on the PowerPC.
Athlon. The 56high target is 343798 cycles for 343798 additions in the core functions. I'm currently using about 605000 cycles.
I've started optimizing code for this chip. Cycle counts for the core functions in version 0.75 (including function call and timing overhead) are 183, 192, 252, 263, 328, and 340, including 115, 123, 136, 144, 180, and 192 additions. Cycle counts for the development version are 173, 178, 207, 215, 287, and 297, including 115, 123, 136, 144, 188, and 200 additions; about a 1.14x improvement overall.
UltraSPARC. The target is 523578 cycles for 523578 additions. I'm currently using about 678000 cycles.
Pentium Pro/II/III. The target is 526674 cycles for 182876 multiplications and 343798 additions. I'm currently using about 736000 cycles.
Pentium 4. The target is 526674 cycles, as on the Pentium Pro/II/III. The operation latencies are much worse, so it's not a surprise that I'm currently using about 839000 cycles. I don't know whether SSE2 would be faster.
Pentium. The current code is pretty damn good. According to my scheduling tools, the core functions spend only 2.5% of their time in stalls. It might be possible to remove a few more loads and stores.
RS64-III. The target is, if I understand this chip correctly, 734175 cycles. I'm currently using about 1050000 cycles.
More than 100000 cycles are being wasted by the PowerPC callee-save convention. This problem is fairly tricky to work around by hand, and fairly easy to fix in the compiler. I've complained to the gcc authors.
Better scheduling might replace 9, 16, and 16 +,*'s with +*'s in f2g, fgh, and fghi respectively. Also, as noted above, current PowerPC chips have a big enough cache for separate implementations of f2, fg, and fg8h2. These changes would reduce the target.
1.2 million Pentium-II cycles by Brown, Hankerson, Hernandez, and Menezes (2000) for NIST P-224, a random curve over a field of size 2^224-2^96+1. (The same code takes 1.8 million Pentium cycles, or 2.7 million Pentium-4 cycles.)
5.1 million UltraSPARC cycles by Cohen, Miyaji, and Ono (1998) for a random curve over a field of size 2^224-1025.
7.7 million UltraSPARC cycles by Lopez and Dahab (1999) for a random curve over a field of size 2^239.
1.2 million Pentium-II cycles by Hankerson, Hernandez, and Menezes (2000) for NIST K-233, a structured curve over a field of size 2^233.
3.6 million UltraSPARC cycles by Cohen, Miyaji, and Ono (1998) for a random curve over a field of size 2^192-3345.
4.2 million Pentium Pro cycles by De Win, Bosselaers, Vanderberghe, De Gersem, and Vandewalle (1998) for a random curve over a field of prime size near 2^192.
10 million Pentium Pro cycles by De Win, Bosselaers, Vanderberghe, De Gersem, and Vandewalle (1998) for a random curve over a field of size 2^191.
4.8 million UltraSPARC cycles by Lopez and Dahab (1999) for a random curve over a field of size 2^191.
0.78 million Pentium-II cycles by Kobayashi, Morita, Kobayashi, and Horito (1999) for a curve over a field of size (2^31-1)^6.
3.1 million Pentium cycles by Bailey and Paar (1999) for a random curve over a field of size (2^31-1)^6.
0.50 million Alpha-21264 cycles by Kobayashi, Morita, Kobayashi, and Horito (1999) for a curve over a field of size (2^61-1)^3.
0.65 million Alpha-21264 cycles by Bailey and Paar (1999) for a random curve over a field of size (2^61-1)^3.
12.8 million Pentium cycles by De Win, Mister, Preneel, and Wiener (1996) for a random curve over a field of size 2^177.
9.6 million Pentium cycles by De Win, Mister, Preneel, and Wiener (1996) for a random curve over a field of size 2^176. Note that 176 has a factor of 16.
7.5 million Pentium-II cycles by Aydos, Savas, and Koc (1999) for a random curve over a field of size 2^176.
11.8 million Alpha cycles by Guajardo and Paar (1998) for a curve over a field of size 2^176.
0.61 million Pentium-III cycles by Harley (2001) for a random curve over a field of size 2^163.
4.1 million UltraSPARC cycles by Lopez and Dahab (1999) for a random curve over a field of size 2^163.
2.3 million UltraSPARC cycles by Cohen, Miyaji, and Ono (1998) for a random curve over a field of size 2^160-2933.
0.58 million Pentium-II cycles by Hankerson, Hernandez, and Menezes (2000) for NIST K-163, a structured curve over a field of size 2^163.
1.2 million UltraSPARC cycles by Certicom (1998) for a curve over a field of size 2^163, perhaps NIST K-163. (This is the most recent Certicom timing I've found, from a Security Builder 1.2 advertisement. Certicom appears to have stopped publishing any concrete timings.)