This page is out of date. Several further speed improvements appear in the Salsa20 stream cipher software. Some of the optimizations are described in more detail in the ``Salsa20 speed'' document.
salsa20_word_athlon takes 604 Athlon cycles, including function-call overhead and 11 cycles timing overhead; timing overhead was subtracted from the cycles/byte figure above. An average round takes 26.75 cycles. The compiled code occupies 1280 bytes.
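The overhead subtraction can be sketched as follows. This is only an illustration: the helper name `net_cycles` is hypothetical, and the 64-byte block size used for the cycles/byte conversion is an assumption (the size of one Salsa20 output block), not something stated on this page.

```python
# Hypothetical helper: remove the timing overhead from a raw measurement.
# Function-call overhead is NOT removed; it stays in the net figure.
def net_cycles(measured, timing_overhead):
    return measured - timing_overhead

# Assumption: cycles/byte is taken over one 64-byte Salsa20 block.
BLOCK_BYTES = 64

athlon = net_cycles(604, 11)           # salsa20_word_athlon figures quoted above
print(athlon)                          # 593
print(round(athlon / BLOCK_BYTES, 2))  # 9.27
```

The same subtraction applies to the other chips, each with its own timing overhead.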
One could shoot for 500 Athlon cycles, considering the total number of instructions that need to be carried out.
salsa20_word_pii takes 872 Pentium III cycles, including function-call overhead and 35 cycles timing overhead; timing overhead was subtracted from the cycles/byte figure above. The compiled code occupies 1280 bytes.
salsa20_word_pii actually takes 859 cycles most of the time but 908 cycles on every fourth call, presumably because of branch mispredictions. An average double-round takes 75 cycles.
For comparison, salsa20_word_pm takes about 1300 Pentium III cycles. One could shoot for about 730 Pentium III cycles, considering the total number of operations and the total number of integer operations that need to be carried out; maybe a bit better with PADDD etc.
salsa20_word_p4 takes 1136 Pentium 4 Willamette cycles, including function-call overhead and 84 cycles timing overhead; timing overhead was subtracted from the cycles/byte figure above. The compiled code occupies 1144 bytes.
salsa20_word_pm takes 790 Pentium M cycles, including function-call overhead and 50 cycles timing overhead; timing overhead was subtracted from the cycles/byte figure above. The compiled code occupies 1248 bytes.
salsa20_word_pm actually takes 780 or 781 cycles most of the time but 856 cycles on every eighth call, presumably because of branch mispredictions. An average double-round takes 67.5 cycles.
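As a sanity check, the periodic slow calls reported for salsa20_word_pii and salsa20_word_pm can be averaged to see whether they agree with the headline figures. The helper name `average_cycles` is hypothetical, and treating "780 or 781" as 780.5 is an assumption made only for this sketch.

```python
# Hypothetical helper: mean cost per call when one call in every `period`
# calls takes `slow` cycles and the remaining calls take `fast` cycles.
def average_cycles(fast, slow, period):
    return (fast * (period - 1) + slow) / period

pii = average_cycles(859, 908, 4)    # salsa20_word_pii on the Pentium III
pm = average_cycles(780.5, 856, 8)   # salsa20_word_pm; 780.5 = midpoint of 780 and 781
print(pii)  # 871.25
print(pm)   # 789.9375
```

Both averages round up to the reported totals of 872 and 790 cycles.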
For comparison, salsa20_word_pii takes about 830 Pentium M cycles. One could shoot for about 650 Pentium M cycles with x86 instructions; maybe a bit better with MMX/XMM instructions.
salsa20_word_aix takes 770 PowerPC RS64 IV cycles, including function-call overhead and 14 cycles timing overhead; timing overhead was subtracted from the cycles/byte figure above. The compiled code occupies 768 bytes.
Each double-round takes 66 PowerPC RS64 IV cycles. There is an obvious bottleneck of 64 cycles for 128 integer operations in each double-round, i.e., 2 integer operations per cycle: each rotation instruction counts as 2 integer operations on the PowerPC RS64 IV (unlike IBM's newer chips such as the 970), so each quarter-round, with its 4 additions, 4 xors, and 4 rotations, needs 16 integer operations, and the 8 quarter-rounds of a double-round need 128.
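The operation count can be checked with a small instrumented sketch. It assumes the standard Salsa20 quarter-round (rotation distances 7, 9, 13, 18). The counting wrappers `add`, `xor`, and `rotl` are hypothetical names introduced for illustration, not part of any implementation discussed here.

```python
from collections import Counter

MASK = 0xFFFFFFFF
ops = Counter()  # tallies operations by kind

def add(x, y):
    ops["add"] += 1
    return (x + y) & MASK

def xor(x, y):
    ops["xor"] += 1
    return x ^ y

def rotl(x, n):
    ops["rotate"] += 1
    return ((x << n) | (x >> (32 - n))) & MASK

def quarterround(y0, y1, y2, y3):
    # Each of the four steps is one addition, one rotation, and one xor.
    y1 = xor(y1, rotl(add(y0, y3), 7))
    y2 = xor(y2, rotl(add(y1, y0), 9))
    y3 = xor(y3, rotl(add(y2, y1), 13))
    y0 = xor(y0, rotl(add(y3, y2), 18))
    return y0, y1, y2, y3

result = quarterround(1, 0, 0, 0)
ROTATE_COST = 2  # RS64 IV: a rotation counts as 2 integer operations
per_quarter = ops["add"] + ops["xor"] + ROTATE_COST * ops["rotate"]
print(per_quarter)      # 16
print(8 * per_quarter)  # 128 (8 quarter-rounds per double-round)
```

At 2 integer operations per cycle, 128 operations give the 64-cycle bound quoted above.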
salsa20_word_macos takes approximately 584 PowerPC 7410 cycles, including function-call overhead and approximately 14 cycles timing overhead; timing overhead was subtracted from the cycles/byte figure above. The compiled code occupies 768 bytes.
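Each double-round takes 49 PowerPC 7410 cycles. There is an obvious bottleneck of 48 cycles for 96 integer operations in each double-round: here each rotation counts as a single integer operation, so each quarter-round needs 12 integer operations.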
Matthijs van Duin reports that an AltiVec implementation of Salsa20 is ``almost twice as fast as djb's non-altivec G4-tuned assembly implementation.''
salsa20_word_sparc takes 892 UltraSPARC II cycles, including function-call overhead and 11 cycles timing overhead; timing overhead was subtracted from the cycles/byte figure above. The compiled code occupies 936 bytes.
Each double-round takes 81 UltraSPARC II cycles. There is an obvious bottleneck of 80 cycles for 160 integer instructions in each double-round: each rotation needs 3 integer instructions, so each quarter-round needs 20 integer instructions.
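A 32-bit rotation built from three simple operations might look like the sketch below. The exact sequence (two shifts and an OR) is an assumption about how the 3 integer instructions are spent; the function name `rotl32_three_ops` is hypothetical, and the mask only models a 32-bit register in Python.

```python
MASK = 0xFFFFFFFF

def rotl32_three_ops(x, n):
    # Valid for 1 <= n <= 31.
    hi = (x << n) & MASK   # 1: shift left (mask models the 32-bit register)
    lo = x >> (32 - n)     # 2: shift right
    return hi | lo         # 3: combine the two halves

print(hex(rotl32_three_ops(0x80000001, 7)))  # 0xc0
```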
On the UltraSPARC III, each double-round takes 82 cycles. The bottleneck is the same as on the UltraSPARC II: 80 cycles for 160 integer instructions in each double-round, since each rotation needs 3 integer instructions and each quarter-round therefore needs 20.
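The bottleneck figures for the PowerPC and UltraSPARC chips all follow the same arithmetic, which can be sketched as below. The function name `double_round_bound` is hypothetical. The model assumes 4 additions, 4 xors, and 4 rotations per quarter-round and 2 integer operations per cycle; the rotation cost of 1 for the PowerPC 7410 is inferred from the 96-operation figure rather than stated directly.

```python
# Hypothetical model: integer operations per double-round and the cycle
# bound they imply when `ops_per_cycle` integer operations issue per cycle.
def double_round_bound(rotate_cost, ops_per_cycle=2):
    per_quarter = 4 + 4 + 4 * rotate_cost  # additions + xors + rotations
    per_double = 8 * per_quarter           # 8 quarter-rounds per double-round
    return per_double, per_double // ops_per_cycle

# (rotation cost, measured cycles per double-round) from the figures above
chips = {
    "PowerPC RS64 IV": (2, 66),
    "PowerPC 7410": (1, 49),
    "UltraSPARC II": (3, 81),
    "UltraSPARC III": (3, 82),
}
for name, (cost, measured) in chips.items():
    total, bound = double_round_bound(cost)
    print(f"{name}: {total} ops, bound {bound}, measured {measured}, "
          f"slack {measured - bound}")
```

Under these assumptions every measured double-round is within 2 cycles of its bound.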