See also Salsa20.
The sequence of modifications is defined by the original Salsa10 specification, published 2004.11.11 as part of my paper ``Cache-timing attacks on AES.''
The 970 integer operations in Salsa10 take at least 485 CPU cycles on a CPU limited to 2 integer operations per cycle, and at least 324 CPU cycles on a CPU limited to 3 integer operations per cycle.
x4 ^= (x0 + x12) <<< 6;
x8 ^= (x4 + x0 ) <<< 17;
x12 += (x8 | x4 ) <<< 16;
x0 += (x12 ^ x8 ) <<< 5;
On a typical CPU,
this chain of 12 operations has a latency of at least 12 cycles,
so at least 2 lines must be performed in parallel
to keep up with 2 integer operations per cycle,
and at least 3 lines must be performed in parallel
to keep up with 3 integer operations per cycle.
r = x12;
s = x0;
r += s; r <<<= 6;
r ^= x4; x4 = r;
s += r; s <<<= 17;
s ^= x8; x8 = s;
r |= s; r <<<= 16;
r += x12; x12 = r;
s ^= r; s <<<= 5;
x0 += s;
If r, s are registers while x0, x4, x8, x12 are memory locations
then this sequence involves 12 integer operations,
6 loads (4 combined with integer operations),
and 4 stores (1 combined with an integer operation).
On a CPU limited to 3 instructions per cycle, these 17 instructions take at least 5.66666 cycles. On a CPU limited to 3 operations per cycle, these 22 operations take at least 7.33333 cycles. On a CPU limited to 3 operations per cycle where a store counts as two operations, these 26 operations take at least 8.66666 cycles.
Using one or two extra registers eliminates one or two load operations, and does not cost any extra copies if the CPU has three-operand instructions. The x86 architecture has a three-operand addition, called LEA, but not a three-operand xor.
The Pentium M handles stores more efficiently. Target: around 600 cycles. Bottleneck: operations.