Week | Day | Date | Time |
---|---|---|---|
1 | Mon | 2006.10.23 | 15:30-17:00 |
1 | Tue | 2006.10.24 | 11:00-12:30, 15:30-17:00 |
1 | Wed | 2006.10.25 | 11:00-12:30, 15:30-17:00 |
1 | Thu | 2006.10.26 | 11:00-12:30, 15:30-17:00 |
1 | Fri | 2006.10.27 | 11:00-12:30 |
2 | Mon | 2006.10.30 | no lectures (students attend CCANTC workshop) |
2 | Tue | 2006.10.31 | no lectures (students attend CCANTC workshop) |
2 | Wed | 2006.11.01 | no lectures (students attend CCANTC workshop) |
2 | Thu | 2006.11.02 | no lectures (students attend CCANTC workshop) |
2 | Fri | 2006.11.03 | no lectures (students attend CCANTC workshop) |
3 | Mon | 2006.11.06 | 15:30-17:00 |
3 | Tue | 2006.11.07 | 11:00-12:30, 15:30-17:00 |
3 | Wed | 2006.11.08 | 11:00-12:30, 15:30-17:00 |
3 | Thu | 2006.11.09 | 11:00-12:30, 15:30-17:00 |
3 | Fri | 2006.11.10 | 11:00-12:30 |
4 | Mon | 2006.11.13 | no lectures (students work on course projects) |
4 | Tue | 2006.11.14 | no lectures (students work on course projects) |
4 | Wed | 2006.11.15 | no lectures (students work on course projects) |
4 | Thu | 2006.11.16 | 15:30-17:00 (project reports and discussion) |
4 | Fri | 2006.11.17 | 11:00-12:30 (project reports and discussion) |
All lectures will take place in room 230 at the Fields Institute at 222 College Street at the University of Toronto.
Course contents. Anyone with access to the Internet can intercept mail messages, web pages, and other packets of data to see what they say. An attacker can also send fake messages that look like real messages.
Cryptography protects messages sent through the Internet. Cryptography scrambles and unscrambles packets to protect against forgery and against espionage. An attacker who forges a message won't be able to scramble it in the right way, so when we unscramble it we'll see that it's a forgery and that we should throw it away. An attacker who intercepts a scrambled credit-card number won't be able to figure out the original number.
High-speed cryptography is important for busy network servers. For example, here's a quote from Cricket Liu, author of several books on the Internet's Domain Name System: ``Verifying signed resource records [i.e., performing cryptographic operations] is computationally intensive [i.e., is painfully slow]. This has slowed deployment of DNS security.'' As a practical matter, DNS security still doesn't exist; attackers can seize control over incoming mail and outgoing web pages by forging DNS records. The users need faster software to scramble DNS records.
This course will focus on the four fundamental tasks in cryptography: authenticating messages using a shared secret key; encrypting messages using a shared secret key; computing a shared secret key from a public conversation; and signing messages so that anyone can verify them.
Prerequisites and non-prerequisites. Students are expected to have some experience with algorithm analysis (figuring out why a computer hasn't finished a computation yet) and algorithm improvement (making the computer finish the computation more quickly). A typical student will have studied some chapters of a textbook by Cormen, Leiserson, Rivest, and Stein; will have heard that heapsort ``takes time n log n times, um, some constant''; and will have seen many algorithm improvements incorrectly described as ``optimizations.''
The most obvious difference in flavor between this course and a typical algorithm course is that this course will pay attention to constants. For example, this course will explain how a state-of-the-art function takes half a millisecond on my laptop to compute a high-security shared secret from a public conversation. Ignoring constant factors would remove all content from this statement.
Of course, these constants depend on the choice of computer. Sometimes one computer takes 10 clock cycles for an operation while another computer takes just 1 clock cycle; this affects the time for any algorithm built on top of that operation. Students might have seen some computer-dependent algorithm analysis and algorithm improvement in courses on architecture, assembly language, ``optimizing'' compilers, supercomputing, etc. None of those courses are prerequisites; this course will include a detailed introduction to the speed of real computers.
Many higher-level speedups are also constant-factor speedups. In fact, most of the speedups in the algorithm-research literature are constant-factor speedups. However, students will not be presumed to have any previous experience handling constant factors in algorithm analysis.
The other obvious difference in flavor between this course and other algorithm courses is the choice of functions being computed. A typical introductory algorithm course spends time on sorting, for example, and on various graph operations such as shortest paths. This course instead focuses on a few critical cryptographic operations. Those operations rely on some mathematics that students might have seen in previous courses on algebra, number theory, or cryptography. None of those are prerequisites; this course will include an introduction to the relevant mathematics.
Here's a portion of the latest qhasm rewrite: qhasm-20061027.tar.gz. This doesn't include qhasm's cycle counter, range analyzer, and scheduler, but it does include qhasm's instruction decoder, register allocator (completely rewritten this week to be much faster), x86 floating-point stack handler, and assembler. What this means, in a nutshell, is that you can write .q files and feed them to qhasm-x86 to produce .s files in traditional x86 assembly language. Those .s files, in turn, can be linked easily with C and C++ programs.
These tools are nowhere near the polished state that I like to reach before releasing software. I haven't even had time for minimal tests! I nevertheless feel completely comfortable recommending qhasm as the programming environment for the course projects, simply because the alternative environments do such a bad job of producing high-speed software. If something in qhasm doesn't work, let me know, and I'll have it fixed promptly.
An example of a .q file is addvector.q. To turn this into addvector.s, first download and compile qhasm:
    wget https://cr.yp.to/qhasm/qhasm-20061027.tar.gz
    gunzip < qhasm-20061027.tar.gz | tar -xf -
    cd qhasm-20061027
    ./do

Then download addvector.q in the qhasm-20061027 directory and feed it through qhasm-x86:
    wget https://cr.yp.to/qhasm/addvector.q
    ./qhasm-x86 < addvector.q > addvector.s

There's an addvector.c program that tests the result:
    wget https://cr.yp.to/qhasm/addvector.c
    gcc -o addvector addvector.c addvector.s
    ./addvector

To see more of what can be in a .q file, look at the X86 file in the qhasm-20061027 directory. The X86 file lists .q instructions and then, after a colon, information about the translation of those instructions into traditional assembly language.
In response to requests for supporting additional CPUs, I've posted a new version of qhasm that includes qhasm-ppc32-linux, qhasm-ppc32-macos, qhasm-sparc, and qhasm-x86. Here's how to download and compile the new version:
    wget https://cr.yp.to/qhasm/qhasm-20061102.tar.gz
    gunzip < qhasm-20061102.tar.gz | tar -xf -
    cd qhasm-20061102
    ./do

Beware that this version also has a new syntax for handling CPU flags. Previously you were allowed to say things like
    counter -= 1
    ...
    goto topofloop if !=

where the counter -= 1 instruction implicitly sets the CPU's equality flag and the goto instruction explicitly uses the flag. You now have to explicitly indicate the flags that you'll be testing later:
    =? counter -= 1
    ...
    goto topofloop if !=

This allows qhasm to check for a common assembly-language error, namely intermediate instructions (...) that implicitly set the flag; there are quite a few flag-setting instructions, especially on the x86 architecture. The new syntax also allows comprehensible selection of different flag options on architectures providing many options, such as the PowerPC.
This version of qhasm needs even more testing than the previous version. I'll be doing a lot of testing this weekend: one of my projects for the next few days is writing a new implementation of the Curve25519 elliptic-curve-Diffie-Hellman function, one of the functions that we'll explore next week. The original Curve25519 software is written in x86 (Pentium, Athlon, etc.) assembly language and relies heavily on 80-bit (extended-precision, 64-bit mantissa) floating-point registers; the new project is to handle other CPUs (UltraSPARC, PowerPC, etc.) using 64-bit (double-precision, 53-bit mantissa) floating-point registers. I'm planning to post code samples here---stay tuned.
How can I efficiently decompose multiplication mod 2^255-19 into the smaller instructions available on a real CPU? The answer depends on the CPU. Let's start with my teaching tool, the UltraSPARC III, before moving on to more popular CPUs.
Recall that the computational highlight of the UltraSPARC is the floating-point multiplier, which can very quickly compute the product of two double-precision floating-point numbers. The multiplier rounds its output to 53-bit precision, but it guarantees exact results for products that already fit into 53 bits, such as integers in the range {-2^53+1,...,-1,0,1,...,2^53-1}. The multiplier's maximum throughput is one product per CPU cycle.
Experience tells me that the floating-point adder is another bottleneck. The adder's maximum throughput is one addition per CPU cycle. So I'll carefully count the number of floating-point additions and the number of floating-point multiplications. Hopefully there won't be any other bottlenecks.
Integer representation. I'll try breaking 255-bit numbers into 12 pieces, each piece having 22 bits. Two pieces will have a 44-bit product, so I should be able to add quite a few of those products without overflowing the 53-bit mantissa of a double-precision floating-point number. Of course, I'll have to carefully check the final code to guarantee that there aren't any accidental overflows, and in the end it might turn out that more than 12 pieces are necessary, but for the moment I'll simply assume that 12 pieces will work.
You might object that 255 isn't divisible by 12: the ratio 255/12 = 21.25 isn't an integer. How exactly do I break a 255-bit number into 12 pieces? Answer: I'll write a 255-bit number f as a sum f_0 + f_22 + f_43 + f_64 + f_85 + f_107 + f_128 + f_149 + f_170 + f_192 + f_213 + f_234 where each f_i is a small multiple of 2^i. The indices i here, namely 0 22 43 etc., are ceilings of multiples of 21.25.
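To make the index pattern concrete, here's a small C sketch (mine, not part of the course software) that computes ceil(21.25 i) = ceil(85i/4) for i = 0, 1, ..., 11 and prints the 12 exponents used above:

    #include <stdio.h>

    int main(void)
    {
      int i;
      for (i = 0;i < 12;++i)
        printf("%d ",(85 * i + 3) / 4); /* ceiling of 85i/4 in integer arithmetic */
      printf("\n"); /* prints 0 22 43 64 85 107 128 149 170 192 213 234 */
      return 0;
    }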
Each f_i fits easily into a floating-point register, so f is represented as a vector of 12 values (f_0,f_22,...,f_234) stored in floating-point registers. All required operations on integers modulo 2^255-19 will be split into floating-point operations.
Addition: 12 fadd + 0 fmul. If f = f_0 + f_22 + f_43 + f_64 + f_85 + f_107 + f_128 + f_149 + f_170 + f_192 + f_213 + f_234 and g = g_0 + g_22 + g_43 + g_64 + g_85 + g_107 + g_128 + g_149 + g_170 + g_192 + g_213 + g_234 then f + g = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234 where h_i = f_i + g_i.
Computing (h_0,h_22,...,h_234) given (f_0,f_22,...,f_234) and (g_0,g_22,...,g_234) takes 12 floating-point additions if f_i and g_i are sufficiently small. Of course, the bound on h_i is larger than the bounds on f_i and g_i, so a series of additions requires occasional carries, but there aren't any chains of additions in Curve25519.
Here's a simple C function that reads (f_0,f_22,...,f_234) from f[0],f[1],...,f[11], reads (g_0,g_22,...,g_234) from g[0],g[1],...,g[11], and puts (h_0,h_22,...,h_234) into h[0],h[1],...,h[11]:
    void curve25519_add(double h[12],const double f[12],const double g[12])
    {
      h[0] = f[0] + g[0];
      h[1] = f[1] + g[1];
      h[2] = f[2] + g[2];
      h[3] = f[3] + g[3];
      h[4] = f[4] + g[4];
      h[5] = f[5] + g[5];
      h[6] = f[6] + g[6];
      h[7] = f[7] + g[7];
      h[8] = f[8] + g[8];
      h[9] = f[9] + g[9];
      h[10] = f[10] + g[10];
      h[11] = f[11] + g[11];
    }

Perhaps I should comment that, in C, array parameters are actually pointers, so h[0],...,h[11] are stored in the location specified by the function caller. I could have written the same function as
    void curve25519_add(double *h,const double *f,const double *g)
    {
      h[0] = f[0] + g[0];
      h[1] = f[1] + g[1];
      h[2] = f[2] + g[2];
      h[3] = f[3] + g[3];
      h[4] = f[4] + g[4];
      h[5] = f[5] + g[5];
      h[6] = f[6] + g[6];
      h[7] = f[7] + g[7];
      h[8] = f[8] + g[8];
      h[9] = f[9] + g[9];
      h[10] = f[10] + g[10];
      h[11] = f[11] + g[11];
    }

and would have had to do so to make the function also work in C++. On the other hand, I like seeing [12] as a reminder that I'm working with 12-component vectors.
Anyway, the line
    h[0] = f[0] + g[0];

actually produces four CPU instructions: load f[0] into a register, load g[0] into a register, add the two registers, and store the result in h[0]. To see these CPU instructions more clearly in the code, I split the line into several lines:
    double f0;    /* set aside double-precision storage location f0 */
    double g0;    /* set aside double-precision storage location g0 */
    double h0;    /* set aside double-precision storage location h0 */
    f0 = f[0];    /* copy 8 bytes from memory location f[0] to register f0 */
    g0 = g[0];    /* copy 8 bytes from memory location g[0] to register g0 */
    h0 = f0 + g0; /* add register f0 to register g0 */
    h[0] = h0;    /* copy result from register h0 to memory location h[0] */

I did the same for the other lines, producing the following equivalent C function:
    void curve25519_add(double *h,const double *f,const double *g)
    {
      double f0; double f1; double f2;  double f3;
      double f4; double f5; double f6;  double f7;
      double f8; double f9; double f10; double f11;
      double g0; double g1; double g2;  double g3;
      double g4; double g5; double g6;  double g7;
      double g8; double g9; double g10; double g11;
      double h0; double h1; double h2;  double h3;
      double h4; double h5; double h6;  double h7;
      double h8; double h9; double h10; double h11;
      f0 = f[0];  f1 = f[1];  f2 = f[2];   f3 = f[3];
      f4 = f[4];  f5 = f[5];  f6 = f[6];   f7 = f[7];
      f8 = f[8];  f9 = f[9];  f10 = f[10]; f11 = f[11];
      g0 = g[0];  g1 = g[1];  g2 = g[2];   g3 = g[3];
      g4 = g[4];  g5 = g[5];  g6 = g[6];   g7 = g[7];
      g8 = g[8];  g9 = g[9];  g10 = g[10]; g11 = g[11];
      h0 = f0 + g0;  h1 = f1 + g1;  h2 = f2 + g2;    h3 = f3 + g3;
      h4 = f4 + g4;  h5 = f5 + g5;  h6 = f6 + g6;    h7 = f7 + g7;
      h8 = f8 + g8;  h9 = f9 + g9;  h10 = f10 + g10; h11 = f11 + g11;
      h[0] = h0;  h[1] = h1;  h[2] = h2;   h[3] = h3;
      h[4] = h4;  h[5] = h5;  h[6] = h6;   h[7] = h7;
      h[8] = h8;  h[9] = h9;  h[10] = h10; h[11] = h11;
    }

You might be wondering how the UltraSPARC, with only 32 double-precision floating-point registers, can handle the 36 values f0,f1,...,g0,g1,...,h0,h1,... declared here. The answer is that several values can be assigned to the same register if they aren't ``alive'' at the same time. For example, I peeked at the C compiler's output and saw that it had translated
    f0 = f[0];
    g0 = g[0];
    h0 = f0 + g0;
    h[0] = h0;

into (essentially)

    %f8 = f[0];
    %f32 = g[0];
    %f8 = %f8 + %f32;
    h[0] = %f8;

putting both f0 and h0 into register %f8.
The above C code is still hiding an important aspect of the corresponding CPU instructions, namely the multiplication involved in array indexing. Loading f[i] from memory means taking the supplied address f, adding 8*i bytes to it to find the address of f[i], and then loading the 8-byte floating-point number at that address. To see the CPU instructions more clearly, I replaced the double pointers by byte pointers:
    void curve25519_add(char *h,char *f,char *g)
    {
      double f0; double f1; double f2;  double f3;
      double f4; double f5; double f6;  double f7;
      double f8; double f9; double f10; double f11;
      double g0; double g1; double g2;  double g3;
      double g4; double g5; double g6;  double g7;
      double g8; double g9; double g10; double g11;
      double h0; double h1; double h2;  double h3;
      double h4; double h5; double h6;  double h7;
      double h8; double h9; double h10; double h11;
      f0 = *(double *) (f + 0);
      f1 = *(double *) (f + 8);
      f2 = *(double *) (f + 16);
      f3 = *(double *) (f + 24);
      f4 = *(double *) (f + 32);
      f5 = *(double *) (f + 40);
      f6 = *(double *) (f + 48);
      f7 = *(double *) (f + 56);
      f8 = *(double *) (f + 64);
      f9 = *(double *) (f + 72);
      f10 = *(double *) (f + 80);
      f11 = *(double *) (f + 88);
      g0 = *(double *) (g + 0);
      g1 = *(double *) (g + 8);
      g2 = *(double *) (g + 16);
      g3 = *(double *) (g + 24);
      g4 = *(double *) (g + 32);
      g5 = *(double *) (g + 40);
      g6 = *(double *) (g + 48);
      g7 = *(double *) (g + 56);
      g8 = *(double *) (g + 64);
      g9 = *(double *) (g + 72);
      g10 = *(double *) (g + 80);
      g11 = *(double *) (g + 88);
      h0 = f0 + g0;  h1 = f1 + g1;  h2 = f2 + g2;    h3 = f3 + g3;
      h4 = f4 + g4;  h5 = f5 + g5;  h6 = f6 + g6;    h7 = f7 + g7;
      h8 = f8 + g8;  h9 = f9 + g9;  h10 = f10 + g10; h11 = f11 + g11;
      *(double *) (h + 0) = h0;
      *(double *) (h + 8) = h1;
      *(double *) (h + 16) = h2;
      *(double *) (h + 24) = h3;
      *(double *) (h + 32) = h4;
      *(double *) (h + 40) = h5;
      *(double *) (h + 48) = h6;
      *(double *) (h + 56) = h7;
      *(double *) (h + 64) = h8;
      *(double *) (h + 72) = h9;
      *(double *) (h + 80) = h10;
      *(double *) (h + 88) = h11;
    }

The corresponding qhasm code, curve25519_add.q, is very similar:
    enter curve25519_add
    int64 h
    int64 f
    int64 g
    input h
    input f
    input g
    float64 f0
    float64 f1
    float64 f2
    float64 f3
    float64 f4
    float64 f5
    float64 f6
    float64 f7
    float64 f8
    float64 f9
    float64 f10
    float64 f11
    float64 g0
    float64 g1
    float64 g2
    float64 g3
    float64 g4
    float64 g5
    float64 g6
    float64 g7
    float64 g8
    float64 g9
    float64 g10
    float64 g11
    float64 h0
    float64 h1
    float64 h2
    float64 h3
    float64 h4
    float64 h5
    float64 h6
    float64 h7
    float64 h8
    float64 h9
    float64 h10
    float64 h11
    f0 = *(float64 *) (f + 0)
    f1 = *(float64 *) (f + 8)
    f2 = *(float64 *) (f + 16)
    f3 = *(float64 *) (f + 24)
    f4 = *(float64 *) (f + 32)
    f5 = *(float64 *) (f + 40)
    f6 = *(float64 *) (f + 48)
    f7 = *(float64 *) (f + 56)
    f8 = *(float64 *) (f + 64)
    f9 = *(float64 *) (f + 72)
    f10 = *(float64 *) (f + 80)
    f11 = *(float64 *) (f + 88)
    g0 = *(float64 *) (g + 0)
    g1 = *(float64 *) (g + 8)
    g2 = *(float64 *) (g + 16)
    g3 = *(float64 *) (g + 24)
    g4 = *(float64 *) (g + 32)
    g5 = *(float64 *) (g + 40)
    g6 = *(float64 *) (g + 48)
    g7 = *(float64 *) (g + 56)
    g8 = *(float64 *) (g + 64)
    g9 = *(float64 *) (g + 72)
    g10 = *(float64 *) (g + 80)
    g11 = *(float64 *) (g + 88)
    h0 = f0 + g0
    h1 = f1 + g1
    h2 = f2 + g2
    h3 = f3 + g3
    h4 = f4 + g4
    h5 = f5 + g5
    h6 = f6 + g6
    h7 = f7 + g7
    h8 = f8 + g8
    h9 = f9 + g9
    h10 = f10 + g10
    h11 = f11 + g11
    *(float64 *) (h + 0) = h0
    *(float64 *) (h + 8) = h1
    *(float64 *) (h + 16) = h2
    *(float64 *) (h + 24) = h3
    *(float64 *) (h + 32) = h4
    *(float64 *) (h + 40) = h5
    *(float64 *) (h + 48) = h6
    *(float64 *) (h + 56) = h7
    *(float64 *) (h + 64) = h8
    *(float64 *) (h + 72) = h9
    *(float64 *) (h + 80) = h10
    *(float64 *) (h + 88) = h11
    leave

This qhasm code says int64 h to set aside a 64-bit integer register for h, float64 h3 to set aside a 64-bit floating-point register for h3, etc. Like a C compiler, the qhasm register allocator can put several values into one register if the values' lifetimes don't overlap; the qhasm register allocator is extremely good at this. Unlike a C compiler, the qhasm register allocator doesn't silently insert loads and stores if it can't figure out how to fit all live values into registers; loads and stores have to be explicitly indicated by higher-level tools or by users.
After writing curve25519_add.q I wrote curve25519_addtest.c to test the addition function on a not-very-random input:
    #include <stdio.h>
    #include <stdlib.h> /* for random() */

    double powers[12] = {
      1.0 /* 2^0 */
    , 4194304.0 /* 2^22 */
    , 8796093022208.0 /* 2^43 */
    , 18446744073709551616.0 /* 2^64 */
    , 38685626227668133590597632.0 /* 2^85 */
    , 162259276829213363391578010288128.0 /* 2^107 */
    , 340282366920938463463374607431768211456.0 /* 2^128 */
    , 713623846352979940529142984724747568191373312.0 /* 2^149 */
    , 1496577676626844588240573268701473812127674924007424.0 /* 2^170 */
    , 6277101735386680763835789423207666416102355444464034512896.0 /* 2^192 */
    , 13164036458569648337239753460458804039861886925068638906788872192.0 /* 2^213 */
    , 27606985387162255149739023449108101809804435888681546220650096895197184.0 /* 2^234 */
    } ;

    double f[12];
    double g[12];
    double h[12];

    extern void curve25519_add(double [12],const double [12],const double [12]);

    int main(int argc,char **argv)
    {
      int i;
      for (i = 0;i < 12;++i) f[i] = random() * powers[i];
      printf("f=0"); for (i = 0;i < 12;++i) printf("%+.0f",f[i]); printf(";\n");
      for (i = 0;i < 12;++i) g[i] = random() * powers[i];
      printf("g=0"); for (i = 0;i < 12;++i) printf("%+.0f",g[i]); printf(";\n");
      curve25519_add(h,f,g);
      printf("h=0"); for (i = 0;i < 12;++i) printf("%+.0f",h[i]); printf(";\n");
      printf("h-(f+g)\n");
      return 0;
    }

I compiled curve25519_addtest.c together with curve25519_add.q, and fed the output of curve25519_addtest through gp:
    ./qhasm-sparc < curve25519_add.q > curve25519_add.s
    gcc -m64 -o curve25519_addtest curve25519_addtest.c curve25519_add.s
    ./curve25519_addtest | gp

The -m64 option to gcc is essential for the UltraSPARC; otherwise you end up with code compiled for the original 32-bit SPARCs. Anyway, gp reported that h-(f+g) was indeed 0. I won't worry about doing more serious tests at this point.
More operations next time.
Here's a revised curve25519_add.q in which each floating-point variable is named after its exponent: f0, f22, f43, and so on:

    float64 f0
    float64 f22
    float64 f43
    ...
    float64 g0
    float64 g22
    float64 g43
    ...
    float64 h0
    float64 h22
    float64 h43
    ...
    f0 = *(float64 *) (f + 0)
    f22 = *(float64 *) (f + 8)
    f43 = *(float64 *) (f + 16)
    ...
    g0 = *(float64 *) (g + 0)
    g22 = *(float64 *) (g + 8)
    g43 = *(float64 *) (g + 16)
    ...
    h0 = f0 + g0
    h22 = f22 + g22
    h43 = f43 + g43
    ...
    *(float64 *) (h + 0) = h0
    *(float64 *) (h + 8) = h22
    *(float64 *) (h + 16) = h43
    ...

qhasm also allows underscores in names, but I'm not interested in typing quite that much. :-)
Here's how to download and test the revised addition software on an UltraSPARC:
    wget https://cr.yp.to/highspeed/fall2006/curve25519_add.q
    wget https://cr.yp.to/highspeed/fall2006/curve25519_addtest.c
    ./qhasm-sparc < curve25519_add.q > curve25519_add.s
    gcc -m64 -o curve25519_addtest curve25519_addtest.c curve25519_add.s
    ./curve25519_addtest | gp

Now let's look at some other integer operations.
Subtraction: 12 fadd + 0 fmul. Subtraction has the same cost as addition: simply replace
    h0 = f0 + g0

with

    h0 = f0 - g0

etc. Of course, the resulting h_i is often negative, but that's okay; floating-point numbers are allowed to be negative.
Each cycle, the UltraSPARC's floating-point adder handles one addition or one subtraction. I'll count subtractions as if they were additions.
Here's the software:
    wget https://cr.yp.to/highspeed/fall2006/curve25519_sub.q
    wget https://cr.yp.to/highspeed/fall2006/curve25519_subtest.c
    ./qhasm-sparc < curve25519_sub.q > curve25519_sub.s
    gcc -m64 -o curve25519_subtest curve25519_subtest.c curve25519_sub.s
    ./curve25519_subtest | gp
Multiplication, schoolbook method: 121 fadd + 144 fmul. If f = f_0 + f_22 + f_43 + f_64 + f_85 + f_107 + f_128 + f_149 + f_170 + f_192 + f_213 + f_234 and g = g_0 + g_22 + g_43 + g_64 + g_85 + g_107 + g_128 + g_149 + g_170 + g_192 + g_213 + g_234 then fg = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234 + h_255 + h_277 + h_298 + h_319 + h_340 + h_362 + h_383 + h_404 + h_425 + h_447 + h_468 where the nth output collects every product f_a g_b whose positions satisfy a + b = n: h_0 = f_0 g_0, h_22 = f_0 g_22 + f_22 g_0, h_43 = f_0 g_43 + f_22 g_22 + f_43 g_0, and so on through h_468 = f_234 g_234. For example, h_107 = f_0 g_107 + f_22 g_85 + f_43 g_64 + f_64 g_43 + f_85 g_22 + f_107 g_0.
(It's important here that I chose my indices as ceilings of multiples of 21.25. If I had instead used the floors 0, 21, 42, 63, 85, ... then I would have run into trouble with f_0 g_85 + f_21 g_63 + f_42 g_42 + f_63 g_21 + f_85 g_0 being a small multiple of 2^84 but often not a multiple of the desired 2^85.)
The above formulas for h_0, h_22, ..., h_468 involve 12^2 = 144 multiplications (each f_i times each g_j) and 11^2 = 121 additions (which I think of as 144 for the number of products minus 23 for the number of outputs).
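For readers who want the whole convolution in one place, here's a loop-style C sketch of schoolbook multiplication on the 12-piece representation, indexing pieces by position rather than by exponent; this is an illustration of the operation count, not the course's straight-line curve25519_mul.q:

    /* h[n] collects every product f[a]*g[b] with a + b = n: 144 fmul.
       The straight-line code initializes each h[n] with its first product
       instead of adding to zero, which is how 144 products need only
       144 - 23 = 121 fadd. */
    void schoolbook12(double h[23],const double f[12],const double g[12])
    {
      int a, b;
      for (a = 0;a < 23;++a) h[a] = 0;
      for (a = 0;a < 12;++a)
        for (b = 0;b < 12;++b)
          h[a + b] += f[a] * g[b];
    }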
Multiplication, Karatsuba's method: 114 fadd + 142 fmul. I'm going to use ``Karatsuba's method'' to compute the same quantities h_0, h_22, ..., h_468 more efficiently. Warning: There are several different polynomial-product and integer-product formulas called ``Karatsuba's method,'' and some of them (including the formulas in Karatsuba's original 1963 paper) are slightly more expensive than the formulas shown here. There's no standard terminology for the different formulas.
Anyway, define F_0 = f_0 x^0 + f_22 x^1 + ... + f_107 x^5, F_1 = f_128 x^0 + f_149 x^1 + ... + f_234 x^5, G_0 = g_0 x^0 + g_22 x^1 + ... + g_107 x^5, and G_1 = g_128 x^0 + g_149 x^1 + ... + g_234 x^5. The goal is to compute the polynomial product (F_0 + x^6 F_1)(G_0 + x^6 G_1).
I'll compute (F_0 + x^6 F_1)(G_0 + x^6 G_1) as K + 2^127 x^6 ((F_0 + 2^(-127) F_1)(G_0 + 2^(-127) G_1) - K) where K = F_0 G_0 - 2^(-127) x^6 F_1 G_1. Observe that there are only three half-size polynomial products here, rather than the four that appear in the obvious formula F_0 G_0 + x^6 (F_0 G_1 + F_1 G_0) + x^12 F_1 G_1; so I'll end up with only 3 6^2 = 108 coefficient multiplications, rather than 4 6^2 = 144.
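Before trusting this rearrangement I like to check the identity mechanically. Here's a self-contained C sketch that verifies the three-product formula against the four-product formula on random 6-coefficient polynomials; to keep every floating-point step exact, this test uses the scale c = 2^(-10) with 10-bit coefficients instead of 2^(-127), an assumption of the test only, not of the real code:

    #include <stdio.h>
    #include <stdlib.h>

    static void mul6(double h[11],const double f[6],const double g[6])
    {
      int a, b;
      for (a = 0;a < 11;++a) h[a] = 0;
      for (a = 0;a < 6;++a)
        for (b = 0;b < 6;++b)
          h[a + b] += f[a] * g[b];
    }

    int main(void)
    {
      double c = 1.0 / 1024;
      double f[12], g[12], fs[6], gs[6];
      double h[23] = {0}, R[23] = {0}, K[17] = {0};
      double F0G0[11], F1G1[11], M[11];
      int a, b, i;
      for (i = 0;i < 12;++i) { f[i] = random() % 1024; g[i] = random() % 1024; }
      for (a = 0;a < 12;++a)            /* four-product (schoolbook) answer */
        for (b = 0;b < 12;++b)
          h[a + b] += f[a] * g[b];
      mul6(F0G0,f,g);                   /* F0 G0 */
      mul6(F1G1,f + 6,g + 6);           /* F1 G1 */
      for (i = 0;i < 6;++i) { fs[i] = f[i] + c * f[i + 6]; gs[i] = g[i] + c * g[i + 6]; }
      mul6(M,fs,gs);                    /* (F0 + c F1)(G0 + c G1) */
      for (i = 0;i < 11;++i) K[i] += F0G0[i];
      for (i = 0;i < 11;++i) K[i + 6] -= c * F1G1[i]; /* K = F0 G0 - c x^6 F1 G1 */
      for (i = 0;i < 17;++i) R[i] = K[i];
      for (i = 0;i < 17;++i)            /* add c^(-1) x^6 (M - K) */
        R[i + 6] += ((i < 11 ? M[i] : 0) - K[i]) / c;
      for (i = 0;i < 23;++i)
        if (R[i] != h[i]) { printf("mismatch at %d\n",i); return 1; }
      printf("identity verified\n");
      return 0;
    }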
Here are the intermediate quantities computed. The coefficients of F_0 + 2^(-127) F_1 are f_0 + 2^(-127) f_128, f_22 + 2^(-127) f_149, f_43 + 2^(-127) f_170, f_64 + 2^(-127) f_192, f_85 + 2^(-127) f_213, and f_107 + 2^(-127) f_234; the coefficients of G_0 + 2^(-127) G_1 are formed in the same way from the g's.
How does a constant such as 2^127 end up in a floating-point register? The standard answer has three steps. First, create an array of constants in memory, for example with the following C code:
    const double curve25519_constants[2] = {
      170141183460469231731687303715884105728.0 /* 2^127 */
    , 0.0000000000000000000000000000000000000058774717541114375398436826861112283890933277838604376075437585313920862972736358642578125 /* 2^-127 */
    } ;

Second, copy the address of the array into a register, for example with the following standard sequence of UltraSPARC instructions:
    int64 constants
    int64 constants_low
    constants = (curve25519_constants & 0xfffffc0000000000) >> 32
    constants_low = curve25519_constants & 0xfffffc00
    constants |= (curve25519_constants & 0x3ff00000000) >> 32
    constants_low |= curve25519_constants & 0x3ff
    constants <<= 32
    constants |= constants_low

Third, load the constants from memory:
    float64 two127
    float64 twom127
    two127 = *(float64 *) (constants + 0)
    twom127 = *(float64 *) (constants + 8)
I wrote a straightforward curve25519_mul.q and immediately ran into errors from the assembler: I was trying to use more than 32 floating-point registers at once, keeping all 12 f's, all 12 g's, all 23 h's, etc. live through the entire computation. So I juggled the code to kill values earlier: for example, I computed h0, h22, h43, h64, h85, h107, stored them, and subtracted them from w0, w22, w43, w64, w85, w107, to free up the registers that they were using. I also reloaded the f's and g's when they were needed, rather than keeping them continuously live.
Here's the resulting software:
    wget https://cr.yp.to/highspeed/fall2006/curve25519_mul.q
    wget https://cr.yp.to/highspeed/fall2006/curve25519_multest.c
    ./qhasm-sparc < curve25519_mul.q > curve25519_mul.s
    gcc -m64 -o curve25519_multest curve25519_multest.c curve25519_mul.s
    ./curve25519_multest | gp
Squaring, schoolbook method: 55 fadd + 89 fmul. In Curve25519, many of the multiplications modulo 2^255-19 are actually squarings: f_i = g_i for each i. This means that, for example, the equation h_107 = f_0 g_107 + f_22 g_85 + f_43 g_64 + f_64 g_43 + f_85 g_22 + f_107 g_0 can be replaced by the equation h_107 = 2 (f_0 f_107 + f_22 f_85 + f_43 f_64), replacing 6 multiplications and 5 additions with 4 multiplications and 2 additions.
(I'm assuming here that doublings are carried out as multiplications by the constant 2. If floating-point multiplications turn out to be a bottleneck later then these multiplications can be replaced by additions.)
The equation h_107 = (2 f_0) f_107 + (2 f_22) f_85 + (2 f_43) f_64 is slightly better, costing only 3 multiplications and 2 additions if 2f_0, 2f_22, 2f_43 are precomputed. It's easy to see that the overall benefit of these precomputations outweighs the overall cost: there are only 11 doublings (of f_0, f_22, ..., f_213) rather than 21 doublings (in h_22, h_43, ..., h_447).
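Here's a loop-style C sketch of this squaring strategy, with the 11 precomputed doublings; again an illustration (positions rather than exponents), not the course's straight-line curve25519_sq.q:

    void sqschoolbook12(double h[23],const double f[12])
    {
      int a, b;
      double f2[11];
      for (a = 0;a < 11;++a) f2[a] = 2 * f[a]; /* 11 doublings of f_0,...,f_213 */
      for (a = 0;a < 23;++a) h[a] = 0;
      for (a = 0;a < 12;++a)
        h[a + a] += f[a] * f[a];               /* 12 squares */
      for (a = 0;a < 12;++a)
        for (b = a + 1;b < 12;++b)
          h[a + b] += f2[a] * f[b];            /* 66 products of distinct pieces */
    }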
What happens when these speedups are applied to schoolbook multiplication? There are 144 fmul, of which 132 are f_i g_j for distinct i,j, of which 132/2 = 66 are eliminated at the expense of 11 doublings. The remaining 144 - 66 = 78 fmul results require only 78 - 23 = 55 fadds to be compressed into 23 outputs. Overall cost: 55 fadd + 78 fmul + 11 fmul = 55 fadd + 89 fmul. Software:
    wget https://cr.yp.to/highspeed/fall2006/curve25519_sq.q
    wget https://cr.yp.to/highspeed/fall2006/curve25519_sqtest.c
    ./qhasm-sparc < curve25519_sq.q > curve25519_sq.s
    gcc -m64 -o curve25519_sqtest curve25519_sqtest.c curve25519_sq.s
    ./curve25519_sqtest | gp
Can one do better? Not to my knowledge. The lowest cost I've been able to achieve for Karatsuba squaring is 63 fadd + 106 fmul. Squaring saves considerably more in the schoolbook method than it does in Karatsuba's method.
Reduction modulo 2^255-19: 11 fadd + 11 fmul. If h = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234 + h_255 + h_277 + h_298 + h_319 + h_340 + h_362 + h_383 + h_404 + h_425 + h_447 + h_468, with h_i a multiple of 2^i, then h is congruent to r_0 + r_22 + r_43 + r_64 + r_85 + r_107 + r_128 + r_149 + r_170 + r_192 + r_213 + r_234 modulo 2^255 - 19, where r_234 = h_234 and r_i = h_i + 19 2^(-255) h_(i+255) for each i in {0, 22, 43, 64, 85, 107, 128, 149, 170, 192, 213}. (For example, 2^277 is congruent to 19 2^22 modulo 2^255 - 19, so h_277 contributes 19 2^(-255) h_277 to r_22.)
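Here's the same reduction as a C sketch, assuming the 23 values h_0, h_22, ..., h_468 are stored in order in h[0..22]; a hypothetical helper for illustration, not curve25519_modp.q:

    /* 19 * 2^(-255); the denominator is 2^255 written out */
    static const double scale =
      19.0 / 57896044618658097711785492504343953926634992332820282019728792003956564819968.0;

    void modp12(double r[12],const double h[23])
    {
      int i;
      r[11] = h[11];                     /* r_234 = h_234 */
      for (i = 0;i < 11;++i)             /* 11 fadd + 11 fmul */
        r[i] = h[i] + scale * h[i + 12]; /* r_i = h_i + 19 2^(-255) h_(i+255) */
    }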
Computing the r_i's from the h_i's takes 11 multiplications by the constant 19 2^(-255) and 11 additions. Software:
    wget https://cr.yp.to/highspeed/fall2006/curve25519_modp.q
    wget https://cr.yp.to/highspeed/fall2006/curve25519_modptest.c
    ./qhasm-sparc < curve25519_modp.q > curve25519_modp.s
    gcc -m64 -o curve25519_modptest curve25519_modptest.c \
      curve25519_modp.s curve25519_mul.s
    ./curve25519_modptest | gp
Carrying and reduction modulo 2^255-19: 63 fadd + 11 fmul. ``Carrying r_0 to r_22'' means subtracting the closest multiple of 2^22 from r_0 and adding it to r_22. This doesn't change the sum r_0 + r_22, but it guarantees that r_0 is in the range {-2^21,...,-1,0,1,...,2^21}.
Similarly, ``carrying r_22 to r_43'' means subtracting the closest multiple of 2^43 from r_22 and adding it to r_43. This doesn't change the sum r_22 + r_43, but it guarantees that 2^(-22) r_22 is in the range {-2^20,...,-1,0,1,...,2^20}. By carrying r_0 to r_22, then carrying r_22 to r_43, then carrying r_43 to r_64, and so on, we guarantee that each 2^(-i) r_i is in a narrow range, making the r's suitable inputs for subsequent multiplications.
A carry takes four floating-point additions:
    closest43 = r22 + alpha43 # rounded
    closest43 -= alpha43      # exact
    r22 -= closest43          # exact
    r43 += closest43          # exact

Here alpha_43 is the constant 2^43 (2^52 + 2^51). The point is that floating-point numbers around alpha_43 are exactly the multiples of 2^43: ..., 2^43 (2^52 + 2^51 - 1), 2^43 (2^52 + 2^51), 2^43 (2^52 + 2^51 + 1), ... The sum r_22 + alpha_43 is thus rounded to the closest multiple of 2^43; subtracting alpha_43 produces the multiple of 2^43 closest to r_22, exactly as desired.
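Here's a tiny self-contained C demonstration of this rounding trick (my sketch; it assumes IEEE-754 doubles in round-to-nearest mode, which is what the UltraSPARC uses):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
      double alpha43 = ldexp(3.0,94);             /* 2^43 (2^52 + 2^51) */
      double r22 = ldexp(5.0,42) + ldexp(1.0,22); /* a multiple of 2^22 */
      double r43 = ldexp(7.0,43);
      double closest43 = r22 + alpha43; /* rounded to a multiple of 2^43 */
      closest43 -= alpha43;             /* exact */
      r22 -= closest43;                 /* exact: now |r22| <= 2^42 */
      r43 += closest43;                 /* exact: the sum r22 + r43 is unchanged */
      printf("%.0f %.0f %.0f\n",closest43,r22,r43);
      return 0;
    }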
What happens to the carry from r_234? The easy answer is to carry from r_234 to r_0, reducing the carry modulo 2^255-19:
    closest255 = r234 + alpha255
    closest255 -= alpha255
    r234 -= closest255
    closest255 *= scale
    r0 += closest255

where scale is 19 2^(-255).
This might make r_0 too big again. If r_234 starts with 50 bits, for example, then closest_255 can be as large as 2^284; so 19 2^(-255) 2^284, more than 2^33, is added to r_0. Another carry from r_0 to r_22 is required. But the carries can stop after that: a chain of 13 carries from r_0 to r_22 to r_43 to ... to r_234 to r_0 to r_22 guarantees that all r_i's are reasonably small.
One can eliminate the extra scale multiplication by carefully interleaving the carrying into the original reduction modulo 2^255-19. Here's one reasonable carry/reduce strategy, starting from h = h_0 + h_22 + h_43 + h_64 + h_85 + h_107 + h_128 + h_149 + h_170 + h_192 + h_213 + h_234 + h_255 + h_277 + h_298 + h_319 + h_340 + h_362 + h_383 + h_404 + h_425 + h_447 + h_468:
Software:
    wget https://cr.yp.to/highspeed/fall2006/curve25519_carry.q
    wget https://cr.yp.to/highspeed/fall2006/curve25519_carrytest.c
    ./qhasm-sparc < curve25519_carry.q > curve25519_carry.s
    gcc -m64 -o curve25519_carrytest curve25519_carrytest.c \
      curve25519_carry.s curve25519_mul.s curve25519_sq.s
    ./curve25519_carrytest | gp
Multiplication, carrying, reduction: 177 fadd + 153 fmul. A complete integer multiplication modulo 2^255-19, producing small results suitable for subsequent multiplications, costs 114 fadd + 142 fmul for the polynomial multiplication, and an additional 63 fadd + 11 fmul for carrying and reduction.
Squaring, carrying, reduction: 118 fadd + 100 fmul. A complete integer squaring modulo 2^255-19, producing small results suitable for subsequent multiplications, costs 55 fadd + 89 fmul for the polynomial squaring, and an additional 63 fadd + 11 fmul for carrying and reduction.
What's coming up. I'll need a few more functions to build a complete Curve25519 computation:
Here's how to download and compile the latest qhasm snapshot:

    wget https://cr.yp.to/qhasm/qhasm-20061105.tar.gz
    gunzip < qhasm-20061105.tar.gz | tar -xf -
    cd qhasm-20061105
    ./do

Now back to the Curve25519 operations.
Small-constant multiplication, carrying, reduction: 48 fadd + 13 fmul. As we'll see later, the Curve25519 main loop has 8 additions/subtractions and 10 multiplications. Half of the 10 multiplications are special: 4 of them are squarings, and 1 of them is a multiplication by a particular integer, namely 121665.
Multiplying by a 17-bit integer is of course easier than multiplying by a 255-bit integer: one can simply multiply each of the 12 floating-point inputs by 121665, costing 0 fadd + 12 fmul. But the results need 12 carries, costing 48 fadd + 1 fmul, so the overall benefit is not as dramatic as one might initially expect.
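Here's the pre-carry part as a C sketch (a hypothetical helper for illustration; the real work in curve25519_121665.q is the carrying that follows):

    /* 12 fmul: multiply each piece by the constant 121665.
       The 12 results then need 12 carries: 48 fadd + 1 fmul. */
    void mul121665(double h[12],const double f[12])
    {
      int i;
      for (i = 0;i < 12;++i) h[i] = 121665.0 * f[i];
    }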
Software:
    wget https://cr.yp.to/highspeed/fall2006/curve25519_121665.q
    wget https://cr.yp.to/highspeed/fall2006/curve25519_121665test.c
    ./qhasm-sparc < curve25519_121665.q > curve25519_121665.s
    gcc -m64 -o curve25519_121665test curve25519_121665test.c curve25519_121665.s
    ./curve25519_121665test | gp
Selection: 0 fadd + 0 fmul. There's one additional operation in the Curve25519 main loop: given four 255-bit integers x[0], z[0], x[1], z[1] and a secret bit b, compute x[b], z[b] and x[1-b], z[1-b].
Here are three strategies for performing this operation:
I'll try the third strategy using integer bit operations:
    r3 = *(uint64 *) (r + 24)
    s3 = *(uint64 *) (s + 24)
    s3 ^= r3
    q3 = s3 & bminus1
    q3 ^= r3
    p3 = q3 ^ s3
    *(uint64 *) (p + 24) = p3
    *(uint64 *) (q + 24) = q3

A similar sequence of 3 fadd + 1 fmul, or alternatively 2 fadd + 4 fmul, would have the virtue of working with floating-point registers, but I'm guessing that the loads and stores to move to integer registers won't be a problem, and I don't want to consume precious fadd resources.
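Here's the same branch-free selection written as a C sketch over an array of 64-bit words, assuming b is 0 or 1 (so bminus1 = b - 1 is an all-zeros or all-ones mask); a hypothetical helper, not curve25519_select.q itself:

    #include <stdint.h>

    /* If b = 1: p gets s, q gets r. If b = 0: p gets r, q gets s.
       No branches, so the timing doesn't leak the secret bit b. */
    void select_words(uint64_t *p,uint64_t *q,
                      const uint64_t *r,const uint64_t *s,
                      uint64_t b,int words)
    {
      uint64_t bminus1 = b - 1;
      int i;
      for (i = 0;i < words;++i) {
        uint64_t t = s[i] ^ r[i];
        uint64_t qi = (t & bminus1) ^ r[i];
        p[i] = qi ^ t;
        q[i] = qi;
      }
    }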
Software:
    wget https://cr.yp.to/highspeed/fall2006/curve25519_select.q
    wget https://cr.yp.to/highspeed/fall2006/curve25519_selecttest.c
    ./qhasm-sparc < curve25519_select.q > curve25519_select.s
    gcc -m64 -o curve25519_selecttest curve25519_selecttest.c curve25519_select.s
    ./curve25519_selecttest | gp
Conversion from 32-byte string: 44 fadd + 0 fmul. Say f is in {-2^51,...,-1,0,1,...,2^51-1}. Here's how to take f out of an UltraSPARC integer register and put it into an UltraSPARC floating-point register:
It's just as easy to take f out of an UltraSPARC integer register and put a scaled version of f, let's say 2^32 f, into an UltraSPARC floating-point register:
What if f is in memory rather than in an integer register? To be concrete, let's say bytes (x[0],x[1],x[2],x[3],x[4],x[5]) in memory represent an integer f in {0,1,...,2^48-1} in the usual little-endian way: f = x[0] + 2^8 x[1] + 2^16 x[2] + 2^24 x[3] + 2^32 x[4] + 2^40 x[5]. Here's how to put the same integer f into a floating-point register. First load the bytes into registers (with separate byte loads; loading *(uint64 *) (x+0) doesn't work on the UltraSPARC unless x is 0 modulo 8):
    int64 f0
    int64 f1
    int64 f2
    int64 f3
    int64 f4
    int64 f5
    f0 = *(uint8 *) (x + 0)
    f1 = *(uint8 *) (x + 1)
    f2 = *(uint8 *) (x + 2)
    f3 = *(uint8 *) (x + 3)
    f4 = *(uint8 *) (x + 4)
    f5 = *(uint8 *) (x + 5)

Shift and add to compute x0 = 1075 2^52 + 2^51 + f = 2151 2^51 + f:
    f1 <<= 8
    f2 <<= 16
    f3 <<= 24
    f4 <<= 32
    f5 <<= 40
    int64 x0
    x0 = 2151
    x0 <<= 51
    x0 += f0
    x0 += f1
    x0 += f2
    x0 += f3
    x0 += f4
    x0 += f5

Set aside 8 bytes of temporary storage in memory, and store x0 there:
    stack64 y0
    y0 = x0

Load the 8 bytes into a floating-point register, and finally subtract 2^52 + 2^51:
    float64 z0
    z0 = y0
    z0 -= alpha0
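The same trick is easy to express in C. Here's a self-contained sketch (mine; it assumes IEEE-754 doubles and that integers and doubles share the same byte order, both true on the UltraSPARC):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
      int64_t f = -123456789;                    /* any f in {-2^51,...,2^51-1} */
      int64_t x0 = (((int64_t) 2151) << 51) + f; /* 1075 2^52 + 2^51 + f */
      double z0;
      double alpha0 = 6755399441055744.0;        /* 2^52 + 2^51 */
      memcpy(&z0,&x0,8); /* reinterpret the 8 bytes: z0 = 2^52 + 2^51 + f */
      z0 -= alpha0;      /* exact subtraction recovers f */
      printf("%.0f\n",z0); /* prints -123456789 */
      return 0;
    }

The point is that the bit pattern of the integer 1075 2^52 + 2^51 + f is exactly the IEEE-754 encoding of the double 2^52 + 2^51 + f: the 1075 lands in the exponent field, and 2^51 + f lands in the mantissa field.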
One of the Curve25519 inputs is a 255-bit integer stored in 32 bytes of memory. I convert it to the 12-piece floating-point representation described above:
    wget https://cr.yp.to/highspeed/fall2006/curve25519_todouble.q
    wget https://cr.yp.to/highspeed/fall2006/curve25519_todoubletest.c
    ./qhasm-sparc < curve25519_todouble.q > curve25519_todouble.s
    gcc -m64 -o curve25519_todoubletest curve25519_todoubletest.c curve25519_todouble.s
    ./curve25519_todoubletest | gp
One can save time here by using more integer operations, first to partition the input bits more precisely and second to create the integer representing f rather than 2^52 + 2^51 + f. But I won't bother: conversion to double is a very small part of the Curve25519 time and doesn't warrant any extra programming effort.
Conversion to 32-byte string: 12 fadd + 0 fmul. The output of Curve25519 is an integer in {0,1,...,2^255-20}, represented as a 32-byte string in the same way as the input. This isn't what's naturally produced by arithmetic: the carried result r can be negative, and even a nonnegative r is spread across floating-point registers rather than packed into bytes.
Here's a reasonably straightforward freezing strategy. Let's say that the final carrying and reduction produce r = r_0 + r_22 + r_43 + r_64 + r_85 + r_107 + r_128 + r_149 + r_170 + r_192 + r_213 + r_234. Carrying has put each r_i into a fairly limited range:
A series of shifts and adds now converts s_0, 2^(-16) s_16, ..., 2^(-232) s_232 to integers t_0, t_32, t_64, t_96, t_128, t_160, t_192, t_224 in {0,1,...,2^32-1} with 2^255-19 + r = t_0 + 2^32 t_32 + ... + 2^224 t_224. To see whether 2^255-19 + r is below 2^255-19, I'll add 19 and inspect the 255th bit, taking advantage of the UltraSPARC's 32-bit-carry instructions:
    carry32? u0 = t0 + 19
    carry32? u32 = t32 + 0 + carry32
    carry32? u64 = t64 + 0 + carry32
    carry32? u96 = t96 + 0 + carry32
    carry32? u128 = t128 + 0 + carry32
    carry32? u160 = t160 + 0 + carry32
    carry32? u192 = t192 + 0 + carry32
    carry32? u224 = t224 + 0 + carry32
    wantt = (uint32) u224 >> 31
    wantt -= 1
    wantu = ~wantt

Now it's easy to select either r or 2^255-19 + r as the final Curve25519 output, as in the sketch below.
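Here's the final selection as a C sketch (a hypothetical helper, not curve25519_freeze.q itself): t[0..7] are the 32-bit words t0, t32, ..., t224 holding 2^255-19 + r, u[0..7] are u0, u32, ..., u224, and wantt, wantu are the masks computed above:

    #include <stdint.h>

    void freeze_select(uint32_t out[8],const uint32_t t[8],const uint32_t u[8],
                       uint32_t wantt,uint32_t wantu)
    {
      int i;
      for (i = 0;i < 8;++i)
        out[i] = (t[i] & wantt) | (u[i] & wantu); /* branch-free choice */
      out[7] &= 0x7fffffff; /* drop bit 255; what's left is the canonical result */
    }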
Here's the software:

    wget https://cr.yp.to/highspeed/fall2006/curve25519_freeze.q
    wget https://cr.yp.to/highspeed/fall2006/curve25519_freezetest.c
    ./qhasm-sparc < curve25519_freeze.q > curve25519_freeze.s
    gcc -m64 -o curve25519_freezetest curve25519_freezetest.c \
      curve25519_freeze.s curve25519_carry.s curve25519_mul.s curve25519_sq.s \
      curve25519_todouble.s
    ./curve25519_freezetest | gp
The complete Curve25519 computation: 414907 fadd + 327626 fmul. The Curve25519 computation starts with two 32-byte inputs. It converts the inputs to the floating-point representation described above, runs the main loop shown below, divides x[n] by z[n], and freezes the result into a 32-byte output.
The most important part of the computation is the main loop:
    for (pos = 254;pos >= 0;--pos) {
      b = e[pos / 8] >> (pos & 7);
      b &= 1;
      curve25519_select(xzmb,xzm1b,xzm,xzm1,b);
      curve25519_add(a0,xzmb,xzmb + 12);
      curve25519_sub(a0 + 12,xzmb,xzmb + 12);
      curve25519_add(a1,xzm1b,xzm1b + 12);
      curve25519_sub(a1 + 12,xzm1b,xzm1b + 12);
      curve25519_sqcarry(b0,a0);
      curve25519_sqcarry(b0 + 12,a0 + 12);
      curve25519_mulcarry(b1,a1,a0 + 12);
      curve25519_mulcarry(b1 + 12,a1 + 12,a0);
      curve25519_add(c1,b1,b1 + 12);
      curve25519_sub(c1 + 12,b1,b1 + 12);
      curve25519_sqcarry(r,c1 + 12);
      curve25519_sub(s,b0,b0 + 12);
      curve25519_121665(t,s);
      curve25519_add(u,t,b0);
      curve25519_mulcarry(xznb,b0,b0 + 12);
      curve25519_mulcarry(xznb + 12,s,u);
      curve25519_sqcarry(xzn1b,c1);
      curve25519_mulcarry(xzn1b + 12,r,x1);
      curve25519_select(xzm,xzm1,xznb,xzn1b,b);
    }

Each of the 255 iterations involves two selections (which can easily be merged across loops into one selection), four additions, four subtractions, one multiplication by 121665 (with carry-reduce), four squarings (with carry-reduce), and five more multiplications (with carry-reduce). Here are the tallies of floating-point operations:
operation | per operation | per iteration | per 255 iterations |
---|---|---|---|
select | 0 fadd + 0 fmul | 0 fadd + 0 fmul | 0 fadd + 0 fmul |
add | 12 fadd + 0 fmul | 48 fadd + 0 fmul | 12240 fadd + 0 fmul |
sub | 12 fadd + 0 fmul | 48 fadd + 0 fmul | 12240 fadd + 0 fmul |
121665 | 48 fadd + 13 fmul | 48 fadd + 13 fmul | 12240 fadd + 3315 fmul |
sqcarry | 118 fadd + 100 fmul | 472 fadd + 400 fmul | 120360 fadd + 102000 fmul |
mulcarry | 177 fadd + 153 fmul | 885 fadd + 765 fmul | 225675 fadd + 195075 fmul |
total | | 1501 fadd + 1178 fmul | 382755 fadd + 300390 fmul |
The second most important part of the computation is the division. I compute the (p-2)nd = (2^255-21)th power of z[n], and multiply by x[n], with a straightforward sequence of 254 squarings and 12 extra multiplications, costing 32096 fadd + 27236 fmul. There are other ways to compute the reciprocal of z[n], such as Euclid's algorithm and extended binary-gcd algorithms, but these algorithms (especially when modified to avoid leaking information through timing) would take considerable extra work to implement and would have at most a small impact on the Curve25519 speed, so I won't bother with them this month.
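For comparison, here's what a completely naive Fermat inversion looks like as a C sketch, processing the bits of p-2 = 2^255-21 from the top (bits 2 and 4 are the only zeros); it uses 254 squarings but 252 multiplications, so the addition chain in my software, with only 12 extra multiplications, is much better. The extern declarations assume the helpers have the C-callable shape suggested by the main loop above:

    extern void curve25519_sqcarry(double out[12],const double in[12]);
    extern void curve25519_mulcarry(double out[12],
                                    const double a[12],const double b[12]);

    /* Naive square-and-multiply sketch; assumes the helpers accept
       distinct buffers, so no aliasing below. */
    void recip_sketch(double out[12],const double z[12])
    {
      double t[12], u[12];
      int i, j;
      for (j = 0;j < 12;++j) t[j] = z[j]; /* exponent bit 254 */
      for (i = 253;i >= 0;--i) {
        curve25519_sqcarry(u,t);          /* 254 squarings total */
        if (i != 2 && i != 4)
          curve25519_mulcarry(t,u,z);     /* exponent bit i is 1 */
        else
          for (j = 0;j < 12;++j) t[j] = u[j];
      }
      for (j = 0;j < 12;++j) out[j] = t[j];
    }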
The remaining operations are conversion to and from floating point, costing 56 fadd + 0 fmul.
Total operation count (not computer-verified at this point): 414907 fadd + 327626 fmul. Evidently these instructions will take at least 414907 UltraSPARC cycles.
Here's the software:
    wget https://cr.yp.to/highspeed/fall2006/curve25519_complete.c
    wget https://cr.yp.to/highspeed/fall2006/curve25519_completetest.c
    gcc -m64 -o curve25519_completetest curve25519_completetest.c \
      curve25519_complete.c curve25519_carry.s curve25519_mul.s curve25519_sq.s \
      curve25519_todouble.s curve25519_121665.s curve25519_add.s curve25519_sub.s \
      curve25519_freeze.s curve25519_select.s
    ./curve25519_completetest | gp
Counting cycles. I've timed this Curve25519 implementation using the cpucycles library:
    long long timings[10];
    unsigned char x[32];
    unsigned char y[32];
    unsigned char z[32];
    int i;

    for (i = 0;i < 10;++i) {
      timings[i] = cpucycles();
      curve25519(z,x,y);
    }
    for (i = 0;i < 9;++i)
      printf("%lld ",timings[i + 1] - timings[i]);
    printf("\n");

The output shows that this implementation is taking over 2 million UltraSPARC III cycles:
    2251848 2122560 2060876 2694852 2061201 2060841 2061101 3322788 2065352

There's a big gap, almost a factor of 5, between the 414907-cycle lower bound from fadd throughput and this 2-million-cycle upper bound. That's a gigantic gap compared to, e.g., my NIST P-224 elliptic-curve-Diffie-Hellman software from 2001, which used about 678000 UltraSPARC II cycles for 523578 floating-point additions, or my more recent Poly1305 software, which uses 68 UltraSPARC III cycles for 68 floating-point additions in the main loop.
Why is the software failing so miserably at keeping the adder busy? Here's an illustrative 3-fadd excerpt from curve25519_mul.q:
    h64 = f0 * g64     # starts on cycle 0
    f22g43 = f22 * g43 # starts on cycle 1 (delayed by fmul throughput)
    h64 += f22g43      # starts on cycle 5 (delayed by fmul latency)
    f43g22 = f43 * g22 # starts on cycle 5
    h64 += f43g22      # starts on cycle 9 (delayed by fmul latency)
    f64g0 = f64 * g0   # starts on cycle 9
    h64 += f64g0       # starts on cycle 13

I didn't have to add these 4 quantities serially. I could instead add them in a tree:
    h64 = f0 * g64     # starts on cycle 0
    f22g43 = f22 * g43 # starts on cycle 1 (delayed by fmul throughput)
    f43g22 = f43 * g22 # starts on cycle 2 (delayed by fmul throughput)
    f64g0 = f64 * g0   # starts on cycle 3 (delayed by fmul throughput)
    h64 += f22g43      # starts on cycle 5 (delayed by fmul latency)
    f43g22 += f64g0    # starts on cycle 7 (delayed by fmul latency)
    h64 += f43g22      # starts on cycle 11 (delayed by fadd latency)

Even better, I could interleave these operations with subsequent additions in the software. There's tremendous parallelism in the Curve25519 computation graph, with far more than 4 independent additions on each level of the graph; I simply have to make this parallelism visible to the CPU.
To be continued.
There are many other ideas in the authentication literature, but most of them are slowdowns! See Section 10 of my hash127 paper for a survey (dating from 2000) and tons of references. The main advance in the field since 2000 is a new focus on minimizing precomputation (not just having r short but also using it directly in the main computation), as I emphasized in the lecture.
Today's lecture topics:
Here's what I said about cycles/byte to authenticate a 1024-byte message, including a slight slowdown from computing AES:
CPU | Poly1305-AES | UMAC-P4-128-AES |
---|---|---|
PowerPC G4 7410 | 8.27 | 21.72 |
UltraSPARC III | 5.47 | 51.06 |
Pentium 4 | 5.33 | 3.12 |
Pentium M | 4.50 | 8.48 |
Athlon | 3.75 | 7.38 |
As for Salsa20, the best source of information is my Salsa20 page. The leading source of information about stream ciphers generally is eSTREAM (the ECRYPT Stream Cipher Project), particularly eSTREAM's collection of design papers and cryptanalysis papers; I have a web page with links and further comments.
Week | Day | Date | Time |
---|---|---|---|
3 | Thu | 2006.11.09 | 11:00-12:30, 15:30-17:00 |
3 | Fri | 2006.11.10 | 11:00-12:30 |
4 | Mon | 2006.11.13 | no lectures (students work on course projects) |
4 | Tue | 2006.11.14 | no lectures (students work on course projects) |
4 | Wed | 2006.11.15 | no lectures (students work on course projects) |
4 | Thu | 2006.11.16 | no lectures (students work on course projects) |
4 | Fri | 2006.11.17 | 11:00-12:30, 15:30-17:00 (project reports and discussion) |
Today's lecture topics:
Here's a challenge to cryptographers. Right now, attackers all over the Internet can easily forge packets that appear to come from Google; the challenge is to stop these forgeries, guaranteeing that the Google results on the users' screens really are from Google. Perhaps the most obvious approach (to the cryptographic side of the problem; we also need to fix browser buffer overflows and other relevant software bugs) is to have Google add a signature to each Google-results page. This costs time for Google to sign each page and time for each user's browser to verify the signature.
Here's how to download and compile the latest qhasm snapshot:

    wget https://cr.yp.to/qhasm/qhasm-20061116.tar.gz
    gunzip < qhasm-20061116.tar.gz | tar -xf -
    cd qhasm-20061116
    ./do

This version still needs lots of testing, but it can at least handle an AMD64 XMM version of the Salsa20 software, setting new speed records for Salsa20 on the Intel Core 2.