In particular, the June 2002 Lenstra-Shamir-Tomlinson-Tromer paper is completely incorrect when it claims that the use of mesh routing in this context is an ``improvement'' over my grant proposal by ``several orders of magnitude.''
The use of mesh routing in this context is one of many obvious possibilities. In March 2001, in my first ``NFS sieving circuit'' talk, I used mesh routing:
    #1      #2      #3
    (2,1)   (2,2)   (2,3)
    (5,2)   (7,1)   (7,2)

    #4      #5      #6
    (2,4)   (2,5)   (3,1)

    #7      #8      #9
    (3,2)   (3,3)   (5,1)

    For each (p,i), processor generates ith multiple s of p in {n+1,n+2,...,n+y}, if there is one, and sends (p,s) to #(s-n) through the mesh.
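For concreteness, here is a minimal Python sketch of that rule, using the cell contents pictured above. The interval endpoint (n = 97, so the interval is {98,...,106} and every listed (p,i) does have a multiple there) and the hop counting are illustrative details that were not on the slide.

    # Mesh routing of sieve hits, as on the slide: cell #k holds some (p,i)
    # pairs, generates the ith multiple s of p in {n+1,...,n+y}, and sends
    # (p,s) across the 3x3 mesh to cell #(s-n), which owns sieve location s.
    n, y = 97, 9                                  # interval {98,...,106}
    cells = {1: [(2,1),(5,2)], 2: [(2,2),(7,1)], 3: [(2,3),(7,2)],
             4: [(2,4)],       5: [(2,5)],       6: [(3,1)],
             7: [(3,2)],       8: [(3,3)],       9: [(5,1)]}

    def ith_multiple(p, i):
        """The ith multiple of p in {n+1,...,n+y}, or None if there is none."""
        s = (n // p + i) * p
        return s if s <= n + y else None

    def hops(a, b, width=3):
        """Row steps plus column steps between cells #a and #b on the mesh."""
        (ra, ca), (rb, cb) = divmod(a - 1, width), divmod(b - 1, width)
        return abs(ra - rb) + abs(ca - cb)

    for cell, pairs in cells.items():
        for p, i in pairs:
            s = ith_multiple(p, i)
            if s is not None:
                print(f"#{cell} sends ({p},{s}) to #{s - n}, {hops(cell, s - n)} hops")

Each (p,s) reaches the cell responsible for sieve location s within a few hops; on a larger mesh the same rule applies, with correspondingly more hops.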
In my October 2001 grant proposal, I used mesh sorting for my sample circuit, but on page 7 I specifically mentioned the literature on ``mesh routing and mesh sorting.'' Furthermore, in an April 2002 email message to Arjen Lenstra, as part of explaining why conventional implementations aren't cost-effective, I stated details of a counterclockwise mesh-routing algorithm for NFS sieving and NFS linear algebra:

Say you have a lot of money and a 1024-bit number to factor. Traditional sieving and linear algebra will want hundreds of gigabytes of RAM for each CPU. That's maybe $100000 of DRAM+interconnections sitting around waiting for a $50 CPU. (Never mind the fact that this isn't the kind of machine you can actually buy right now.)

Replacement, with the same amount of memory: a 20x20 array of $750 machines, each with $250 of DRAM, a $50 CPU, and four Gigabit Ethernet connections to its neighbors. (These _are_ machines that you can buy right now.) Take a batch of, say, 1000 hits generated by each processor; route the hits 19 steps down, then 19 steps right, then 19 steps up, then 19 steps left, to reach the proper memory locations.

Bottom line: The time for 20^2 x 1000 sieve hits has been replaced by the time for 1000 sieve hits plus 4(20-1) x 1000 Ethernet transmissions of (say) 64-bit chunks, at triple the cost. The transmissions occupy the wires for about 5 milliseconds; that should also be ample time for the CPUs to handle sending, routing, and receiving. (There are many ways to improve this, obviously.) Do you have a machine that can do a DRAM hit in under 38 nanoseconds?

The point isn't that these particular choices are a speedup. The point is that the time ratio grows linearly with the 20, while the cost ratio is simply 1 plus overhead. The overhead can be dramatically reduced with special-purpose circuitry, allowing the 20 to be raised by half as many orders of magnitude. (And that's without tossing in a parallel early-abort rho+ECM stage.)

Here's another way to think about the speedup. General-purpose computers are normally bought with a rough balance between CPU costs and DRAM costs. DRAM has been getting faster and larger. CPUs have also been getting faster _and larger_. Where is the extra size in CPUs going? Answer: fast FPUs, huge caches, out-of-order operations on hundreds of registers, etc. A CPU equally useful for factorization could be built at much lower cost---and the cost gap will widen as DRAM continues getting larger. If you assume that a 2x improvement in DRAM speed _and_ size will be accompanied by only a 2x improvement in CPU speed, you're overestimating the cost of the CPU, or underestimating the cost of DRAM.

Bob Silverman was right: memory is not free. Someday hundreds of gigabytes of DRAM will cost $100, and $100 CPUs will have expanded to tens of billions of transistors. What will all those transistors be accomplishing for factorization, or for other serious computational problems, if the entire CPU is bottlenecked sending its results to DRAM?

It's obviously more cost-effective to have millions of processing cells, each with a limited amount of DRAM, each with a few connections to nearby cells. Whether this is called ``building a VLSI circuit'' or ``programming a computational RAM'' or ``designing a parallel algorithm that uses very little memory per processor,'' it's the right model of computation, and not one for which NFS was previously optimized.
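For readers who want to check the arithmetic, here is a rough Python sketch of that four-phase schedule on a 20x20 grid. The address-ownership rule and the random sample of hits are toy choices of mine, and congestion on the wires is ignored; the point is simply that the down/right/up/left phases, at most 19 steps each, deliver every hit, and that the timing quoted in the message holds up.

    import random

    # A rough sketch of the four-phase schedule described above, on an N x N
    # grid of machines (N = 20 in the message).  Ownership of memory addresses
    # and the sample of hits are toy choices; wire congestion is ignored.
    N, HITS_PER_NODE = 20, 1000

    def owner(addr):
        """Machine (row, col) that owns a given memory address (toy rule)."""
        return (addr // N) % N, addr % N

    def route(src, dst):
        """Apply the phases down, right, up, left; return steps per phase."""
        (r, c), (r2, c2) = src, dst
        down = max(r2 - r, 0); r += down        # phase 1: at most N-1 steps down
        right = max(c2 - c, 0); c += right      # phase 2: at most N-1 steps right
        up = max(r - r2, 0); r -= up            # phase 3: at most N-1 steps up
        left = max(c - c2, 0); c -= left        # phase 4: at most N-1 steps left
        assert (r, c) == dst                    # every hit reaches its owner
        return down, right, up, left

    worst = 0
    for r in range(N):
        for c in range(N):
            for _ in range(5):                  # a few random hits per machine
                steps = route((r, c), owner(random.randrange(N * N * 64)))
                worst = max(worst, sum(steps))
    print("worst hops per hit:", worst)         # at most 2(N-1) = 38

    # The message's accounting: 4(N-1) x 1000 transmissions of 64 bits per
    # machine on Gigabit Ethernet, versus 20^2 x 1000 DRAM hits on one big
    # machine costing a third as much.
    wire_seconds = 4 * (N - 1) * HITS_PER_NODE * 64 / 1e9    # about 4.9 ms
    breakeven = 3 * wire_seconds / (N * N * HITS_PER_NODE)   # about 36 ns
    print(f"{wire_seconds*1e3:.1f} ms of wire time; "
          f"break-even DRAM hit time {breakeven*1e9:.0f} ns")

The exact figures are about 4.9 milliseconds of wire time and a break-even DRAM hit time of about 36 nanoseconds; rounding the wire time up to 5 milliseconds gives the 38 nanoseconds quoted above.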
How could Lenstra, Shamir, Tomlinson, and Tromer publish this obvious use of mesh routing in June 2002 as if it were an interesting new result? I realize that they didn't see my March 2001 talk, but they certainly saw my grant proposal and my April 2002 email.
I am annoyed that my grant proposal has been characterized as presenting slower algorithms than it actually does. I informed Lenstra, Shamir, Tomlinson, and Tromer in plain language that their claim of an ``improvement'' was false; I am disappointed that they have refused to properly assign credit. Are they next going to claim that an early-abort combination of rho and ECM is an ``improvement'' over my grant proposal?