This page focuses on speed: specifically, the time for a group exponentiation with a 255-bit exponent. Curve25519 has additional advantages under other cost measures: key size, for example, and timing-attack-protected speed.
The problem is that multiplicative-group discrete logarithms, unlike elliptic-curve discrete logarithms, have been the subject of remarkable progress. In particular, the number-field sieve appears to be a more serious threat than generic discrete-logarithm methods if p is below 2048 bits; the exact cutoff is a matter of current debate, but in any case it's clear that choosing a comfortably large p produces a slowdown outweighing the advantage of performing less work mod p.
One can use elements u,v of F_(sqrt(q)) to represent any element u + v sqrt(d) of F_q. Squaring in F_q in this representation takes 3 squarings in F_(sqrt(q)), and multiplication in F_q in this representation takes 3 or 4 multiplications in F_(sqrt(q)).
One can instead use elements u,v of F_(sqrt(q)) to represent the element (u+v sqrt(d))/(u-v sqrt(d)) of F_q. This is the ``Hilbert Theorem 90 representation''; it works only for (q+1)st roots of 1 in F_q, i.e., the ``twisted multiplicative group'' of F_(sqrt(q)), also known as ``the torus T_2.'' The advantage of this representation is that any frequently used (u,v) with nonzero v can be replaced by (u/v,1), using a single division by v to eliminate every subsequent multiplication by v. On the other hand, the number of squarings in F_(sqrt(q)) still ends up above 3 per exponent bit.
One can also use elements u,v of F_(sqrt(q)), where ((u/v)^2 - 1)/d is a square in F_q, to represent the elements u/v +- sqrt(((u/v)^2 - 1)/d) of F_q. This is the ``trace representation''; it works only for (q+1)st roots of 1 in F_q. The advantage of this representation is that it allows particularly fast computation of power m+n given powers m,n,m-n. An exponentiation algorithm of Montgomery, using differential addition chains (``Lucas chains''), is conjectured to use approximately 0.17 squarings per bit and 1.47 extra multiplications per bit.
None of these ideas, with a comfortably large q, are competitive in speed with Curve25519.
Montgomery's differential-addition-chains exponentiation algorithm was adapted to this representation by Stam and Lenstra, and is conjectured to use approximately 5.2 multiplications in F_(q^(1/6)) for each exponent bit.
This is still slower than Curve25519: a comfortably large q needs over 340 bits in q^(1/6), making 5.2 multiplications in F_(q^(1/6)) slower than Curve25519's 4 squarings and 5 multiplications mod 2^255-19. (I could also speed up Curve25519 slightly using differential addition chains, at the expense of much more difficult timing-attack protection.) Close, certainly, and worth further investigation, but so far XTR isn't setting speed records.