Gura et al. have reported a 160x160 multiplication (without any fancy algorithms) in 1986 cycles on the 8051-compatible ATmega128 with MULACC. The main loop in Poly1305-AES handles 16 bytes of input using a 130x124 multiplication (considerably smaller than 160x160), a reduction modulo 2^130-5 (relatively easy), and a 128-bit addition (easy). So it's reasonable to shoot for 2000 cycles on the ATmega128 (or even smaller with better multiplication algorithms); in other words, 125 cycles/byte.
AES reportedly takes 3168 cycles on the original 8051. Maybe it's faster on newer 8051-compatible chips. I'm not sure whether the 3168 figure includes key scheduling.
Storage: The 16-byte AES key k and 16-byte nonce n can be quickly merged into a 16-byte output AES_k(n). Other storage: 16-byte r; 16-byte message chunk; 1 byte saying how long the chunk is; 17-byte accumulator. All of this fits comfortably into the 8051 RAM, leaving many bytes free for intermediate results.