Fast Implementations of AES Candidates

Kazumaro Aoki, Helger Lipmaa, "Fast Implementations of AES Candidates". Third AES Candidate Conference, New York City, USA, 13--14 April 2000. [ps.gz (38K), pdf (86K)].

Abstract:

Of the five AES finalists, four (MARS, RC6, Rijndael, and Twofish) have not only (expected) good security but also exceptional performance on PC platforms, especially on those featuring the Pentium Pro, the NIST AES analysis platform. In this paper we present new performance numbers for these four ciphers, obtained from our carefully optimized assembly-language implementations on the Pentium II, the successor of the Pentium Pro. All our implementations follow well-defined API and timing conventions and sensible guidelines, such as avoiding self-modifying code and key-specific static data, i.e., tricks that speed up the implementation but at the same time restrict its field of application. Our implementations are up to 26% faster than previous implementations. Our work also shows how a simple change (inclusion of the MMX technology) in the analysis platform can influence the relative encryption speed of different ciphers. To enable everyone to compare their implementations to ours, we also fully specify the procedures we used to obtain the speed numbers.

NB! My implementations are available commercially.

Errata

The uops for MARS backward mixing (Table 2) were miscounted. Instead of 85 port 01 uops, there should be 77. This also means that in total, MARS had 311 port 01 uops (however, the overall total of 572 was correct) and that "more than 78%" (Section 5.1) should be replaced with "about 77%". (added 20/03/00)
The uops/cycle number for RC6 is 1.48, not 1.47. (added 20/03/00)
The cited online manual "How to Optimize for the Pentium microprocessors" by Agner Fog is available from http://www.agner.org/assem/, not from http://www.agner.com/assem/. (Thanks to Eric Young for noticing this) (added 20/03/00)
During the optimization, Aoki used a Pentium II while I used a Mobile Pentium II. We each optimized our implementations specifically for the platform we had, so our implementations may be slightly slower on the other processor. The only (published?) difference between those machines is that the Mobile Pentium II has a 256 KB L2 cache (running at the same speed as the processor), whereas the Pentium II has a 512 KB L2 cache (running at half the processor speed). The difference in our implementations stems from the accidentally chosen block count of 8000, which gives 8000 blocks x 16 bytes/block x 2 (plaintext and ciphertext) = 250 Kbytes, just below the L2 cache size of the Mobile Pentium II. (added 20/05/00)
On page 10, it should of course say that the assembly implementation of Rijndael is 44% faster (not slower!) than the gcc implementation. (Thanks to Meelis Roos for noticing this) (added 04/10/00)
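
The working-set arithmetic in the cache erratum above can be checked with a short sketch. It only redoes the calculation from the erratum (8000 blocks, 16 bytes per block, one plaintext and one ciphertext buffer) and compares the result with the two L2 cache sizes mentioned there. The function and constant names are illustrative and do not come from the paper, and the sketch ignores code, lookup tables and key schedules, which also occupy cache.

```python
# Sanity check of the buffer-size arithmetic from the erratum.
# All names here are illustrative assumptions, not taken from the paper.

KIB = 1024                    # bytes per Kbyte (binary convention)

BLOCK_BYTES = 16              # one 128-bit AES block
NUM_BLOCKS = 8000             # block count quoted in the erratum
BUFFERS = 2                   # plaintext buffer + ciphertext buffer

MOBILE_PII_L2 = 256 * KIB     # Mobile Pentium II L2 cache
PII_L2 = 512 * KIB            # Pentium II L2 cache


def working_set_bytes(num_blocks, block_bytes=BLOCK_BYTES, buffers=BUFFERS):
    """Bytes touched by the timing loop's data buffers only."""
    return num_blocks * block_bytes * buffers


ws = working_set_bytes(NUM_BLOCKS)
print(f"working set: {ws} bytes = {ws / KIB:.0f} Kbytes")
print(f"fits in Mobile Pentium II L2: {ws <= MOBILE_PII_L2}")
print(f"fits in Pentium II L2:        {ws <= PII_L2}")
print(f"headroom on Mobile Pentium II: {(MOBILE_PII_L2 - ws) / KIB:.0f} Kbytes")
```

With these numbers the data buffers alone leave only a few Kbytes of the 256 KB cache free, which is consistent with the remark that the block count lands just below the Mobile Pentium II cache size.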

Updates

The Rijndael implementation numbers have been improved since then. See this link for a paper that has updated values. (And please also refer to that one...) (added 07/22/02)

See the AES: Speed page for related information and possible updates to the implementations in this paper.

Authors: