Fast Implementations of AES Candidates
Kazumaro Aoki, Helger Lipmaa, "Fast Implementations of AES
Candidates". Third
AES Candidate Conference, New York City, USA, 13--14 April 2000.
[ps.gz (38K), pdf
(86K)].
Abstract:
Of the five AES finalists four---MARS, RC6, Rijndael, Twofish---have not
only (expected) good security but also exceptional performance on the PC
platforms, especially on those featuring the Pentium Pro, the NIST AES
analysis platform. In the current paper we present new performance numbers
of the mentioned four ciphers resulting from our carefully optimized
assembly-language implementations on the Pentium II, the successor of the
Pentium Pro. All our implementations follow well-defined API and timing
conventions and sensible guidelines, such as not using self-modifying code
or key-specific static data --- i.e., tricks that speed up the
implementation but at the same time restrict the field of application. Our
implementations are up to 26% faster than previous
implementations. Our work also shows how a simple change (inclusion of the
MMX technology) in the analysis platform can influence the relative
encryption speed of different ciphers. To enable everyone to compare their
implementations to ours, we also fully specify our procedures used to
obtain the speed numbers.
NB! My implementations are available commercially.
Errata
- The uops for MARS backward mixing (Table 2) were counted wrongly.
Instead of 85 port 01 uops, there should be 77. This also means
that in total, MARS had 311 port 01 uops (however, the overall total of
572 was correct) and that ``more that 78%'' (Section 5.1) should
be replaced with ``about 77%''. (added 20/03/00)
- The uops/cycle number for RC6 is 1.48, not 1.47. (added 20/03/00)
- The cited online manual "How to Optimize for the Pentium
microprocessors" by Agner Fog is available from http://www.agner.org/assem/,
not from http://www.agner.com/assem/. (Thanks to Eric Young
for noticing this) (added 20/03/00)
- During the optimization, Aoki used a Pentium II while I used a Mobile
Pentium II. We each optimized our implementations specifically for the
platform we had, so our implementations may be slightly slower on
the other processor. The only (published?) difference between those
machines is that the Mobile Pentium II has an L2 cache of 256 KB (running
at the same speed as the processor), while the Pentium II has an L2 cache
of 512 KB (but running at half the processor speed). The difference in our
implementations stems from the accidentally chosen block count of
8000, which gives 8000 blocks x 16 bytes/block x 2 (plaintext and
ciphertext) = 250 Kbytes, just below the cache size of
the Mobile Pentium II. (added 20/05/00)
- On page 10, it should of course say that the assembly implementation
of Rijndael is 44% faster (not slower!) than the gcc implementation.
(Thanks to Meelis Roos for noticing this) (added 04/10/00)
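The working-set arithmetic in the Pentium II erratum above can be checked with a short sketch. This is only an illustration and not part of the paper's timing procedure: the function names are hypothetical, and it assumes that 1 Kbyte means 1024 bytes and that only the plaintext and ciphertext buffers count toward the working set.

```python
# Illustrative check of the buffer-size arithmetic from the erratum.
# Assumptions (not from the paper): 1 Kbyte = 1024 bytes; only the
# plaintext and ciphertext buffers are counted; names are hypothetical.

BLOCK_BYTES = 16   # AES block size (128 bits)
BUFFERS = 2        # plaintext and ciphertext


def working_set_bytes(num_blocks, block_bytes=BLOCK_BYTES, buffers=BUFFERS):
    """Bytes touched when num_blocks blocks are read and written."""
    return num_blocks * block_bytes * buffers


def fits_in_cache(num_blocks, cache_kbytes):
    """True if the buffers are strictly smaller than the given L2 cache."""
    return working_set_bytes(num_blocks) < cache_kbytes * 1024


if __name__ == "__main__":
    size = working_set_bytes(8000)
    print(size, "bytes =", size / 1024, "Kbytes")
    print("fits in 256 KB L2 (Mobile Pentium II):", fits_in_cache(8000, 256))
    print("fits in 512 KB L2 (Pentium II):", fits_in_cache(8000, 512))
```

Under these assumptions, 8000 blocks occupy 256000 bytes, i.e. 250 Kbytes, which leaves only 6 Kbytes of headroom in a 256 KB cache; a block count slightly above 8192 would no longer fit.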
Updates
- The Rijndael implementation numbers have been improved since then. See
this link for a paper that has updated values. (And please also refer to
that one...) (added 07/22/02)
See AES: Speed page
for related information and possible implementation updates of this paper.
Authors: