Fast implementations

by Helger Lipmaa

(Update 2008) All information below is somewhat out-of-date, but please feel free to email me and ask.

Since 1997, I have written assembly-level implementations of different block ciphers for Pentium MMX and compatible processors. This has resulted in several publications and a number of the fastest available implementations for the Intel IA32 architecture, including:

Complete AES (Rijndael) Library

A complete hand-optimized library of AES, including the well-known block cipher modes such as CBC and ECB, but also CBC-MAC and, lately, the new mode called OCB, which provides both confidentiality and authenticity in one pass and is hence about twice as efficient as the CBC and CBC-MAC modes combined. The current version of OCB-AES runs at ~65.0 MBytes/s (or >500 Mbit/s) on an 800 MHz Pentium III machine.
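To relate the throughput figure above to the cycles-per-byte figures commonly used for cipher benchmarks, the conversion can be sketched as follows. This is only illustrative arithmetic: the function names are made up for this example, and it assumes that 1 MByte means 10^6 bytes and that the processor runs at exactly its nominal 800 MHz. The resulting cycles-per-byte value is derived from the quoted throughput, not a number reported separately by the author.

```python
def cycles_per_byte(clock_hz: float, bytes_per_second: float) -> float:
    """Clock cycles spent per processed byte at a given throughput."""
    return clock_hz / bytes_per_second


def megabits_per_second(mbytes_per_second: float) -> float:
    """Convert MBytes/s to Mbit/s (8 bits per byte)."""
    return mbytes_per_second * 8


# Assumption: 1 MByte = 10**6 bytes, nominal clock of 800 MHz.
ocb_cpb = cycles_per_byte(800e6, 65.0e6)
ocb_mbit = megabits_per_second(65.0)

print(f"{ocb_cpb:.1f} cycles/byte, {ocb_mbit:.0f} Mbit/s")
# prints "12.3 cycles/byte, 520 Mbit/s"
```

The 520 Mbit/s result agrees with the ">500 Mbit/s" quoted above.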

This page contains more information on the speed of my library, including comparisons with competing products. (I update this page as often as I get more information on competing products.) My code seems to be 25% faster than the closest competitor on the Pentium III platform.

I sell this library. Please contact me for information. Note that I am currently adding Pentium 4 support to my library. The results are already very good (almost 1.5 Gbit/s on the 3.06 GHz P4; see here for more speed numbers). The next version of the library, which includes both Pentium 3 and Pentium 4 optimized subroutines, will be available before the end of January 2003. I offer a flat price (royalty-free), short-term support, and full documentation.

FastIDEA

A four-way FastIDEA implementation that uses the MMX technology to encrypt and decrypt four blocks in parallel with the well-known IDEA block cipher. The resulting implementation runs at 55.5 MBytes/s (or 444 Mbit/s) on an 800 MHz Pentium III machine.

The FastIDEA library is now at version 2.0. I sell it. Please contact me.

AES Candidates

The world's fastest implementations of all leading AES candidates, including the proposed AES, Rijndael, but also three other main AES candidates: Twofish, RC6, and MARS. See the semiofficial AES Candidates: A Survey of Implementations page that I maintain for more information.

This library also contains highly optimized C code for Serpent and SC2000. It is not sold on a regular basis, but if you are interested, please contact me.

Other projects

I also have experience in implementing other ciphers. For example, I had a contract with Fujitsu to implement their new block cipher SC2000. Results: a 1.6x speed increase over the previous best, and two publications. See the second publication (it has implementation details).

I am open to proposals for implementing other ciphers/cipher modes (not only symmetric), and also to proposals for modifying my code to run fast on non-Intel processors (AMD, Transmeta, ... but also PPC etc).

New

Test runs in 2008

Processor technology has evolved a lot since I wrote my implementations. I just test-ran them on two contemporary machines. In all cases, I am stepping away from my old (and misguided) tradition of subtracting the time of an empty loop from the actual numbers. That is, when I say that something takes 230 cycles, then it actually does. In everything else, however, I follow the guidelines we set down in our 2000 paper. That is, I run AES-ECB on 8000 blocks many times and then report the best timing I get. Finally, I have not optimized the old code for the new machines. In fact, I did some optimization for Pentium 4, but I did not spend too much time on it.
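The measurement procedure described above (run a fixed workload many times, report the best timing, and do not subtract any loop overhead) can be sketched roughly as follows. This is not the harness actually used for the numbers on this page: `best_time_ns` and `dummy_workload` are hypothetical names, the workload is a stand-in that is not AES, and the sketch measures wall-clock nanoseconds with Python's `time.perf_counter_ns` rather than processor cycles.

```python
import time
from typing import Callable


def best_time_ns(work: Callable[[], object], repetitions: int = 50) -> int:
    """Run `work` several times and return the smallest elapsed time in ns.

    Taking the minimum filters out runs disturbed by interrupts or cache
    misses. Nothing is subtracted for loop or call overhead.
    """
    if repetitions < 1:
        raise ValueError("repetitions must be at least 1")
    best = None
    for _ in range(repetitions):
        start = time.perf_counter_ns()
        work()
        elapsed = time.perf_counter_ns() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


N_BLOCKS = 8000  # number of blocks per run, as in the text


def dummy_workload() -> int:
    """Placeholder for 'encrypt 8000 blocks'. This is NOT a cipher."""
    pad = int.from_bytes(bytes(range(16)), "big")
    acc = 0
    for i in range(N_BLOCKS):
        acc ^= pad + i
    return acc


best = best_time_ns(dummy_workload)
print(f"best run: {best} ns, about {best / N_BLOCKS:.1f} ns per block")
```

Reporting the minimum rather than the mean is a design choice: the fastest run is the one least affected by unrelated system activity, so it is the most repeatable figure.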

The new machines are:
(1) Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (stepping 7)
(2) Intel(R) Xeon(R) CPU E5410 @ 2.33GHz (stepping 6)

On the first machine, encryption key scheduling (written in C, compiled with gcc 4.2.3) takes 192 cycles. Encrypting one block (assembly) takes 230 cycles, and decrypting one block (assembly) takes 237 cycles in ECB. OCB encryption takes 251 cycles.

On the second machine, encryption key scheduling (written in C, compiled with gcc 4.1.2) takes 144(!) cycles. Encrypting one block (assembly) takes 228 cycles, and decrypting one block (assembly) takes 237 cycles in ECB. OCB encryption takes 250 cycles.
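To put these per-block cycle counts on the same footing as cycles-per-byte or MBytes/s figures, one can divide by the AES block size of 16 bytes and scale by the clock rate. The sketch below is illustrative arithmetic only. It assumes that the OCB figures are also per 16-byte block, that each processor runs at its nominal clock, that a single core is used, and that 1 MByte is 10^6 bytes. The helper names are made up for this example.

```python
AES_BLOCK_BYTES = 16  # AES always uses 128-bit blocks


def per_byte(cycles_per_block: float) -> float:
    """Cycles per byte, given cycles per 16-byte block."""
    return cycles_per_block / AES_BLOCK_BYTES


def mbytes_per_second(cycles_per_block: float, clock_hz: float) -> float:
    """Single-core throughput in MBytes/s (10**6 bytes) at a nominal clock."""
    return clock_hz / per_byte(cycles_per_block) / 1e6


# (label, cycles per block from the text, nominal clock in Hz)
measurements = [
    ("Q9550 ECB encrypt", 230, 2.83e9),
    ("Q9550 ECB decrypt", 237, 2.83e9),
    ("Q9550 OCB encrypt", 251, 2.83e9),
    ("E5410 ECB encrypt", 228, 2.33e9),
    ("E5410 ECB decrypt", 237, 2.33e9),
    ("E5410 OCB encrypt", 250, 2.33e9),
]

for label, cycles, clock in measurements:
    print(f"{label}: {per_byte(cycles):.2f} cycles/byte, "
          f"{mbytes_per_second(cycles, clock):.0f} MBytes/s")
```

For instance, 230 cycles per block works out to roughly 14.4 cycles per byte.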

I actually have four completely different assembly codes for encryption, and it turns out they all perform approximately the same. The slowest implementation is maybe 5-7% slower than the fastest one.

My numbers are "official" for the OCB mode

At the 802.11 meeting on security (04.07.2001, Portland), Phillip Rogaway will use my numbers to advertise the OCB mode:

    OCB-AES       16.9 cycles per byte   (6.5% slower than CBC-AES)
    CBC-AES       15.9 cycles per byte
    CBCMAC-AES    15.5 cycles per byte

The above data is for 1 Kbyte messages. The code is pure Pentium 3 assembly. The block cipher is AES-128. The overhead is so small that AES with a C-code CBC wrapper is slightly more expensive than AES with an assembly OCB wrapper.
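The figures above can be used to check the claim that one-pass OCB is about twice as efficient as CBC encryption followed by a separate CBC-MAC pass. The following is a small arithmetic sketch, assuming that the cost of the two-pass approach is simply the sum of the two separately measured costs. The variable names are illustrative.

```python
# Cycles per byte for 1 Kbyte messages, taken from the table above.
ocb = 16.9
cbc = 15.9
cbcmac = 15.5

# Assumption: encrypt-then-MAC with two passes costs the sum of both passes.
two_pass = cbc + cbcmac
speedup = two_pass / ocb
overhead_pct = (ocb / cbc - 1) * 100

print(f"CBC + CBC-MAC: {two_pass:.1f} cycles/byte")
print(f"OCB speedup over two passes: {speedup:.2f}x")
print(f"OCB overhead over CBC alone: {overhead_pct:.1f}%")
# prints:
# CBC + CBC-MAC: 31.4 cycles/byte
# OCB speedup over two passes: 1.86x
# OCB overhead over CBC alone: 6.3%
```

The rounded table entries give an overhead of about 6.3%, slightly below the quoted 6.5%. The quoted figure may have been computed from unrounded measurements.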

(Note: the presented speed numbers can change.)

All mentioned implementations are (or will shortly be) commercially available from me for a reasonable fee. I encourage you to email <helger.lipmaa>gmail.com for more information on the concrete terms.

Helger Lipmaa