
I am trying to compare the CPU cycles required by two encryption algorithms. One algorithm is AES; let's call the other one B (a code name). I implemented algorithm B, which has fewer and simpler operations than AES, so I expected it to take far less time (fewer CPU cycles) per encryption than AES.

I am using a 10th-generation Intel i3 processor with 4 MB of cache. I run each algorithm on its own (with a random 16-byte input) $2^{20}$ times and record the minimum and maximum CPU cycles required.

For algorithm B, the CPU cycles are minimum=2707866 and maximum=4767402. For the Crypto++ implementation of AES, however, the CPU cycles are minimum=2724 and maximum=29978194. I have run the test multiple times and the results are consistent. The maximum for AES is clearly much higher (about 6x) than for algorithm B, but the minimum for AES is far lower.

I then recorded the CPU cycles required for each of the $2^{20}$ AES encryptions. The first encryption takes the maximum number of cycles (29978194); after that the cycle count drops drastically, and after 10-15 encryptions each encryption takes almost the same number of cycles (approximately 3000). For algorithm B, every encryption takes almost the same number of CPU cycles.

I do not understand the drastic reduction (about 10000x) in CPU cycles for AES encryption with the Crypto++ library. Is this some voodoo of AES-NI? Can someone tell me what kind of optimization is being done there?

Radium

2 Answers

6

Actually, your numbers seem shockingly high at 2724 cycles for one block - even with the key schedule.

Crypto++ uses standard AES-NI for the encryption of blocks; for the key generation it uses AESKEYGENASSIST for the S-box (unfortunately).

Ideally the expected performance would be (for their implementation, not for one with an optimized on-the-fly key expansion):

  • ~783 cycles for the keyschedule (as estimated by llvm-mca)
  • 1x 1 cycle for the initial XOR of the key
  • 10x 4 cycles for AESENC/ AESENCLAST

So overall less than 1000 cycles.
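The estimate above is just a sum of the three listed terms. As a quick sanity check, here is the arithmetic spelled out; the constant names are made up for illustration, and the figures are the ones quoted in the list (AES-128, 10 rounds):

```python
# Figures taken from the list above; names are illustrative only.
KEY_SCHEDULE_CYCLES = 783   # llvm-mca estimate for the Crypto++ key schedule
INITIAL_XOR_CYCLES = 1      # one XOR with the first round key
ROUNDS = 10                 # AESENC x9 + AESENCLAST x1 for AES-128
ROUND_LATENCY_CYCLES = 4    # latency of one AESENC / AESENCLAST

encryption_cycles = INITIAL_XOR_CYCLES + ROUNDS * ROUND_LATENCY_CYCLES
total_cycles = KEY_SCHEDULE_CYCLES + encryption_cycles

print("encryption only:", encryption_cycles)
print("key schedule + encryption:", total_cycles)
```

Under these assumptions the key schedule dominates: the block encryption itself is a small fraction of the total.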

Even if we assume AES-256 here, we shouldn't get beyond 1500 cycles for the actual operations. 2700 is massively off from that and suggests inefficiencies in feeding AES data.

Also note that this kind of test is highly unfair to the AES-NI hardware, because it only really shines if you give it 4 or more independent AES operations to calculate in parallel (each round instruction has a latency of 4 cycles but a throughput of 1 instruction per cycle). Furthermore, note that Crypto++'s key schedule implementation is ... rather lackluster; optimized implementations compute a round key in less than 20 cycles, so less than 200 overall for the entire cipher.
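To see why independent blocks matter, here is a toy model of the latency/throughput argument. It assumes a deliberately idealized pipeline (one round instruction issued per cycle, each result available 4 cycles later, nothing else going on); real CPUs are more complicated, and the function name is made up for illustration:

```python
def cycles_for_blocks(n_blocks, rounds=10, latency=4):
    """Cycles until all blocks finish in an idealized single-issue pipeline.

    Assumption: one AES round instruction can issue per cycle, and a block's
    next round cannot issue until its previous round result is ready.
    """
    ready = [0] * n_blocks  # cycle at which each block's last result is ready
    clock = 0               # next free issue slot
    for _ in range(rounds):
        for b in range(n_blocks):
            issue = max(clock, ready[b])  # wait for a slot and for the data
            ready[b] = issue + latency
            clock = issue + 1
    return max(ready)

for n in (1, 2, 4, 8):
    total = cycles_for_blocks(n)
    print(f"{n} block(s): {total} cycles total, {total / n:.2f} cycles per block")
```

In this model a single block spends most of its time waiting on the 4-cycle latency, while 4 or more interleaved blocks keep the unit busy nearly every cycle, which is the effect a one-block-at-a-time benchmark cannot show.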

You can find the translated key expansion code here. You can look the performance of the instructions up either at Intel's site or in Agner Fog's tables.


A fairer comparison would probably take the best non-AES-NI implementation, an AES-NI implementation of AES and an optimized implementation of your cipher and put them up against each other in these categories:

  • CBC encryption of long messages - measures the cipher latency and long messages "hide" the key schedule cost
  • CTR encryption of long messages - measures the maximal throughput
  • (Optionally) Stand-alone keyschedule computations with one encryption (which is roughly what you measured) - with an implementation optimized for fast key schedules (aka not Crypto++ as it currently stands)
SEJPM
2

AES has highly optimized implementations, including special instructions added to Intel CPUs specifically for AES. But even without them, it is amazing how much optimization can go into the implementation of an algorithm; it's quite an art.

I also have doubts about your measurements; the maximum may be only noise from context switches etc.
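One common way to keep such outliers from dominating is to run a few untimed warm-up iterations and then look at the median rather than the maximum. The sketch below shows the idea in Python with wall-clock nanoseconds instead of CPU cycles; `measure`, `summarize` and `toy_workload` are hypothetical names, and the workload is only a stand-in for one encryption call:

```python
import statistics
import time

def measure(fn, runs=1000, warmup=20):
    """Call fn() `warmup` times untimed, then return `runs` timings in ns."""
    for _ in range(warmup):
        fn()  # lets caches, lazy initialization, etc. settle before timing
    samples = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return samples

def summarize(samples):
    """Min and median are robust; max is where scheduler noise shows up."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }

def toy_workload():
    # Hypothetical stand-in for a single encryption call.
    return sum(i * i for i in range(1000))

stats = summarize(measure(toy_workload))
print(stats)
```

The same structure carries over to a cycle counter in C or C++: discard the first iterations, collect all samples, and compare medians instead of extremes.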

AES can be implemented to be very fast: hundreds of MB per second on a modern CPU. It will be hard to beat with something you implement yourself, even if your algorithm is fundamentally more efficient.


Software AES-256-CBC performance on OpenSSL

openssl speed aes-256-cbc
Doing aes-256 cbc for 3s on 16 size blocks: 34522867 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 64 size blocks: 8989219 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 256 size blocks: 2263537 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 1024 size blocks: 536651 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 8192 size blocks: 68886 aes-256 cbc's in 3.00s
Doing aes-256 cbc for 3s on 16384 size blocks: 34381 aes-256 cbc's in 3.00s

AES-NI AES-256-CBC performance on OpenSSL where available with evp

openssl speed -evp aes-256-cbc
Doing aes-256-cbc for 3s on 16 size blocks: 128709647 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 46266772 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 11740574 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 2953460 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 368469 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 181790 aes-256-cbc's in 3.00s

Approximately a 5x speedup (closer to 3.7x for 16-byte blocks).
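For reference, the ratio can be recomputed directly from the two listings. This is a small sketch that copies the operation counts from the output above; the dictionary names are made up for illustration, and each run is taken to last exactly 3.00 s as reported:

```python
# Operation counts copied from the two `openssl speed` listings above,
# keyed by block size in bytes. Each run lasted 3.00 s.
software = {16: 34522867, 64: 8989219, 256: 2263537,
            1024: 536651, 8192: 68886, 16384: 34381}
aesni = {16: 128709647, 64: 46266772, 256: 11740574,
         1024: 2953460, 8192: 368469, 16384: 181790}
SECONDS = 3.0

def mb_per_second(count, block_size):
    """Throughput in MB/s (1 MB = 10**6 bytes)."""
    return count * block_size / SECONDS / 1e6

speedup = {size: aesni[size] / software[size] for size in software}

for size in software:
    print(f"{size:>6} bytes: "
          f"{mb_per_second(software[size], size):8.1f} MB/s software, "
          f"{mb_per_second(aesni[size], size):8.1f} MB/s AES-NI, "
          f"{speedup[size]:.2f}x")
```

The ratio is lowest for 16-byte blocks, where per-call overhead is a larger share of the work.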

kelalaka
Meir Maor