
I am trying to write a byte-serial implementation of 128-bit AES in ECB mode for my studies. I understand the "normal" (word-parallel) implementation of AES and Eli Biham's approach (bit-serial). Both extremes of the representation variants, namely bit-serial and word-parallel, have been examined, but what about byte-serial?

I don't understand what such an implementation could look like. Do I need to convert the S-box into logic gates? What about the MixColumns and ShiftRows operations? Are they still free, as in Eli Biham's representation?

1 Answer

If "bit-serial and bitslice(d) are equated", the question is about what I'd call bytesliced AES, by analogy with bitsliced AES. That carries out $k$ simultaneous AES operations on a machine with (I'll assume, exactly) $8k$-bit words, and uses steps that compute on $k$ bytes in parallel. Another, slightly different possibility is that the question is about a SIMD implementation of AES on hardware with $k$ byte-wide ALUs.

The $k$ input blocks of 16 bytes are split into 16 words, each concatenating the bytes of a given rank in the input blocks. As in bitsliced AES, ShiftRows thus reduces to selecting the appropriate word for the next step. AddRoundKey reduces to XOR with a word consisting of the same byte repeated $k$ times. More generally, whenever AES prescribes the addition of a byte in $\mathbb F_{2^8}$, we can perform it for all $k$ AES instances with a single word XOR.
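The layout described above can be sketched in C as follows. This is only an illustration under assumptions that are not in the original: $k=8$ with 64-bit words, block $j$ placed in byte lane $j$ of each word, the usual column-major AES state (byte index = row + 4·column), and helper names (`pack_blocks`, `unpack_blocks`, `broadcast`, `add_round_key`, `shift_rows`) that are made up for this sketch.

```c
#include <stdint.h>

#define K 8  /* assumption: 8 AES instances packed into 64-bit words */

/* Byteslice K blocks: word i collects byte i of every block,
   with block j in byte lane j. `in` holds K consecutive 16-byte blocks. */
static void pack_blocks(const uint8_t *in, uint64_t state[16]) {
    for (int i = 0; i < 16; i++) {
        uint64_t w = 0;
        for (int j = 0; j < K; j++)
            w |= (uint64_t)in[16 * j + i] << (8 * j);
        state[i] = w;
    }
}

/* Inverse of pack_blocks. */
static void unpack_blocks(const uint64_t state[16], uint8_t *out) {
    for (int i = 0; i < 16; i++)
        for (int j = 0; j < K; j++)
            out[16 * j + i] = (uint8_t)(state[i] >> (8 * j));
}

/* Repeat one byte in all K lanes of a word. */
static uint64_t broadcast(uint8_t b) {
    return b * UINT64_C(0x0101010101010101);
}

/* AddRoundKey: one XOR per word; all lanes use the same round key. */
static void add_round_key(uint64_t state[16], const uint8_t rk[16]) {
    for (int i = 0; i < 16; i++)
        state[i] ^= broadcast(rk[i]);
}

/* ShiftRows: a pure permutation of the 16 words.
   Row r is rotated left by r columns. */
static void shift_rows(uint64_t state[16]) {
    uint64_t t[16];
    for (int c = 0; c < 4; c++)
        for (int r = 0; r < 4; r++)
            t[r + 4 * c] = state[r + 4 * ((c + r) % 4)];
    for (int i = 0; i < 16; i++)
        state[i] = t[i];
}
```

In an unrolled implementation `shift_rows` would not move any data at all: the next step would simply read the words in permuted order, which is what "selecting the appropriate word" amounts to.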

In MixColumns, the same multiplicative coefficient in $\{1,2,3\}$ is applied to all bytes of a given word, easing implementation. Ideally there would be hardware support for parallel byte-wide arithmetic in $\mathbb F_{2^8}$, but lacking that, it's still possible to be fairly efficient in a high-level language. E.g. for $k=8$ (64-bit words), multiplication by $2$ in $\mathbb F_{2^8}$ of each byte in w could, I think (not tested), go:

    /* w: unsigned 64-bit word holding 8 bytes, each doubled in GF(2^8) */
    w = ( (0x8080808080808080 - (w>>7 & 0x0101010101010101)) & 0x1B1B1B1B1B1B1B1B
        ) ^ (w<<1 & 0xFEFEFEFEFEFEFEFE);

Note: a SIMD implementation can just use the usual

    /* b: unsigned 8-bit byte */
    b = ((-(b>>7)) & 0x1B) ^ (b<<1);
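To gain some confidence in the packed doubling, and to show how it could plug into MixColumns, here is a small sketch. It assumes $k=8$ and `uint64_t` words; the function names (`xtime`, `xtime8`, `xtime8_selftest`, `mix_column`) are invented for this illustration, and the MixColumns formulation via $t = a_0 \oplus a_1 \oplus a_2 \oplus a_3$ is one common way to write it, not necessarily the fastest.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: multiply one byte by 2 in GF(2^8) (AES polynomial 0x11B). */
static uint8_t xtime(uint8_t b) {
    return (uint8_t)(((-(b >> 7)) & 0x1B) ^ (b << 1));
}

/* Packed doubling: multiplies all 8 bytes of w by 2 at once.
   0x80 - msb is 0x80 or 0x7F per byte (no borrow crosses a byte),
   so masking with 0x1B yields 0x00 or 0x1B in each lane. */
static uint64_t xtime8(uint64_t w) {
    uint64_t msb = (w >> 7) & UINT64_C(0x0101010101010101);
    uint64_t red = (UINT64_C(0x8080808080808080) - msb)
                   & UINT64_C(0x1B1B1B1B1B1B1B1B);
    return red ^ ((w << 1) & UINT64_C(0xFEFEFEFEFEFEFEFE));
}

/* Compare xtime8 with xtime for every pair of byte values (v, u)
   placed in alternating lanes, so each value meets each neighbour. */
static void xtime8_selftest(void) {
    for (int v = 0; v < 256; v++)
        for (int u = 0; u < 256; u++) {
            uint64_t w = (uint64_t)v * UINT64_C(0x0001000100010001)
                       | (uint64_t)u * UINT64_C(0x0100010001000100);
            uint64_t e = (uint64_t)xtime((uint8_t)v) * UINT64_C(0x0001000100010001)
                       | (uint64_t)xtime((uint8_t)u) * UINT64_C(0x0100010001000100);
            assert(xtime8(w) == e);
        }
}

/* MixColumns on one column, in place: a[0..3] are the four words of the
   column (rows 0..3), each carrying 8 independent AES instances.
   Row i becomes 2*a[i] ^ 3*a[i+1] ^ a[i+2] ^ a[i+3] (indices mod 4). */
static void mix_column(uint64_t a[4]) {
    uint64_t a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3];
    uint64_t t = a0 ^ a1 ^ a2 ^ a3;
    a[0] = a0 ^ t ^ xtime8(a0 ^ a1);
    a[1] = a1 ^ t ^ xtime8(a1 ^ a2);
    a[2] = a2 ^ t ^ xtime8(a2 ^ a3);
    a[3] = a3 ^ t ^ xtime8(a3 ^ a0);
}
```

Calling `xtime8_selftest()` once at start-up is enough to exercise all byte values; a full MixColumns would call `mix_column` on each of the four columns of the bytesliced state, at a cost of four packed doublings per column.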

The one difficult step is SubBytes, if there's no hardware support for it. I suspect some of the techniques there would allow a slight improvement over going fully bitwise, but I have nothing canned to propose. What's optimal surely depends on the available hardware.

fgrieu