While I understand the principle of bit-slicing, several papers mention byte-sliced AES implementations (see e.g. "Homomorphic Evaluation of the AES Circuit" and "Fast Implementations of AES on Various Platforms").
However, I don't clearly understand how byte-slicing works. In particular, the above-mentioned papers state that:
- 16 blocks are processed in parallel
- The permutations in ShiftRows/MixColumns are now "for free"
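
To make the question concrete, here is a minimal Python sketch of how I currently read the byte-sliced layout. It is my own guess, not code from either paper. I assume the usual column-major AES state (byte index `r + 4*c` for row `r`, column `c`), and the names `byte_slice`, `shift_rows_sliced` and `shift_rows_block` are purely illustrative. The idea is that slice `i` gathers byte `i` of each of the 16 blocks, so ShiftRows would only change which slice is used where, without moving any data inside a slice.

```python
def byte_slice(blocks):
    """Transpose 16 blocks of 16 bytes: slice i holds byte i of every block."""
    assert len(blocks) == 16 and all(len(b) == 16 for b in blocks)
    return [bytes(blocks[j][i] for j in range(16)) for i in range(16)]


def shift_rows_sliced(slices):
    """ShiftRows on the sliced state: only the order of the slices changes.

    Assumes column-major indexing, i.e. state byte (row r, column c) has
    index r + 4*c, and row r is rotated left by r positions.
    """
    return [slices[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


def shift_rows_block(block):
    """Reference ShiftRows on a single 16-byte block, for comparison."""
    return bytes(block[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4))


# 16 distinct example blocks (arbitrary test data).
blocks = [bytes((16 * j + i) % 256 for i in range(16)) for j in range(16)]

slices = byte_slice(blocks)
shifted = shift_rows_sliced(slices)

# Transposing a 16x16 byte matrix twice gives back the original layout,
# so byte_slice also converts the sliced form back to ordinary blocks.
recovered = byte_slice(shifted)
```

If this reading is right, `recovered` should equal ShiftRows applied to each block separately, and in a real implementation the reordering would presumably just be a matter of which register each later instruction reads, rather than an explicit step.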
Could someone explain how byte-slicing works in the case of AES, and how it allows 16 blocks to be processed in parallel without explicitly computing ShiftRows?