As I understand it, NORX runs at about 3.5 cycles per byte (CPB) in an unoptimized implementation.
But NORX can be parallelized. With 4-way SIMD, I would expect about 3.5/4 = 7/8 CPB, assuming ideal scaling. This is nearly as good as AES-GCM in hardware.
Is something wrong with my assessment, or is parallel NORX a faster choice than ChaCha20-Poly1305? Why were the sequential versions the primary recommendations?
I have the same questions about Keyak, which runs at 1.8 CPB with 2-way parallelism and would presumably reach 0.9 CPB with 4-way parallelism.
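
To make the arithmetic behind these estimates explicit, here is a minimal sketch. It assumes throughput scales perfectly linearly with the number of parallel lanes, which is an idealization and exactly the assumption I am asking about. The function name `ideal_parallel_cpb` is my own, not taken from any library.

```python
def ideal_parallel_cpb(measured_cpb, measured_lanes, target_lanes):
    """Estimate cycles per byte at target_lanes, given a measurement at
    measured_lanes, ASSUMING perfectly linear scaling (no overhead)."""
    return measured_cpb * measured_lanes / target_lanes


# Figures quoted in the question: NORX at 3.5 CPB sequential,
# Keyak at 1.8 CPB with 2-way parallelism.
norx_4way = ideal_parallel_cpb(3.5, measured_lanes=1, target_lanes=4)
keyak_4way = ideal_parallel_cpb(1.8, measured_lanes=2, target_lanes=4)

print(f"NORX, 4-way (ideal):  {norx_4way:.3f} CPB")
print(f"Keyak, 4-way (ideal): {keyak_4way:.3f} CPB")
```

Any real implementation would land somewhere above these numbers, since merging lanes, handling short messages, and register pressure all cost cycles that this model ignores.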