
Are there any efficient implementations of Speck96? The problem is how to efficiently do 48-bit arithmetic on 64-bit words, the rotates in particular. I've been trying to implement it and have written the following in Rust:

#[inline(always)]
fn rotate_right_48(x: u64, r: usize) -> u64 {
    ((x >> r) | (x << (48 - r))) & 0x0000_ffff_ffff_ffff
}

#[inline(always)]
fn rotate_left_48(x: u64, r: usize) -> u64 {
    ((x << r) | (x >> (48 - r))) & 0x0000_ffff_ffff_ffff
}

#[inline(always)]
fn speck96_round(x: &mut u64, y: &mut u64, k: u64) {
    *x = rotate_right_48(*x, 8);
    *x = x.wrapping_add(*y) & 0x0000_ffff_ffff_ffff;
    *x ^= k;
    *y = rotate_left_48(*y, 3);
    *y ^= *x;
}

#[inline(always)]
fn speck96_unround(x: &mut u64, y: &mut u64, k: u64) {
    *y ^= *x;
    *y = rotate_right_48(*y, 3);
    *x ^= k;
    *x = x.wrapping_sub(*y) & 0x0000_ffff_ffff_ffff;
    *x = rotate_left_48(*x, 8);
}

pub fn speck96_encrypt(pt: [u8; 12], key: [u64; 2]) -> [u8; 12] {
    let mut x = u64::from_le_bytes([pt[0], pt[1], pt[2], pt[3], pt[4], pt[5], 0, 0]);
    let mut y = u64::from_le_bytes([pt[6], pt[7], pt[8], pt[9], pt[10], pt[11], 0, 0]);

    let mut a = key[0];
    let mut b = key[1];

    // Speck96/96 has 28 rounds; each round key is used before the
    // key schedule advances it.
    for i in 0..28 {
        speck96_round(&mut y, &mut x, a);
        speck96_round(&mut b, &mut a, i);
    }

    let combined = (x as u128) | ((y as u128) << 48);
    combined.to_le_bytes()[..12].try_into().unwrap()
}

The problem I have is that simulating 48-bit words with 64-bit registers generates quite inefficient code, with long dependency chains that kill any instruction-level parallelism. Basically, I need to clear the upper 16 bits before the rotate instructions, and this seems to be the cause.

Are there any versions of the algorithm that would operate on a triple of 32-bit registers instead?

b degnan
eof

1 Answer


These should be drop-in replacements, and I hope they are more scheduling-friendly.

#[inline(always)]
fn rotate_right_48(x: u64, r: usize) -> u64 {
    ((x << (64-r)) >> 16) | (x >> r)
}

#[inline(always)]
fn rotate_left_48(x: u64, r: usize) -> u64 {
    ((x << (16 + r)) >> 16) | (x >> (48 - r))
}
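
As a quick sanity check (my addition, not part of the original answer), both directions can be compared against the masked formulation from the question over a few 48-bit samples:

#[cfg(test)]
mod rotation_tests {
    use super::*;

    // Reference rotations: the masked formulation from the question.
    fn ror48_ref(x: u64, r: usize) -> u64 {
        ((x >> r) | (x << (48 - r))) & 0x0000_ffff_ffff_ffff
    }

    fn rol48_ref(x: u64, r: usize) -> u64 {
        ((x << r) | (x >> (48 - r))) & 0x0000_ffff_ffff_ffff
    }

    #[test]
    fn matches_masked_reference() {
        let samples = [1u64, 0x0000_8000_0000_0001, 0x0000_dead_beef_cafe, 0x0000_ffff_ffff_ffff];
        for &x in &samples {
            for r in 1..48 {
                assert_eq!(rotate_right_48(x, r), ror48_ref(x, r));
                assert_eq!(rotate_left_48(x, r), rol48_ref(x, r));
            }
        }
    }
}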

An additional optimization is possible by putting the useful 48 bits in the high-order bits of a 64-bit word and keeping the low 16 bits at zero. That allows the masking after wrapping_add to be removed, since carries simply fall off the top of the register. It is necessary to adjust speck96_encrypt for the new position of the payload, and the rotations to

#[inline(always)]
fn rotate_right_48(x: u64, r: usize) -> u64 {
    ((x >> (16 + r)) << 16) | (x << (48 - r))
}

#[inline(always)]
fn rotate_left_48(x: u64, r: usize) -> u64 {
    ((x >> (64 - r)) << 16) | (x << r)
}
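
For illustration, here is a sketch of the round function in that representation (my addition; it assumes the round keys have likewise been pre-shifted so their payload sits in the high 48 bits):

#[inline(always)]
fn speck96_round_hi(x: &mut u64, y: &mut u64, k: u64) {
    // Invariant: x, y and k keep their 48-bit payload in the upper
    // 48 bits of the register, with the low 16 bits held at zero.
    *x = rotate_right_48(*x, 8);
    // Carries past bit 63 wrap away, so this is exactly addition
    // modulo 2^48 shifted left by 16: no mask required.
    *x = x.wrapping_add(*y);
    *x ^= k;
    *y = rotate_left_48(*y, 3);
    *y ^= *x;
}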

As rightly commented by @poncho, it might help to allow the unused 16 bits to be anything, rather than assuming they are zero and forcing them back to zero. That requires other changes to the rotations; merely removing & 0x0000_ffff_ffff_ffff in the originals won't do.
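
One way to realize this (a sketch of my own, not tested against vectors): keep the payload in the low 48 bits but let bits 48..63 hold garbage between operations, and mask only the operand whose right shift would drag that garbage into the payload. Additions and XORs then need no masking at all, since high garbage cannot contaminate the low 48 bits, and a single mask before serializing the ciphertext suffices.

const MASK48: u64 = 0x0000_ffff_ffff_ffff;

// Lazy-masking rotations: inputs may carry garbage in bits 48..63 and
// the outputs do too, but the low 48 bits are always the correct rotation.
#[inline(always)]
fn rotate_right_48_lazy(x: u64, r: usize) -> u64 {
    ((x & MASK48) >> r) | (x << (48 - r))
}

#[inline(always)]
fn rotate_left_48_lazy(x: u64, r: usize) -> u64 {
    (x << r) | ((x & MASK48) >> (48 - r))
}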


Independently: I would fear that passing the arguments of speck96_round by & reference and dereferencing them with *, combined with the rotations not having a native instruction, messes up code optimization. In one application I have no choice but to use a C compiler that falls prey to that, in which case I use preprocessor macros rather than references, plus forceinline, for a massive improvement.
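
In Rust, a simpler escape hatch (my sketch, a by-value variant of the question's round function) is to pass the words by value and return the new pair, which avoids references entirely and lets the optimizer keep everything in registers:

#[inline(always)]
fn speck96_round_v(mut x: u64, mut y: u64, k: u64) -> (u64, u64) {
    x = rotate_right_48(x, 8);
    x = x.wrapping_add(y) & 0x0000_ffff_ffff_ffff;
    x ^= k;
    y = rotate_left_48(y, 3);
    y ^= x;
    (x, y)
}

At the call site that becomes (y, x) = speck96_round_v(y, x, a);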

fgrieu