Are there any efficient implementations of Speck96? The difficulty is doing 48-bit arithmetic efficiently on 64-bit words, the rotates in particular. I have been trying to implement it and have written the following in Rust:

    // Rotate a 48-bit value (stored in the low 48 bits of a u64) right by r bits.
    #[inline(always)]
    fn rotate_right_48(x: u64, r: usize) -> u64 {
        ((x >> r) | (x << (48 - r))) & 0x0000_ffff_ffff_ffff
    }

    // Rotate a 48-bit value (stored in the low 48 bits of a u64) left by r bits.
    #[inline(always)]
    fn rotate_left_48(x: u64, r: usize) -> u64 {
        ((x << r) | (x >> (48 - r))) & 0x0000_ffff_ffff_ffff
    }

    #[inline(always)]
    fn speck96_round(x: &mut u64, y: &mut u64, k: u64) {
        *x = rotate_right_48(*x, 8);
        // Addition modulo 2^48: mask off the carry into the upper 16 bits.
        *x = (*x).wrapping_add(*y) & 0x0000_ffff_ffff_ffff;
        *x ^= k;
        *y = rotate_left_48(*y, 3);
        *y ^= *x;
    }

    #[inline(always)]
    fn speck96_unround(x: &mut u64, y: &mut u64, k: u64) {
        *y ^= *x;
        *y = rotate_right_48(*y, 3);
        *x ^= k;
        // Subtraction modulo 2^48: mask off the borrow out of bit 47.
        *x = (*x).wrapping_sub(*y) & 0x0000_ffff_ffff_ffff;
        *x = rotate_left_48(*x, 8);
    }

    pub fn speck96_encrypt(pt: [u8; 12], key: [u64; 2]) -> [u8; 12] {
        // Load the two 48-bit halves of the block into the low bits of u64s.
        let mut x = u64::from_le_bytes([pt[0], pt[1], pt[2], pt[3], pt[4], pt[5], 0, 0]);
        let mut y = u64::from_le_bytes([pt[6], pt[7], pt[8], pt[9], pt[10], pt[11], 0, 0]);
        let mut a = key[0];
        let mut b = key[1];
        for i in 0..7 {
            speck96_round(&mut b, &mut a, i);
            speck96_round(&mut y, &mut x, a);
        }
        // Pack the two 48-bit halves back into 12 little-endian bytes.
        let combined = (x as u128) | ((y as u128) << 48);
        combined.to_le_bytes()[..12].try_into().unwrap()
    }

The problem is that simulating 48-bit words in 64-bit registers generates quite inefficient code, with long dependency chains that kill any instruction-level parallelism. Basically, I need to clear the upper 16 bits before the rotate instructions, and this seems to be the cause.
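To make the masking cost concrete, here is a minimal sketch of one possible variation: keeping each 48-bit word left-aligned in the upper 48 bits of the `u64`, so that the low 16 bits are always zero. Under that assumption the addition wraps modulo 2^48 on its own and needs no mask, though each rotate still needs one. The names `HI48`, `ror48_hi`, `rol48_hi`, `speck96_round_hi` and `speck96_round_lo` are hypothetical and exist only for this illustration. The right-aligned round is repeated so the two layouts can be compared. Whether this actually shortens the dependency chain would have to be measured.

```rust
/// Mask selecting the upper 48 bits of a u64 (hypothetical left-aligned layout).
const HI48: u64 = 0xffff_ffff_ffff_0000;
/// Mask selecting the lower 48 bits of a u64 (the right-aligned layout).
const LO48: u64 = 0x0000_ffff_ffff_ffff;

/// Rotate right by r (r <= 16) within the upper 48 bits; low 16 bits of x must be zero.
#[inline(always)]
fn ror48_hi(x: u64, r: u32) -> u64 {
    // Bits shifted below bit 16 are junk and are cleared by the mask.
    ((x >> r) | (x << (48 - r))) & HI48
}

/// Rotate left by r (r <= 16) within the upper 48 bits; low 16 bits of x must be zero.
#[inline(always)]
fn rol48_hi(x: u64, r: u32) -> u64 {
    ((x << r) | (x >> (48 - r))) & HI48
}

/// One round on left-aligned words. The round key k must be left-aligned too.
#[inline(always)]
fn speck96_round_hi(x: &mut u64, y: &mut u64, k: u64) {
    *x = ror48_hi(*x, 8);
    // Both operands have zero low 16 bits, so the 64-bit wrapping sum is
    // exactly the 48-bit sum shifted left by 16: no mask is needed here.
    *x = (*x).wrapping_add(*y);
    *x ^= k;
    *y = rol48_hi(*y, 3);
    *y ^= *x;
}

/// The same round on right-aligned words, kept here only for comparison.
#[inline(always)]
fn speck96_round_lo(x: &mut u64, y: &mut u64, k: u64) {
    *x = ((*x >> 8) | (*x << 40)) & LO48;
    *x = (*x).wrapping_add(*y) & LO48;
    *x ^= k;
    *y = ((*y << 3) | (*y >> 45)) & LO48;
    *y ^= *x;
}
```

With this layout a word `v` is stored as `v << 16` and recovered with `>> 16`, so converting happens only once on load and once on store rather than in every round.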
Are there any versions of the algorithm that would operate on a triple of 32-bit registers instead?