I have a __mmask64 mask register and I need to chop it into 4 __mmask16 mask registers.

I (incorrectly) assumed that the following line of code would have done the trick:

__mmask16 mask_16 = static_cast<__mmask16>(mask_64 >> 16);

But I get (Intel C++ Compiler 18.0):

kmovq       r14,k1  
shr         r15,10h
kmovw       k2,r15d

Since the Intel Intrinsics Guide does not have something like _mm512_kshift(k, imm8), and the definition of, for example, _mm512_kand is just:

#define _mm512_kand(k1, k2) ((__mmask16) ((k1) & (k2)))

I assumed shifting would have given me a KSHIFTRW.

Question: How can I generate a KSHIFTRW from C++?

Edit: I just found a related question with a sufficient answer: Missing AVX-512 intrinsics for masks?

HJLebbink
  • This is a missed optimization in most compilers. gcc/clang/icc almost always move to integer regs and back, even for one instruction they could have done with a `k` instruction. – Peter Cordes Nov 17 '17 at 10:12
  • Just looked at your original goal again. You want to unpack one mask to 4. If you don't need the upper bits zeroed for each mask, that's just 3x `kshiftrq k2/3/4, k1, 16 / 32 / 48`, which is 3 uops for port 5 on Skylake-AVX512. `kmovw k, r32` also needs port 5, so either way you compete with vector ALU ops. Moving to GP registers is strictly worse, even if it's clever and uses 3x `rorx` / `kmovw`. – Peter Cordes Nov 18 '17 at 03:41
  • Unfortunately store/reload isn't viable either: `kmovw k, [mem16]` is 3 uops: load + port5 + any-ALU-port (p0156). That's according to IACA at least (according to the http://instlatx64.atw.hu/ spreadsheet, although it says `p237`, which is bogus because p7 only has a store-AGU, not a load port.) Anyway, sounds like `k`-register loads are done as GP-integer loads and then a `kmov` internally. – Peter Cordes Nov 18 '17 at 03:45
  • @PeterCordes Thank you for the valuable info. I guess it is more efficient (in my specific situation) to replace the cmp instruction that gave the 64-bit mask with 4 xmm comparisons that yield 16-bit masks. – HJLebbink Nov 18 '17 at 10:21
  • I guess, if you can do that without any extra shuffles. On SKL-AVX512, if you're not bottlenecked on the front-end, letting the compiler move to integer for the unpack is not too bad. – Peter Cordes Nov 19 '17 at 03:42
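
The unpacking discussed in the comments above (one 64-bit mask into four 16-bit masks, using shifts by 16, 32 and 48) might be sketched as follows. This is only an illustration under stated assumptions: the names `split_mask64`, `mask64_t` and `mask16_t` are hypothetical and not from the post. When the compiler defines `__AVX512BW__`, the sketch uses the `_kshiftri_mask64` intrinsic, which newer compiler releases provide; whether that actually produces a `kshiftrq` rather than a round trip through integer registers is up to the compiler. Otherwise it falls back to plain integer shifts, which is what the question's original line does.

```cpp
#include <array>
#include <cstdint>

#if defined(__AVX512BW__)
#include <immintrin.h>
using mask64_t = __mmask64;  // alias names are assumptions for this sketch
using mask16_t = __mmask16;
#else
// Portable stand-ins with the same widths as the mask types.
using mask64_t = std::uint64_t;
using mask16_t = std::uint16_t;
#endif

// Hypothetical helper: returns the four 16-bit chunks of m, lowest bits first.
inline std::array<mask16_t, 4> split_mask64(mask64_t m) {
    std::array<mask16_t, 4> parts{};
#if defined(__AVX512BW__)
    // The shift count must be a compile-time constant, so no loop here.
    parts[0] = static_cast<mask16_t>(m);
    parts[1] = static_cast<mask16_t>(_kshiftri_mask64(m, 16));
    parts[2] = static_cast<mask16_t>(_kshiftri_mask64(m, 32));
    parts[3] = static_cast<mask16_t>(_kshiftri_mask64(m, 48));
#else
    // Plain integer shifts; the cast keeps only the low 16 bits of each result.
    parts[0] = static_cast<mask16_t>(m);
    parts[1] = static_cast<mask16_t>(m >> 16);
    parts[2] = static_cast<mask16_t>(m >> 32);
    parts[3] = static_cast<mask16_t>(m >> 48);
#endif
    return parts;
}
```

For example, `split_mask64(0x4444333322221111ULL)` yields the chunks 0x1111, 0x2222, 0x3333 and 0x4444 in that order. Inspecting the generated assembly is the only way to confirm which instructions a given compiler version emits.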

0 Answers