
I have a __m128i register holding the result of a SIMD comparison, which looks something like this:

0, 0, -1, -1, 0, 0, 0, 0 // in shorts
0, -1, 0, 0 // in ints

What is the fastest/cheapest way to get the position of the int whose bits are set? Exactly one int inside the __m128i is set to -1 (all bits set).

Example:

-1, -1, 0, 0, 0, 0, 0, 0  ->  0
0, 0, -1, -1, 0, 0, 0, 0  ->  1
0, 0, 0, 0, -1, -1, 0, 0  ->  2
0, 0, 0, 0, 0, 0, -1, -1  ->  3

One additional note: I only have AVX and lower available, so no AVX2 or AVX-512. I'm using C++ and Intel intrinsics.


Edit: This is my current code:

__m128i comparableLow = _mm_set_epi32(key - 1, key - 1, key - 1, key - 1);
__m128i comparableHigh = _mm_set_epi32(key + 1, key + 1, key + 1, key + 1);

__m128i mData = _mm_loadu_si128((__m128i*)(arr));
__m128i l1 = _mm_cmpgt_epi32(mData, comparableLow);
__m128i u1 = _mm_cmplt_epi32(mData, comparableHigh);
__m128i r1 = _mm_and_si128(u1, l1);
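
For integer lanes, the range test above can be checked against a plain equality compare. The following is only a sketch: the helper names `match_by_range`, `match_by_equality`, and `same_mask` are hypothetical and not part of the original code. It assumes `key - 1` and `key + 1` do not overflow, i.e. `key` is neither `INT_MIN` nor `INT_MAX`.

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics

// Range test as written in the question: (x > key - 1) & (x < key + 1).
// Assumes key is not INT_MIN or INT_MAX, so key - 1 and key + 1 do not overflow.
static inline __m128i match_by_range(__m128i data, int key) {
    __m128i low  = _mm_set1_epi32(key - 1);
    __m128i high = _mm_set1_epi32(key + 1);
    return _mm_and_si128(_mm_cmpgt_epi32(data, low),
                         _mm_cmplt_epi32(data, high));
}

// Plain equality test on the same four 32-bit lanes.
static inline __m128i match_by_equality(__m128i data, int key) {
    return _mm_cmpeq_epi32(data, _mm_set1_epi32(key));
}

// True when the two masks are bit-for-bit identical (hypothetical helper).
static inline bool same_mask(__m128i a, __m128i b) {
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}
```

If the two masks agree for the inputs of interest, the single `_mm_cmpeq_epi32` is the cheaper form. A real range lookup would need bounds other than `key - 1` and `key + 1`.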
Paul R
NFoerster
  • 3
    `vmovmskps` / `bsf` (or `tzcnt`). See [Get the last line separator](https://stackoverflow.com/q/50496029). If this is the result of a `vpcmpeqd` or `vcmpps`, you have dword elements so you can use `movmskps` to get a bitmap of the high bit. Or if you always have pairs of `int16_t`? IDK why you're showing it as `int16_t` elements if they always come in 32-bit chunks. – Peter Cordes May 24 '18 at 07:23
  • 1
    Possible duplicate of [Get the last line separator](https://stackoverflow.com/questions/50496029/get-the-last-line-separator) (except that was for byte elements so maybe not). – Peter Cordes May 24 '18 at 07:23
  • Sorry, one thing to add: I'm using C++, not assembler. – NFoerster May 24 '18 at 07:28
  • 1
    Then use intrinsics for those instructions, to get your compiler to emit them, of course. `unsigned bitmap = _mm_movemask_ps( _mm_castsi128_ps(v));` / `int pos = _bit_scan_forward(bitmap);`. (Or whatever BSF intrinsic your compiler likes best: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=411,3941,3941,433,1508,408,5534,392&techs=AVX,AVX2,Other&text=bsf) – Peter Cordes May 24 '18 at 07:30
  • Is there some reason you aren't using `_mm_cmpeq_epi32(mData, _mm_set1_epi32(key))`? That's equivalent to your `(x > (key-1)) & (x < (key+1))` unless I'm missing something. And just to clarify, you want the result as an `int`, rather than as the low element of a `__m128i`, right? But movmskps / bsf is probably still the fastest way, even if it takes a `movd` to get back to an `__m128i`. – Peter Cordes May 24 '18 at 07:33
  • I'm using the comparison for tree traversal, so in the internal nodes I will not find the key itself; I have to determine which range the key falls in to follow the correct pointer (or offset). – NFoerster May 24 '18 at 07:43
  • 1
    Then either your example is over-simplified from your real code (with `+1` / `-1` instead of your actual range), or you have a bug, because as written I think it's exactly equivalent to checking for exact equality only. But ok, you'll want the offset as an `int` or `unsigned int`, so that's good, `movmskps` / `bsf` is exactly what you want. – Peter Cordes May 24 '18 at 07:47
  • BTW, you might want to align your b-tree elements or whatever they are, if that doesn't waste a lot of space, so your loads never cross a cache-line boundary. On Intel hardware, cache-line split loads add at least 5 cycles of load-use latency to pointer chasing in the critical path of your tree-traversal. – Peter Cordes May 24 '18 at 07:52
  • Can you give me a hint on how to determine the cache-line size and how to avoid the crossing? – NFoerster May 24 '18 at 08:01
  • Assume the cache-line size is 64B. That's the case on all CPUs with AVX, and is unlikely to change in the future. But you only need your data to be 16-byte aligned so your vector load will be naturally aligned and will never cross any boundaries wider than that. – Peter Cordes May 24 '18 at 08:22
  • 1
    And BTW, this may not actually be a duplicate of anything! I couldn't find any other SO posts with `[x86] code:_mm_movemask_ps _BitScanForward`. movemask -> bitscan is a pretty well-known idiom, at least I thought it was. You'll find it in most libc `memchr`/`strchr`. Often people only want to know if any or all elements met a condition, rather than *which one*, or to use the `movemask` result as an integer index, but I'm really surprised I didn't find an existing answer with that code. But an asm and C versions of it got posted in 1 day! I may type up an answer if I get around to it. – Peter Cordes May 24 '18 at 08:27
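
The movemask-then-bit-scan idiom suggested in the comments could look roughly like the sketch below. The function name `first_set_lane` and the `-1` return value for an all-zero mask are assumptions made for illustration; the comments only name `_mm_movemask_ps`, `_mm_castsi128_ps`, and a BSF-style bit scan. The compiler-specific bit-scan calls (`__builtin_ctz` on GCC/Clang, `_BitScanForward` on MSVC) stand in for whichever BSF/TZCNT intrinsic your toolchain prefers.

```cpp
#include <emmintrin.h>  // SSE2; also pulls in _mm_movemask_ps from <xmmintrin.h>
#if defined(_MSC_VER)
#include <intrin.h>     // _BitScanForward
#endif

// Returns the index (0..3) of the lowest 32-bit lane whose sign bit is set,
// or -1 if no lane is set. The -1 convention is an assumption of this sketch.
static inline int first_set_lane(__m128i mask) {
    // One bit per 32-bit lane, taken from each lane's top bit.
    unsigned bitmap =
        static_cast<unsigned>(_mm_movemask_ps(_mm_castsi128_ps(mask)));
    if (bitmap == 0) {
        return -1;  // bit scan is undefined for a zero input
    }
#if defined(_MSC_VER)
    unsigned long pos;
    _BitScanForward(&pos, bitmap);
    return static_cast<int>(pos);
#else
    return __builtin_ctz(bitmap);  // count trailing zeros = lowest set bit
#endif
}
```

With the question's code, `int pos = first_set_lane(r1);` would give the lane index directly as an `int`. If exactly one lane is always guaranteed to match, the zero check can be dropped.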

0 Answers