
I am trying to find an efficient way to do the following in x86_64 assembly:

if(N < word_size) {
    dst[N] = 1; // as in Nth bit of dst = 1 
}
else {
    dst[word_size - 1:0] = 0
} 

Alternatively, I could get the desired result if the "else" case did not unset the other bits, or if the "if" case did unset the other bits. The important thing is that if N >= word_size it must not set any bits.

I am unable to find any instruction that might do this, as bt[s/c], shlx, sal, rol, and shld all appear to take the count modulo the operand width.

The use case is basically that I will be iterating over a bit vector with a known length and want to either A) find the first set bit and return its position, or B) test all of the bits and, if no set bit is found, return the length of the vector.

// rsi has length
L(keep_searching):
movq (%rdi), %rax
testq %rax, %rax
jnz L(found)
subq $64, %rsi
jbe L(done) // this done will return original value of rsi
addq $8, %rdi
jmp L(keep_searching)
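
In C terms, what I'm going for is roughly the following (purely illustrative; the function and parameter names are mine, and the point of the question is doing the boundary check cheaply in asm):

#include <stdint.h>

/* Illustrative only: return the index of the first set bit, or len_bits
 * if no bit is set within the first len_bits bits. */
static uint64_t first_set_or_len(const uint64_t *words, uint64_t len_bits) {
    for (uint64_t i = 0; i * 64 < len_bits; i++) {
        if (words[i]) {
            uint64_t pos = i * 64 + (uint64_t)__builtin_ctzll(words[i]);
            return pos < len_bits ? pos : len_bits;
        }
    }
    return len_bits;
}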

I figure this could be vastly sped up if I could quickly set a bit in rax when rsi < 64, so I could drop the second branch. But for this to work it needs to have the behavior above, i.e. it can't unconditionally set bit rsi % 64; it needs to set it iff rsi < 64.

Does anyone know of an instruction that can do this? Every instruction I can think of to check uses modulo on the count. Any help would be greatly appreciated.

Thanks!

A few versions that are working well for me for the 32-bit case. If I use MMX, as @PeterCordes pointed out, `psllq` is exactly what I want.


uint64_t __attribute__((noinline, noclone)) shift(uint64_t cnt) {
    uint64_t ret = 0;
    asm volatile(
        "cmpq $32, %[cnt]\n\t"
        "setbe %b[ret]\n\t"
        "shlxq %[cnt], %[ret], %[ret]\n\t"
        : [ ret ] "+r"(ret)
        : [ cnt ] "r"(cnt)
        : "cc");
    return ret;
}
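
For clarity, a rough plain-C equivalent of shift() (a sketch, not exactly what the compiler emits): the compare/setbe produces 0 or 1, and the shift count wrapping mod 64 doesn't matter because the value being shifted is already 0 whenever cnt is too big.

uint64_t shift_c(uint64_t cnt) {
    uint64_t bit = (cnt <= 32); /* cmp $32 + setbe */
    return bit << (cnt & 63);   /* shlx wraps the count mod 64 */
}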


uint64_t __attribute__((noinline, noclone)) shift2(uint64_t cnt) {
    uint64_t ret = 0, tmp = 0;
    asm volatile(
        "leaq -33(%[cnt]), %[tmp]\n\t"
        "movl $1, %k[ret]\n\t"
        "shlxq %[cnt], %[ret], %[ret]\n\t"
        "sarq $63, %[tmp]\n\t"
        "andq %[tmp], %[ret]\n\t"
        : [ ret ] "+r"(ret), [ tmp ] "+r"(tmp), [ cnt ] "+r"(cnt)
        :
        : "cc");
    return ret;
}


uint64_t __attribute__((noinline, noclone)) shift3(uint64_t cnt) {
    uint64_t ret, tmp;
    asm volatile(
        "leaq -33(%[cnt]), %[tmp]\n\t"
        "btsq %[cnt], %[ret]\n\t"
        "sarq $63, %[tmp]\n\t"
        "andq %[tmp], %[ret]\n\t"
        : [ ret ] "+r"(ret), [ tmp ] "+r"(tmp), [ cnt ] "+r"(cnt)
        :
        : "cc");
    return ret;
}
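
The idea behind shift2()/shift3(), again as a rough C sketch (assumes an arithmetic right shift of a signed value, which is what sarq does): cnt - 33 is negative for cnt <= 32, so shifting it right by 63 yields an all-ones mask; otherwise the mask is zero and the result is cleared.

uint64_t shift2_c(uint64_t cnt) {
    uint64_t mask = (uint64_t)((int64_t)(cnt - 33) >> 63); /* leaq -33 + sarq $63 */
    return (1ULL << (cnt & 63)) & mask;                    /* shlx/bts, then and */
}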
Noah
  • You could use the `setcc` instruction, where `cc` is the condition. `setcc` sets the specified register to 1 if the condition is true and to 0 if false. Seems like the perfect instruction for your situation. – mediocrevegetable1 Feb 10 '21 at 07:22
  • so you think ```cmpq $64, %rsi; setbe %rcx; shlx %rsi, %rcx, %rcx```? I'll give that a try and see how it runs. Nice idea! – Noah Feb 10 '21 at 07:39
  • x86 SIMD shifts (like `psllq xmm1, xmm2`) saturate the count instead of wrapping. But that seems unlikely to be useful for *this* case. Probably just do the last iteration separately so you can vectorize the search (checking 16 bytes (2 qwords or 4 dwords) at a time for all-non-zero). And definitely put one of the conditional branches at the bottom and drop the `jmp`, like `sub $64, %rsi` / `ja L(keep_searching)`. [Why are loops always compiled into "do...while" style (tail jump)?](https://stackoverflow.com/q/47783926) – Peter Cordes Feb 10 '21 at 08:40
  • `setcc r/m8` is inconvenient because it only works on an 8-bit register. But yeah if you had a zeroed register like RDX you could `setbe dl` / `shlx %rsi, %rdx, %rcx`. ORing that into RAX before a `test` / `jz keep_searching` seems unlikely to be worth it even if you don't want to use SIMD. A correctly predicted not-taken test/jnz is a single uop, cheaper than all the work you're doing to create an RAX value, and you'd have to decode RAX again when you're done. – Peter Cordes Feb 10 '21 at 08:46
  • @PeterCordes my thought with this is if I do it I'll be able to make all returns be one not taken branch. Another alternative to do that, however, is ```cmovcc``` – Noah Feb 10 '21 at 09:02
  • Why are your asm blocks inside inline asm? Are you actually writing this in C? Also, they don't need to be `volatile`: they're pure functions of the input, and only need to run to produce the output, not for any other hidden side effects. Also, `"+r"(tmp)` reads `tmp` uninitialized in shift3. (Same for ret). You want `"=r"` for those pure-outputs that don't need to be zeroed. – Peter Cordes Feb 10 '21 at 09:59
  • @PeterCordes it will be pure asm in the end. Just find it easier to test / play around with using inline asm. – Noah Feb 10 '21 at 18:45

2 Answers


Haven't verified, but

mov rax, 1  // common
mov rdx, 0  // common

cmp rcx, 64
shlx rbx, rax, rcx
cmova rbx, rdx

could be slightly more performant than the suggested alternative as the comparison and the shift are now independent and can be executed in parallel.

EDIT

From the use case it seems this may be an XY problem -- an efficient way to iterate over the bits of a bitset is to use the n & (n-1) trick or variants; popcount(n ^ (n-1)) gives one plus the index of the least set bit (tzcnt gives the index directly), and n &= n-1 will clear the LSB.
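
As a sketch of that iteration (using tzcnt via __builtin_ctzll rather than popcount; the function name is illustrative):

#include <stdint.h>

/* Sketch: visit every set bit of one 64-bit word, lowest first. */
static void for_each_set_bit(uint64_t n) {
    while (n) {
        unsigned idx = __builtin_ctzll(n); /* index of lowest set bit (tzcnt) */
        /* ... use idx ... */
        n &= n - 1;                        /* clear the lowest set bit (blsr) */
    }
}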

Aki Suihkonen
  • Getting about 10 - 15% slower with this method vs ```cmp``` + ```setcc```. – Noah Feb 10 '21 at 09:09
  • @Noah: Did you hoist the setup of RAX and RDX out of the loop? That's what "common" means here. – Peter Cordes Feb 10 '21 at 09:12
  • @PeterCordes no, my fault! hoisted, still slower but closer to 10% than 15% – Noah Feb 10 '21 at 09:16
  • the ```n &= (n - 1)``` isn't necessary because once I find a set bit I return. I can use ```popcnt(n ^ (n - 1))``` (or probably better yet ```tzcnt```) but then I need to figure out whether the lowest set bit is above or below the value specified in ```rsi```. That's doable, but I was trying to find a way so the search for a set bit and the bounds checking could be merged. Once I have it all working I will see if it pays off or if @PeterCordes is right and I'm better off with a second ```cmp``` + ```jcc``` – Noah Feb 10 '21 at 09:29
  • @Noah -- `tzcnt(n & mask)`, where `mask = 0xffffffff >> (N & 63)`; – Aki Suihkonen Feb 10 '21 at 09:42
  • @Noah: Oh, yeah if you *do* need to decode that value you're cooking up, branches inside the loop is almost certainly the best way to eventually end up with separate paths of execution for found vs. not-found. If you want to amortize more work over a fixed number of branches (for back-end throughput on pre-Haswell?), unroll by ORing together two qwords from memory and using the FLAGS result of that. – Peter Cordes Feb 10 '21 at 09:44
  • @Aki: I think Noah is expecting the bitset to have many whole qwords that are all 0 before they get to the first set bit. At least for the search loop being discussed; just a linear search for the first non-zero qword. Bit-iteration tricks like `n & (n-1)` ([BMI1 `blsr`](https://www.felixcloutier.com/x86/blsr)) are useful if you have multiple scattered bits in a qword. IDK why you'd use popcnt(blsmsk(n)) though, instead of tzcnt(n) which also produces 64 for the n==0 case. – Peter Cordes Feb 10 '21 at 09:57

SIMD shifts (like SSE2 psllq xmm1, xmm2) saturate the count, but that's unlikely to be useful here because I think you want to OR this into the data from memory as an end condition for a scalar version of this loop?

I'd be more inclined to use cmov from a zeroed register using FLAGS still set from sub. You can create 1<<(rsi&63) using BTS into a zeroed register before or after SUB; before SUB is good because BTS modifies CF. Note that rsi&63 is not affected by rsi -= 64.

This is probably not a good choice for a loop condition: just use a single-uop sub/ja, with a separate test/jnz that's normally not taken. One of these goes at the bottom instead of an unconditional jmp: that's the most obvious and basic optimization here: Why are loops always compiled into "do...while" style (tail jump)?

Or even better, use SSE2 (baseline for x86-64) to check 16 bytes (2 qwords or 4 dwords) at a time for all-non-zero. Or even POR together a couple of vectors to check them all at once if you expect not to find the first set bit soon, i.e. tune for large trip counts at the expense of slower handling of eventually finding it. (The last iteration can be scalar.)

(Have a look at glibc's strlen or especially memchr for more ideas about optimizing the large-array not-found-early case with SIMD. In that case they're using pminub to get a zero if any vector had a zero at that position, but you want the opposite: por to get a non-zero if any had a non-zero.)
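
A sketch of the 16-bytes-at-a-time check with intrinsics (names are illustrative; the partial final chunk and finding the exact bit index are left to a scalar tail):

#include <emmintrin.h>   /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Return a pointer to the first 16-byte chunk containing a non-zero byte,
 * or NULL if all full chunks are zero. Caller handles the tail + bit index. */
static const uint8_t *find_nonzero_chunk(const uint8_t *p, const uint8_t *end) {
    for (; p + 16 <= end; p += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)p);
        __m128i z = _mm_cmpeq_epi8(v, _mm_setzero_si128()); /* 0xFF where byte == 0 */
        if (_mm_movemask_epi8(z) != 0xFFFF)                 /* some byte non-zero */
            return p;
    }
    return NULL;
}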

ORing together two values from memory works for scalar, too, as a way to unroll.

    mov  (%rdi), %rax
    or  8(%rdi), %rax
    jnz  found
    ...
    add  $16, %rdi

But note that or/jnz is 2 uops while test/jnz is 1.
OTOH, getting cmpq $0, (%rdi) / jne to micro- and macro-fuse on Intel may not be possible; IIRC maybe with a register source. So memory-source or may be costing 2 more uops to do twice as much work, instead of just 1 more, if you tune really aggressively. You'd need to compare against a loop where you unroll and do two separate load/test/jcc or cmp-mem/jcc, to keep it fair for the loop overhead of pointer-increment logic. (And also unrolling logic to handle a possible odd number of qwords.)
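
For reference, the kind of loop you'd compare against: unrolled by two with separate load/test/jcc pairs per qword (registers and labels are illustrative, with the end pointer in %rdx):

L(loop2):                    # do {
    mov   (%rdi), %rax
    test  %rax, %rax
    jnz   L(found)           # first qword non-zero
    mov   8(%rdi), %rax
    test  %rax, %rax
    jnz   L(found8)          # second qword non-zero, at offset +8
    add   $16, %rdi
    cmp   %rdi, %rdx
    ja    L(loop2)           # }while(p < endp)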


But just as an exercise, let's see what we can do with your idea: in this case the non-zero shift result can be computed once ahead of time (because rsi-=64 doesn't change rsi%64), and hoisted out of the loop.

   xor  %edx, %edx
   bts  %rsi, %rdx        # rdx = 1 << (rsi&63)

// rsi has length
L(keep_searching):
   add  $8, %rdi
   xor  %eax, %eax        # need to re-create a zero every time
   sub  $64, %rsi
   cmovbe %rdx, %rax      # 0  or  1<<(rsi&63) to put a bit there for us to find

   or  -8(%rdi), %rax
   jz   L(keep_searching)    # loop while nothing found and end marker not injected

found_or_done:
   tzcnt %rax, %rax
   add   orig_rsi?, %rax
   ...

Unfortunately OR can't macro-fuse with JCC the way TEST can (or the way SUB can on Intel SnB-family). But memory-source OR is a single uop for the front-end.

Unfortunately cmovbe and cmova cost 2 uops because they need CF and ZF. (See What is a Partial Flag Stall? - recent Intel don't have partial-flag stalls or even merging, just CF vs. the rest (SPAZO), with uops reading both inputs separately if they need them.) But for no apparent reason, setbe and seta are also 2 uops (https://uops.info) - maybe Intel never updated the setcc uop format to work as a 3-input uop (including the full register that they merge into the low byte of). Fun fact: this leads to setcc in general only being able to decode in the "complex" decoder, if your code isn't in the uop cache: Can the simple decoders in recent Intel microarchitectures handle all 1-µop instructions?

The loop body is 7 uops total on Intel thanks to cmov. 6 on AMD.

Compare vs. 4 uops for a simple scalar search loop without this trick. This can run as fast as 1 cycle per iteration on Intel (Haswell and later can run 2 branches per clock as long as at most 1 is taken). Also on AMD Zen I think. So we're searching about 8 bytes per cycle, about half what we could do with SIMD. But the startup and end overhead is lower.

L(loop):                 # do {
    mov   (%rdi), %rax       # 1 uop
    test  %rax,%rax
    jnz   L(found)           # 1 uop (macro-fused)
    add   $8, %rdi           # 1 uop
    cmp   %rdi, %rdx
    ja    L(loop)            # }while(p < endp)   # 1 uop (macro-fused)
L(done):

If you counted a negative array index up towards zero, you could avoid having both add and sub in the loop: use FLAGS set by add. (Or, for the simple version, avoid the CMP by letting the add on the index set FLAGS for the loop branch; see the sketch below.)

You'd need an or (%r10, %r11, 8), %rax or something like that, but Haswell and later can keep that indexed addressing mode micro-fused as part of a 2-operand instruction with a RW destination: Micro fusion and addressing modes.
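
A sketch of that negative-index version of the simple loop (register choice illustrative: end pointer in %rdx, byte index counting up from -size in %rcx), where the ADD's FLAGS double as the loop condition and no separate CMP is needed:

L(loop_neg):                     # do {
    mov   (%rdx,%rcx), %rax      # load qword at end + negative offset
    test  %rax, %rax
    jnz   L(found)
    add   $8, %rcx               # counts up toward zero; sets ZF at the end
    jnz   L(loop_neg)            # }while(index != 0)
L(done_neg):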


setcc r/m8 is inconvenient because it only works on an 8-bit register. But if you had a zeroed register like RDX you could setbe dl / shlx %rsi, %rdx, %rcx. ORing that into RAX before a test / jz keep_searching seems unlikely to be worth it even if you don't want to use SIMD.

A correctly predicted not-taken test/jnz is a single uop, cheaper than all the work you're doing to create an RAX value.

Peter Cordes
  • Regarding the negative index incremented towards 0: you're saying ```addq $64, %rsi``` + ```cmovc``` is faster because it only requires the carry flag? – Noah Feb 10 '21 at 19:02
  • For some reason I can't get reasonable performance with ```cmovcc```. The ```setcc``` approach continues to outperform it (did hoist the ```btsq```). – Noah Feb 10 '21 at 19:11
  • Actually it's the ```add``` / ```sub```. ```cmp``` + ```cmovcc``` wins. – Noah Feb 10 '21 at 19:16
  • But you seem to be correct in saying that separate comparisons are the way to go – Noah Feb 10 '21 at 19:21
  • @Noah: tiny loops can be finicky and show differences for no obvious reason. But usually that happens when there's a loop-carried dependency chain, or especially two separate ones, that are bottlenecks, so uop scheduling could miss progress on them. That wouldn't be the case here; I'd expect just total uop count, and possibly back-end ports, to matter. Maybe code size and/or alignment of the top of the loop can matter for how it packs into the uop cache; you can isolate that with an `align 64` before the loop. – Peter Cordes Feb 10 '21 at 21:36
  • @Noah: Yes, `cmovc` = `cmovb` and all other forms of cmov (like `cmovle`) are single-uop; only the two I mentioned, `cmova` / `cmovbe` (and their `na` synonyms), need 2 uops. And yes, it's because they need CF and another flag (ZF), as well as 2 GP integer register inputs, for a total of 4, but a single uop can only have 3 inputs. This is all Intel-only; AMD has single-uop cmova and everything else. – Peter Cordes Feb 10 '21 at 21:39
  • wanted to bounce an idea off you. I was thinking it might be fun to post some .S file from glibc (like strlen-avx2.S) to either the micro-optimization tag or codereview like once a month, see what the community can come up with to improve the code, then submit a patch to glibc. Reason I ask is this question is for a potential patch I'm making for strlen-avx2.S, and I realize there are so many questionable design decisions that I'm probably going to need to ask a few more questions, so why not cut out the middle man. What do you think? – Noah Feb 15 '21 at 01:00
  • @Noah: Interesting idea, yeah; I've been meaning to submit a patch for strcasecmp to more efficiently force ASCII to lowercase but never got around to it. Probably codereview.SE would be a good place, although I don't check it regularly (so I guess ping me somehow). I'm not sure SO would work as well; without a specific use-case and/or microarchitecture in mind, it's not as specific as SO would normally want, although multiple different answers that are good for different cases could be useful to future readers. – Peter Cordes Feb 15 '21 at 01:08
  • @Noah: As far as posting original glibc source in questions, I'm not sure the LGPL is compatible with SO/SE's CC-BY-SA; since it's not your code we should check on that. I think SO/SE frowns upon posting code with a disclaimer that it's under a separate license and can't be copied the same way. It would also make it difficult for people to base answers on it; the answers would be derivative of the LGPL original, so still arguably not pure CC-BY-SA. Probably you should ask on meta.stackoverflow or meta.stackexchange.com about the whole idea and the license issue. – Peter Cordes Feb 15 '21 at 01:08
  • 2
    @PeterCordes [CC-BY SA 4.0 -> GPLv3](https://creativecommons.org/2015/10/08/cc-by-sa-4-0-now-one-way-compatible-with-gplv3/) only IIRC @ Noah if the code is not your code then no matter the licence you won't be allowed to post on Code Review, for the [authorship of code reason](https://codereview.meta.stackexchange.com/q/1294). – Peilonrayz Feb 15 '21 at 01:19
  • @PeterCordes "SO/SE's CC-BY-SA"? It's a bit tough. The question is essentially "what optimizations could be made to this implementation of ". Not a super specific question, so I see your issues with SO, but I tend to think it will be more of an informative resource for micro-optimization / whatever arch than standard code reviews are. – Noah Feb 15 '21 at 01:19
  • Well as @Peilonrayz just pointed out I guess SO is the only possibility. My guess is people would find it fun and likely a lot of common optimization techniques would be explained and applied to a clear example that would serve as useful references. – Noah Feb 15 '21 at 01:25
  • @Noah: Yeah, I've often linked glibc strlen or memcmp as examples of how to optimize something like that in asm. It seems that posting GPL work under CC-BY-SA4.0 is not legal, though (only the reverse), so there's still a question whether you can quote the original code being discussed, which I think would be a big improvement over just linking and asking an open-ended question about how to implement it. A good question might dissect and explain the good parts of the glibc code, like how it works and what optimizations are already present, as well as asking for improvements. – Peter Cordes Feb 15 '21 at 01:29
  • @PeterCordes that's fair. I am no lawyer, but if you can link the original and likely paste snippets to ask questions about them, it seems you could build an interesting post: "Understanding and improving the current optimized implementation of for " -> link to post, interesting snippets + explanations, snippets that seem most prime for being improved? Would that cross the boundary? – Noah Feb 15 '21 at 01:34
  • @PeterCordes > "I've been meaning to submit a patch for strcasecmp to more efficiently force ASCII to lowercase but never got around to it" -- Can you put me on it? goldstein.w.n@gmail.com. I don't like to follow the devel thread as it's spam AF. Recently did strchr-avx2.S, working on strlen-avx2.S, was planning to do memchr / strcmp next. – Noah Feb 15 '21 at 01:38
  • @Noah: You should really ask on meta.stackoverflow.com whether something like that would be good. It seems ok to me, though. re: strcasecmp: See `upcase_si128` in [Convert a String In C++ To Upper Case](https://stackoverflow.com/a/37151084). IIRC, it's especially nice with AVX so it doesn't even need more `movdqa` than glibc's version. – Peter Cordes Feb 15 '21 at 01:43
  • A slight improvement for 32-bit is to use ```addq``` instead of ```orl``` in the ```keep_searching``` loop and then or when found. I.e. [this](https://godbolt.org/z/eG686M) should be ```addq %rdi, %rsi; jnz L5```. – Noah Mar 15 '21 at 17:33
  • @Noah: Oh good point. Hint for Godbolt: just prototype functions that you don't want to inline; no need to define them and then have to stop the compiler from knowing about their internals. e.g. https://godbolt.org/z/WTG7x4 is easier to look at and not cluttered with their definitions. (Not using `jcc` directly as a tailcall is a missed optimization I've reported a while ago, still not fixed :/) – Peter Cordes Mar 15 '21 at 20:33
  • @Noah: you'd normally be using SSE2 pcmpeqb / pmovmskb / bsf or some variation thereof. (Maybe with a shift for the first unaligned vector, to shift out bits from an aligned 16 bytes that aren't part of the string). No obvious use-case for cmov. If you have a more specific question about tuning memchr, ask it as a separate SO question, or look at what glibc memchr does. – Peter Cordes Mar 17 '21 at 02:02
  • @PeterCordes yeah realized that was probably a bit much. But in the [remaining len <= 128 case](https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memchr-avx2.S.html#198) there is a decision about ```cmovcc``` vs branches for how to return ptr to data vs ```NULL```. – Noah Mar 17 '21 at 04:40
  • @Noah: Oh right, for the not-found case. I'd guess that branch might predict well; a lot of code will use memchr on buffers that definitely do contain a hit (in non-error cases). Or if not-found happens at all, it might be common. Still, on modern x86 where CMOV is only single uop, I'd consider it if I could avoid too many extra MOV and xor-zeroing uops to make it work. OTOH, letting speculative exec decouple the NULL return from the data dependency on the compares is really good, and if you got to the last vector of the buf then probably it's not there either, so correlation with prev JCC. – Peter Cordes Mar 17 '21 at 04:48