Why does my "=r"(var) output not pick the same register as "a"(var) input?

Question

I'm learning how to use __asm__ volatile in GCC and ran into a problem. I want to implement a function that performs an atomic compare-and-exchange and returns the value that was previously stored in the destination.

Why does an "=a"(expected) output constraint work, but an "=r"(expected) constraint lets the compiler generate code that doesn't work?

Case 1.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

uint64_t atomic_cas(uint64_t * destination, uint64_t expected, uint64_t value){
    __asm__ volatile (
        "lock cmpxchgq %3, %1":
        "=a" (expected) :
        "m" (*destination), "a" (expected), "r" (value) :
        "memory"
    );

    return expected;
}

int main(void){
    uint64_t v1 = 10;
    uint64_t result = atomic_cas(&v1, 10, 5);
    printf("%" PRIu64 "\n", result);           //prints 10, the value before, OK
    printf("%" PRIu64 "\n", v1);               //prints 5, the new value, OK
}

It works as expected. Now consider the following case:

Case 2.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

uint64_t atomic_cas(uint64_t * destination, uint64_t expected, uint64_t value){
    __asm__ volatile (
        "lock cmpxchgq %3, %1":
        "=r" (expected) : // <-- replaced "a" with "r", expecting GCC to infer the register from the inputs
        "m" (*destination), "a" (expected), "r" (value) :
        "memory"
    );

    return expected;
}

int main(void){
    uint64_t v1 = 10;
    uint64_t result = atomic_cas(&v1, 10, 5);
    printf("%" PRIu64 "\n", result);            //prints 5, wrong
    printf("%" PRIu64 "\n", v1);                //prints 5, the new value, OK 
}

I examined the generated assembly and noticed the following:

I. In both cases the standalone function code is the same and looks like this:

   0x0000555555554760 <+0>:     mov    rax,rsi
   0x0000555555554763 <+3>:     lock cmpxchg QWORD PTR [rdi],rdx
   0x0000555555554768 <+8>:     ret

II. The problem appears when GCC inlines atomic_cas: in the latter case the correct value is not passed to printf. Here is the relevant fragment of disas main:

0x00000000000005f6 <+38>:    lock cmpxchg QWORD PTR [rsp],rdx
0x00000000000005fc <+44>:    lea    rsi,[rip+0x1f1]        # 0x7f4
0x0000000000000603 <+51>:    mov    rdx,rax ;  <-- this instruction is absent in Case 2
0x0000000000000606 <+54>:    mov    edi,0x1
0x000000000000060b <+59>:    xor    eax,eax

QUESTION: Why does replacing rax ("a") with an arbitrary register ("r") produce a wrong result? I expected it to work in both cases.

UPD. I compile with the following flags: -Wl,-z,lazy -Warray-bounds -Wextra -Wall -g3 -O3

It's just restoring the registers used. How exactly does it not work? What compiler flags are you passing? — S.S. Anne, Oct 04 '19 at 15:18
@JL2210 The second case does not return the value that was stored before the CAS. It returns the newly stored value. What confused me most was that the function's assembly is the same, but inlining makes the result different. — St.Antario, Oct 04 '19 at 15:21
What version of GCC are you using? I can't reproduce the assembly. — S.S. Anne, Oct 04 '19 at 15:28
@JL2210 I use GCC 7.4.0. Can you share your assembly for the case 2? — St.Antario, Oct 04 '19 at 15:30
@JL2210 _Is the 10 vs 11 difference intentional?_ That was a typo. Fixed, thanks. — St.Antario, Oct 04 '19 at 15:31
@JL2210 Why did `atomic_cas` appear in the PLT section? Did you link it dynamically? — St.Antario, Oct 04 '19 at 15:37
@JL2210 I copy-pasted the flags you referred to in the pastebin and still got the wrong assembly. Do you use `gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0` on Ubuntu 18.04? — St.Antario, Oct 04 '19 at 15:51
No, I use `gcc version 8.3.0 (GCC)`. I'm not on Ubuntu and I never will be. — S.S. Anne, Oct 04 '19 at 15:53
Your GDB disassembly for the working version doesn't match the source. GDB is putting a string constant into RSI, the 2nd arg, and the CAS return value into RDX, the 3rd arg. And a `1` into the first arg. Does your Ubuntu GCC expand printf to `dprintf(int fd, const char *fmt, ...)`? — Peter Cordes, Oct 04 '19 at 16:13

interjay · Answer 1 · 2019-10-04T20:21:56.700

5

The cmpxchg instruction always leaves its result in the rax register. So you need to use the a constraint to tell GCC to read the output from that register. In case 2, you tell GCC to use an arbitrary register instead by using r, but your template never puts anything in that register.

If you want to use r, you'll have to add a mov instruction to move the result from rax to that register (movq %%rax, %0). You'd also have to tell GCC that the rax register is changed by the instruction, for example by adding it to the "clobbers" section of the asm statement. For your case, there isn't a reason to complicate things in this manner.

edited Oct 04 '19 at 20:21

answered Oct 04 '19 at 15:51

interjay


Sounds reasonable. – St.Antario Oct 04 '19 at 15:53
**`mov %%rax, %0` would not be safe**. `cmpxchg` would still destroy the input RAX (on failure), but `"a"(expected)` is a read-only input. You'd have to use `"+r"(expected)` or a dummy `"=a"` output if you wanted `"=r"(oldval)` to work. But that's massively overcomplicated and has a useless `mov %rax,%rax` in cases where the compiler *does* pick RAX as the output so **it's much better to leave the `mov` generation to the compiler** and just accurately describe as small an asm template as possible using constraints. – Peter Cordes Oct 04 '19 at 17:17

Peter Cordes · Accepted Answer · 2019-10-04T19:34:22.153

4

First of all, https://gcc.gnu.org/wiki/DontUseInlineAsm. There is basically zero reason to roll your own CAS, vs. using bool __atomic_compare_exchange(type *ptr, type *expected, type *desired, bool weak, int success_memorder, int failure_memorder) https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html. This works even on non-_Atomic variables.
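As a minimal sketch of that suggestion, the question's atomic_cas could be written on top of the `__atomic_compare_exchange_n` variant of the builtin (it takes the desired value directly instead of through a pointer). The helper name check_atomic_cas and the choice of `__ATOMIC_SEQ_CST` for both memory orders are assumptions made for illustration, not something the question specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns the value that was in *destination before the operation,
   like atomic_cas in the question, but without any inline asm. */
uint64_t atomic_cas(uint64_t *destination, uint64_t expected, uint64_t value)
{
    /* On failure the builtin writes the current contents of *destination
       into expected; on success expected already equals the old contents. */
    __atomic_compare_exchange_n(destination, &expected, value,
                                false, /* strong CAS, no spurious failure */
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}

/* Hypothetical self-check mirroring main() from the question. */
void check_atomic_cas(void)
{
    uint64_t v1 = 10;
    assert(atomic_cas(&v1, 10, 5) == 10); /* success: old value comes back */
    assert(v1 == 5);
    assert(atomic_cas(&v1, 10, 7) == 5);  /* failure: current value comes back */
    assert(v1 == 5);
}
```

Because the compiler understands the builtin, it can inline it and pick registers freely without any constraint bookkeeping on your side.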

"=r" tells gcc it can ask for the output in whatever register it wants, so it can avoid having to mov the result there itself. (Like here where GCC wants the output in RSI as an arg for printf). And/or so it can avoid destroying the input it put in the same register. That's the entire point of =r instead of specific-register constraints.

If you want to tell GCC that the register it picks for input is also the output register, use "+r". Or in this case since you need it to pick RAX, use "+a"(expected).
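A sketch of the question's function rewritten with "+a"(expected) might look like the following. The name atomic_cas_asm is hypothetical, the "+m" constraint on the destination is my addition (the instruction both reads and writes that memory), and the `#else` branch is only an assumed fallback so the sketch still builds on targets other than x86-64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

uint64_t atomic_cas_asm(uint64_t *destination, uint64_t expected, uint64_t value)
{
#if defined(__x86_64__)
    __asm__ volatile (
        "lock cmpxchgq %2, %1"
        : "+a" (expected),      /* %0: RAX, read as input and written as output */
          "+m" (*destination)   /* %1: memory operand, read and written */
        : "r" (value)           /* %2: new value, in any register GCC likes */
        : "memory", "cc"
    );
#else
    /* Assumed fallback: keeps the sketch buildable on other targets. */
    __atomic_compare_exchange_n(destination, &expected, value, false,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
#endif
    return expected; /* now holds the previous contents of *destination */
}

/* Hypothetical self-check mirroring main() from the question. */
void check_atomic_cas_asm(void)
{
    uint64_t v1 = 10;
    assert(atomic_cas_asm(&v1, 10, 5) == 10);
    assert(v1 == 5);
}
```

With "+a" there is a single operand that GCC knows is both consumed and produced in RAX, so after inlining it emits whatever mov is needed to get the result to where the caller wants it.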

There's already syntax for making the compiler pick the same register for 2 constraints with separate variables for input and output, specifically matching constraints: "=r"(outvar) : "0"(invar).
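To illustrate the matching-constraint syntax in isolation, here is a small sketch using a destructive x86-64 instruction (negq) that overwrites its only operand. The function name negate_u64 is hypothetical, and the `#else` branch is an assumed fallback for non-x86-64 targets.

```c
#include <assert.h>
#include <stdint.h>

uint64_t negate_u64(uint64_t in)
{
    uint64_t out;
#if defined(__x86_64__)
    /* "0"(in) asks GCC to place in in the same register it chose for
       operand 0, so the in-place negq turns the input into the output. */
    __asm__ ("negq %0"
             : "=r" (out)
             : "0" (in)
             : "cc");
#else
    out = 0 - in; /* assumed fallback so the sketch builds elsewhere */
#endif
    return out;
}

/* Hypothetical self-check. */
void check_negate_u64(void)
{
    assert(negate_u64(0) == 0);
    assert(negate_u64(5) == UINT64_MAX - 4); /* 2^64 - 5 */
}
```

From the C side, in is still a read-only input; only out is written, even though both live in the same register while the template runs.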

It would be a missed optimization if the syntax didn't let you describe a non-destructive instruction that could produce output in a different register from the input(s).

You can see what GCC actually picked by expanding the operands inside an asm comment in the template.

Remember that GNU C inline asm is just text substitution into your template. The compiler literally has no idea what the asm instructions do, and doesn't even check they're valid. (That only happens when the assembler reads the compiler output).

    ...
    asm volatile (
    "lock cmpxchgq %3, %1   # 0 out: %0  |  2 in: %2" 
    : ...
    ...

The resulting asm shows the problem very clearly (Godbolt GCC7.4):

        lock cmpxchgq %rsi, (%rsp)   # 0 out: %rsi  |  2 in: %rax
        leaq    .LC0(%rip), %rdi
        xorl    %eax, %eax
        call    printf@PLT

(I used AT&T syntax so your cmpxchgq %reg,mem would match the mem,reg operand order documented by Intel, although both GAS and clang's built-in assembler seem to accept it in the other order, too. Also because of the operand-size suffix)

GCC takes the opportunity to ask for the "=r"(expected) output in RSI as an arg for printf. Your bug is that your template makes a wrong assumption that %0 will expand to rax.

There are lots of examples of the lack of implicit connection between input and output that happen to use the same C var. For example, you can swap 2 C variables with an empty asm statement, just using constraints. See: How to write a short block of inline gnu extended assembly to swap the values of two integer variables?
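A minimal sketch of that swap trick, assuming GCC or clang: the template is empty, so all the work is done by the constraints. The names swap_ints and check_swap_ints are hypothetical.

```c
#include <assert.h>

void swap_ints(int *a, int *b)
{
    int x = *a, y = *b;
    /* Operand 0 is the output x, operand 1 is the output y.
       "0"(y) puts the old y into operand 0's register, and
       "1"(x) puts the old x into operand 1's register.
       No instructions are emitted by the template itself. */
    __asm__ ("" : "=r" (x), "=r" (y) : "0" (y), "1" (x));
    *a = x;
    *b = y;
}

/* Hypothetical self-check. */
void check_swap_ints(void)
{
    int a = 1, b = 2;
    swap_ints(&a, &b);
    assert(a == 2 && b == 1);
}
```

The compiler only sees "x is an output, y is an input in the same place", which is exactly why the same C variable appearing in both lists of the question's asm statement creates no link between them.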

edited Oct 04 '19 at 19:34

answered Oct 04 '19 at 16:20

Peter Cordes


_"=r" tells gcc it can ask for the output in whatever register it wants_ But since I wrote `return expected;` should not the value in a whatever register GCC picked for `"=r" (expected)` moved to `rax` on return? So the semantic of `atomic_cas` would be: "Pick whatever output register you want for the inline asm, but as soon as the inline asm is done, return the output value from the register that was picked". – St.Antario Oct 06 '19 at 14:45
@St.Antario: Yes, that's why GCC picks RAX for the non-inline version. But part of the point of inlining is to get rid of calling-convention overhead like having to move `expected` into RAX. I'm not sure if you think that leaving a value in RAX will end up getting it returned even though the compiler wants to return a value it chose to ask for in a different register, but I hope not because that doesn't make sense. Just like falling off the end of a non-`void` function after inline asm leaves something in RAX, but worse the compiler would destroy RAX with a `mov` from some other reg. – Peter Cordes Oct 06 '19 at 14:51
@St.Antario: Internally in the compiler, optimization including inlining happens on an [SSA form](https://en.wikipedia.org/wiki/Static_single_assignment_form) of your program logic (GIMPLE), not on asm instructions. At that level, register allocation hasn't even happened yet. So by the time we get to an RTL pass where register allocation has to happen, inlining has already happened and the `"=r"` output can just connect directly to where GCC wants it in the caller. – Peter Cordes Oct 06 '19 at 14:54
It would be a missed optimization if it copied to RAX and back, and it would be wrong code if it copied from RAX *instead* of the reg it picked as the output of the asm statement. – Peter Cordes Oct 06 '19 at 14:54
As mentioned in the ExtendedAsm manual: _The mere fact that foo is the value of both operands is not enough to guarantee that they are in the same place in the generated assembler code._ Also, by specifying simply `"=r" (expected)` I told GCC neither that `lock cmpxchg` clobbers `rax` (which is also an input operand) nor that the result ends up in exactly the `rax` register, so I missed the optimization. So the version with matching constraints, where input and output are tied, looks clearer to me: `"=a" (expected) : "m" (*destination), "0" (expected), "r" (value)` – St.Antario Oct 06 '19 at 16:14
@St.Antario: I'm not sure what point you were trying to make with your first comment in this thread. Were you trying to summarize, or asking for clarification? I assumed clarification... – Peter Cordes Oct 06 '19 at 16:19
@St.Antario: But no, writing `"=r"(expected)` isn't a "missed optimization", it's a *bug* in your source code. Other than that, the facts you state in your last comment are true, but IDK what point you're making. IMO the clearest is `"+a"(expected)`, or using a separate variable for the return value with `"=a"(output)` and `"a"(expected)`. Matching constraints are unnecessary when you have a fixed register. `return expected` is weird because at that point it's *not* the expected value anymore. (Not to mention that you also left out flag-output constraints for a boolean status). – Peter Cordes Oct 06 '19 at 16:22
In the previous message I tried to summarize what I got combining your answer and [ExtendedAsm manual](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Extended-Asm). – St.Antario Oct 07 '19 at 06:49
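The variant raised in the comments above, with a separate output variable in RAX and a flag output for the boolean status, could be sketched roughly as follows. The function name atomic_cas_status and its old out-parameter are hypothetical. The sketch assumes a compiler that defines `__GCC_ASM_FLAG_OUTPUTS__` (GCC 6 or later on x86) and falls back to the `__atomic` builtin otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the exchange happened; *old receives the previous value. */
bool atomic_cas_status(uint64_t *destination, uint64_t expected,
                       uint64_t value, uint64_t *old)
{
#if defined(__x86_64__) && defined(__GCC_ASM_FLAG_OUTPUTS__)
    bool success;
    uint64_t previous;
    __asm__ volatile (
        "lock cmpxchgq %4, %1"
        : "=@ccz" (success),     /* %0: ZF is set when the exchange happened */
          "+m" (*destination),   /* %1: memory operand, read and written */
          "=a" (previous)        /* %2: RAX after the instruction */
        : "a" (expected),        /* %3: RAX before the instruction */
          "r" (value)            /* %4: new value */
        : "memory"
    );
    *old = previous;
    return success;
#else
    /* Assumed fallback for compilers without flag-output support. */
    bool success = __atomic_compare_exchange_n(destination, &expected, value,
                                               false, __ATOMIC_SEQ_CST,
                                               __ATOMIC_SEQ_CST);
    *old = expected;
    return success;
#endif
}

/* Hypothetical self-check. */
void check_atomic_cas_status(void)
{
    uint64_t v1 = 10, old = 0;
    assert(atomic_cas_status(&v1, 10, 5, &old) && old == 10 && v1 == 5);
    assert(!atomic_cas_status(&v1, 10, 7, &old) && old == 5 && v1 == 5);
}
```

Using distinct C variables for the input and the output makes it explicit that the two operands are unrelated apart from both being constrained to RAX.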
