How to exchange top of stack with register without implicit lock on latest 64 bit Intel CPUs?

Question

The Windows x64 calling convention passes up to the first 4 parameters in registers (rcx, rdx, r8, r9) and passes the rest of the parameters on the stack. In this case the obvious way of dealing with the additional parameters in an asm procedure would be the following:

procedure example(
  param1, //rcx
  param2, //rdx
  param3, //r8
  param4, //r9
  param5,
  param6
);
asm
  xchg param5, r14 // non-volatile registers, should be preserved
  xchg param6, r15 // non-volatile registers, should be preserved

  // ... procedure body, use r14–r15 for param5–param6

  mov r15, param6
  mov r14, param5  
end;

But there is a huge problem here: if memory operands are involved, XCHG instructions on Intel CPUs have an implicit LOCK, which also means a huge performance penalty; that is, the bus would be locked for hundreds of clock cycles in the worst case. (BTW, I cannot really understand this implicit LOCK, given that there are really usable and smart interlocked instructions like XADD, CMPXCHG, BTS/BTR etc.; the naked XCHG would be my last option if I needed thread synchronization.) So what should I do here if I want something short and elegant for using/saving/restoring param5 and param6 in/from registers? Is there perhaps a hack for preventing bus locking for XCHG instructions? Generally, what is the standard, widely used way of handling this situation?

Why do you need `xchg` at all? Just `mov` and be done with it. — Jester, Apr 27 '19 at 13:57
You can swap using a temp register and 3 moves, or swap using 3 xors (a ^=b, b^=a, a^=b). — rcgldr, Apr 27 '19 at 14:26
@Jester, sorry, I need `r10` and `r11` for other purposes, let me change the above example to `r14` and `r15`. That is, speaking about registers which should be preserved by the end of the routine. — Zoltán Bíró, Apr 27 '19 at 16:56
@rcgldr, if I had free temp registers I wouldn’t need a swap at all. — Zoltán Bíró, Apr 27 '19 at 16:57
Despite the question title, my answer on the linked duplicate [swapping 2 registers in 8086 assembly language(16 bits)](//stackoverflow.com/q/26469196) *does* exactly address swapping a register with memory efficiently, avoiding `xchg` because of the implicit lock prefix. **The standard widely-used way is to spill (and later reload) something else to free up a tmp reg.** — Peter Cordes, Apr 27 '19 at 17:05
LOCK doesn't imply locking the bus on modern CPUs, instead it just results in a cache lock which would happen anyways. The cost comes from the fact LOCK'd instructions stall the pipeline in order to provide strict memory order semantics. — Ross Ridge, Apr 27 '19 at 17:06
@RossRidge: That's true if the memory operand doesn't cross a cache-line boundary. I'm not clear on exactly how bad a split `lock` is, but it's pretty bad because it has to ensure atomicity across both lines, even for non-CPU observers like DMA. With cache-coherent DMA, maybe it can just lock both lines, though. I think we can assume that's not a possible concern here, and yeah it's still more efficient to spend more instructions to avoid `xchg`. — Peter Cordes, Apr 27 '19 at 17:08
@PeterCordes Yah, I was going to answer the question by saying do what compilers do. Load the arguments from the stack into registers as necessary, spilling registers to their own locations on the stack if need be. — Ross Ridge, Apr 27 '19 at 17:21
@RossRidge: Hmm, maybe it's not an actual duplicate, an answer addressing the X-Y problem would be better. Want me to reopen so you can post one? — Peter Cordes, Apr 27 '19 at 17:30
There is still no reason to use `xchg`. Spill your `r14`/`r15` to the stack using `mov` (or even `push`) then load the params using `mov`s too. — Jester, Apr 27 '19 at 17:58
@Peter gave a good answer and received an upvote in the original post. But I’m still curious about this implicit lock for `XCHG`, as I still consider this instruction far from the best choice for thread synchronization. — Zoltán Bíró, Apr 27 '19 at 18:41
@ZoltánBíró: there is no way to disable the implicit `lock` in `xchg [mem], reg`. You need to use multiple different instructions. An xor-swap is possible but very inefficient. Still maybe not as bad as `xchg`, depending on the microarchitecture and the surrounding code (how much it sucks to wait for all previous stores to execute and commit to L1d cache before doing any later loads). e.g. some in flight cache-miss stores could make it very expensive. — Peter Cordes, Apr 27 '19 at 18:58
Compilers basically never use `xchg` even between registers. They use it only to implement `std::atomic` stores with seq_cst memory order, and to implement `std::atomic::exchange`. — Peter Cordes, Apr 27 '19 at 18:59
It would occasionally be useful if x86 had a microcoded but non-atomic `swap reg,mem`, but it doesn't. There is no such instruction, otherwise I would have used it in my answer on the linked duplicate. (Or if there was a new one, written a new answer for x86-64). But really your entire plan of filling up all the registers with params is normally a bad idea. Leave yourself some scratch regs for computation. Ross's comment about not doing this in the first place, and/or saving the call-preserved regs to their own locations, is the normal way because it's more efficient. — Peter Cordes, Apr 27 '19 at 19:09

Peter Cordes · Answer 1 · 2022-09-03T18:53:33.930

As Ross's answer explains, the standard widely-used way is to spill (and later reload) something else to free up a tmp reg.

You're shooting yourself in the foot by loading everything into registers first, instead of loading as needed. Sometimes you can even use an arg as a memory source operand without a separate mov load at all.
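As a rough illustration of using an argument directly as a memory source operand, here is a minimal C sketch with GNU-style inline asm (AT&T operand order, so the destination is on the right). The function name `add_from_slot` and the portable fallback branch are assumptions made for this example; they are not part of the question's Delphi code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds the value stored at *slot (think of a
   stack-passed argument) to acc without a separate mov load first. */
uint64_t add_from_slot(uint64_t acc, const uint64_t *slot)
{
    assert(slot != NULL);
#if defined(__x86_64__) && defined(__GNUC__)
    /* Intel-syntax equivalent: add acc_reg, [slot] */
    __asm__("addq %[mem], %[acc]"
            : [acc] "+r"(acc)
            : [mem] "m"(*slot)
            : "cc");
#else
    acc += *slot; /* portable fallback for other compilers/targets */
#endif
    return acc;
}
```

The argument never occupies a register of its own here; only the accumulator does.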

But to answer the title question:

Despite the question title, my answer on swapping 2 registers in 8086 assembly language(16 bits) does exactly address swapping a register with memory efficiently, avoiding xchg because of the implicit lock prefix. Spill (and later reload) a tmp reg, or in the worst case, XOR-swap between reg and mem. That's horrible, and basically serves to illustrate why your whole approach will lead to an inefficient implementation.
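The temp-register variant can be sketched as follows, again in C with GNU-style inline asm. The name `swap_via_temp` is hypothetical, and the compiler picks the scratch register here, standing in for the register that hand-written asm would free by spilling something else. The sequence uses only plain `mov`, so nothing is locked and the swap is not atomic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: stores reg_value to *slot and returns the old
   contents of *slot, using three plain mov instructions. */
uint64_t swap_via_temp(uint64_t *slot, uint64_t reg_value)
{
    assert(slot != NULL);
#if defined(__x86_64__) && defined(__GNUC__)
    uint64_t tmp; /* scratch register, written before the inputs are consumed */
    __asm__("movq %[mem], %[tmp]\n\t" /* mov tmp, [slot] */
            "movq %[reg], %[mem]\n\t" /* mov [slot], reg */
            "movq %[tmp], %[reg]"     /* mov reg, tmp    */
            : [reg] "+r"(reg_value), [mem] "+m"(*slot), [tmp] "=&r"(tmp));
#else
    uint64_t tmp = *slot; /* portable fallback with the same effect */
    *slot = reg_value;
    reg_value = tmp;
#endif
    return reg_value;
}
```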

(As Ross says, you probably aren't (yet) capable of writing asm more efficient than compilers will make. Once you understand how to create efficient asm (Agner Fog's optimization guide and microarch guide: https://agner.org/optimize/, and other links in https://stackoverflow.com/tags/x86/info) and can spot actual inefficiencies in optimized compiler output, then you could sometimes write better asm by hand if you wanted to. (Usually with compiler output as a starting point). But normally you'd just use that experience to tweak your C source to get better asm from your compiler if possible, because that's more useful/portable long-term. And it rarely matters enough to be worth hand-writing asm.

At this point you're more likely to learn techniques for making more efficient asm from looking at gcc -O3 output. But missed optimizations are not rare, and if you spot some you might report them on GCC's bugzilla.)

The implicit-lock semantics of xchg come from 386. The lock prefix had existed since 8086, for use with instructions like add/or/and/etc [mem], reg or immediate. That included lock xchg, because plain xchg apparently didn't have implicit lock behaviour (without the prefix) until 386. Or maybe it wasn't documented until then? IDK why Intel made that change. Perhaps for primitive SMP 386 systems.

The other instructions you mention were added later: bts/btr/btc in 386 (but weren't intended only for shared memory, thus an implicit lock wouldn't have made sense).

xadd in 486, and cmpxchg not until Pentium. (486 had an undocumented opcode for cmpxchg, see an old version of the NASM appendix A for commentary on it). These CPUs were designed later than 386, presumably after some initial experience with primitive SMP systems.

As you say, Intel wisely chose to not make lock implicit for those new instructions, even though the primary use-case was for atomic operations in multi-threaded code. SMP x86 machines started to become a thing with 486 and Pentium, but sync between threads on a UP machine didn't need lock. This is sort of the opposite question of Is x86 CMPXCHG atomic, if so why does it need LOCK?

8086 was a uniprocessor machine, so for synchronization between software threads, plain add [mem], reg is already atomic with respect to interrupts and thus to context switches. (And it's impossible to have multiple threads executing at once). The legacy #LOCK external signal the docs still mention only mattered wrt. DMA observers, or for MMIO to I/O registers on devices (rather than to plain DRAM).

(On modern CPUs, xchg [mem], reg on cacheable memory that isn't split across a cache-line boundary only takes a cache-lock, making sure the line stays in MESI Exclusive or Modified state from the load reading L1d to the store committing to L1d.)

I don't know why the 8086 architect(s) (primarily Stephen Morse designed the instruction set) chose not to make non-atomic xchg with memory available. (Correction, I think he did, and it was only 386 that changed it; this answer was originally written before I knew that it was a 386 change.) Maybe on 8086 it wasn't much slower to have the CPU assert #LOCK while doing the store + load transaction? But then we were stuck with those semantics for the rest of x86. x86 design has rarely been very forward-thinking, and if the main use-case for xchg was for atomic I/O then it saved code-size to make lock implicit.

There is no way to disable the implicit lock in `xchg [mem], reg`

You need to use multiple different instructions. An xor-swap is possible but very inefficient. Still maybe not as bad as xchg, depending on the microarchitecture and the surrounding code (how much it sucks to wait for all previous stores to execute and commit to L1d cache before doing any later loads). e.g. some in flight cache-miss stores could make it very expensive vs. memory-destination xor which can leave data in the store buffer.
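For completeness, the XOR-swap between a register and memory looks roughly like this. It is a sketch in C with GNU-style inline asm; `swap_via_xor` is a hypothetical name, and the fallback branch performs the same three XORs in plain C. It needs no scratch register, but it does two read-modify-write operations on memory, which is why it is only a last resort.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: swaps reg_value with *slot using three XORs and
   returns the old contents of *slot. Not atomic. */
uint64_t swap_via_xor(uint64_t *slot, uint64_t reg_value)
{
    assert(slot != NULL);
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__("xorq %[reg], %[mem]\n\t" /* xor [slot], reg : mem = m ^ r */
            "xorq %[mem], %[reg]\n\t" /* xor reg, [slot] : reg = m     */
            "xorq %[reg], %[mem]"     /* xor [slot], reg : mem = r     */
            : [reg] "+r"(reg_value), [mem] "+m"(*slot)
            :
            : "cc");
#else
    *slot ^= reg_value; /* portable fallback: same three steps */
    reg_value ^= *slot;
    *slot ^= reg_value;
#endif
    return reg_value;
}
```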

Compilers basically never use xchg even between registers (because it's not cheaper than 3 mov instructions on Intel, so it's not generally a useful peephole optimization to look for). They use it only to implement std::atomic stores with seq_cst memory order (because it's more efficient than mov + mfence on most uarches: Why does a std::atomic store with sequential consistency use XCHG?), and to implement std::atomic::exchange. But not std::swap with reg or memory.
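The distinction can be seen from the source side with a small sketch. It uses C11 `<stdatomic.h>` as an analogue of the `std::atomic` operations named above; the function names are hypothetical. The first two functions are the kind of code for which an x86-64 compiler may emit `xchg` with memory; the plain swap is ordinary loads and stores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Sequentially consistent store: the case where compilers may choose
   xchg (or mov + mfence, depending on compiler and version). */
void publish(atomic_int *flag, int value)
{
    assert(flag != NULL);
    atomic_store_explicit(flag, value, memory_order_seq_cst);
}

/* Atomic exchange: stores replacement and returns the previous value. */
int take(atomic_int *flag, int replacement)
{
    assert(flag != NULL);
    return atomic_exchange(flag, replacement);
}

/* Plain, non-atomic swap: no xchg is needed for this. */
void plain_swap(int *a, int *b)
{
    int tmp = *a;
    *a = *b;
    *b = tmp;
}
```

Inspecting the optimized compiler output for these three functions is a quick way to check what a particular compiler actually emits.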

It would occasionally be useful if x86 had a non-atomic 2 or 3 uop swap reg,mem, but it doesn't. There is no such instruction.

But especially with x86-64 having 16 registers, you're only having this problem because you created it for yourself. Leave yourself some scratch regs for computation.

_“You're shooting yourself in the foot by loading everything into registers first, instead of loading as needed”_ True, but in this special case param5 and param6 are used as memory base pointers and I wouldn’t gain much by deferring their move into dedicated registers. — Zoltán Bíró, Apr 28 '19 at 18:04

score 2 · Answer 2 · answered Apr 27 '19 at 19:59

Just do what compilers do. Load the arguments from the stack into registers as you need them, spilling registers to their own locations on the stack as necessary to free up registers to do so. This is the standard and widely used, if not very elegant, method for dealing with the problem of needing more registers than are available.

Also note that the Windows x64 calling convention requires that "non-volatile" (callee-saved) registers must be saved only in the prologue. (Although you can use chained unwind info to have multiple "prologues" in a function.)

So assuming you need to use all the callee-saved registers and are following the Windows x64 calling convention strictly, you'd need to do something like this:

example PROC    FRAME

_stack_alloc =  8   ; total stack allocation for local variables
                    ; must be MOD 16 = 8, so the stack is aligned properly;
_push_regs =    32  ; total size in bytes of the callee-saved registers
                    ; pushed on the stack

_param_adj =    _stack_alloc + _push_regs

; location of the parameters relative to RSP, including the incoming
; slots reserved for spilling parameters passed in registers

param1  =   _param_adj + 8h
param2  =   _param_adj + 10h
param3  =   _param_adj + 18h
param4  =   _param_adj + 20h
param5  =   _param_adj + 28h
param6  =   _param_adj + 30h

; location of local variables relative to RSP

temp1   =   0

    ; Save some of the callee-preserved registers
    push    rbp
    .PUSHREG rbp
    push    rbx
    .PUSHREG rbx
    push    rsi
    .PUSHREG rsi
    push    rdi
    .PUSHREG rdi

    ; Align stack and allocate space for temporary variables
    sub rsp, _stack_alloc
    .ALLOCSTACK _stack_alloc

    ; Save what callee-preserved registers we can in the incoming
    ; stack slots reserved for arguments passed in registers under the
    ; assumption there's no need to save the later registers

    mov [rsp + param1], r12
    .SAVEREG r12, param1
    mov [rsp + param2], r13
    .SAVEREG r13, param2
    mov [rsp + param3], r14
    .SAVEREG r14, param3
    mov [rsp + param4], r15
    .SAVEREG r15, param4

    .ENDPROLOG

    ; ...

    ; let's say we need to access param5 and param6, but R14
    ; is the only register available at the moment.

    mov r14, [rsp + param5]
    mov [rsp + temp1], rax  ; spill RAX 
    mov rax, [rsp + param6]

    ; ...

    mov rax, [rsp + temp1]  ; restore RAX

    ; ...

    ; start of the "unofficial" epilogue

    ; restore callee-preserved registers that weren't pushed

    mov r12, [rsp + param1]
    mov r13, [rsp + param2]
    mov r14, [rsp + param3]
    mov r15, [rsp + param4]

    ; start of the "official" epilogue
    ; instructions in this part are very constrained.

    add rsp, _stack_alloc
    pop rdi
    pop rsi
    pop rbx
    pop rbp
    ret

example ENDP

Now hopefully you're asking yourself if you really need to do all this, and the answer is yes and no. There's not much you can do to simplify the assembly code. If you don't care about exception handling you don't need the unwind info directives, but you still need pretty much everything else if you want your code to be as efficient as what a compiler can generate while still keeping it relatively easy to maintain.

But there is a way to avoid having to do all this: just use a C/C++ compiler. There's really not much need for assembly these days. It's unlikely you can write faster code than the compiler, and you can use intrinsics to access pretty much any special assembly instruction you want to use. The compiler can worry about where stuff lives on the stack, and it can do a very good job at register allocation, minimizing the amount of register saving and spilling necessary.
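As a minimal sketch of what that looks like for the six-parameter case from the question, consider the C function below. The function `remap_pixels`, its parameter names, and the lookup-table operation are all invented for illustration. The fifth and sixth arguments are base pointers, and the source never has to say where they live or which callee-saved registers get spilled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical six-argument routine. Under the Windows x64 convention the
   first four arguments arrive in rcx, rdx, r8, r9 and the two table
   pointers arrive on the stack; the compiler decides when to load them
   and which registers to save. */
void remap_pixels(uint8_t *dst, const uint8_t *src, size_t count,
                  uint8_t bias, const uint8_t *lut_a, const uint8_t *lut_b)
{
    assert(dst && src && lut_a && lut_b);
    for (size_t i = 0; i < count; ++i) {
        uint8_t v = (uint8_t)(src[i] + bias);   /* wraps modulo 256 */
        dst[i] = (uint8_t)(lut_a[v] ^ lut_b[v]);
    }
}
```

Looking at the optimized output for a function like this shows how the compiler handles the stack-passed pointers, which can serve as a starting point if hand-written asm is still wanted.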

(Microsoft's C/C++ compiler can even generate that chained unwind info I mentioned earlier so that callee-saved registers can be saved only when necessary.)
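To make the point about intrinsics concrete, here is a small sketch of a saturating byte add using SSE2 intrinsics from `<emmintrin.h>`. The helper name `add_saturate_u8` and the `HAVE_SSE2` macro are assumptions of this example, and a scalar loop handles the tail and targets without SSE2. Register allocation and instruction scheduling are left entirely to the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: dst[i] = min(255, a[i] + b[i]) for count bytes. */
void add_saturate_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                     size_t count)
{
    size_t i = 0;
    assert(dst && a && b);
#if HAVE_SSE2
    /* 16 bytes per iteration; unaligned loads/stores keep the sketch simple */
    for (; i + 16 <= count; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(va, vb)); /* paddusb */
    }
#endif
    /* scalar tail (and the whole array when SSE2 is unavailable) */
    for (; i < count; ++i) {
        unsigned sum = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```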

_“There's really not much need for assembly these days.”_ Well, “these days” people mostly code applications for smartphones, where indeed there might not be much need for assembly. But I’m curious how you would code time-critical press applications like real-time halftoning or fixed-point transparency flattening of huge images without intensively using SIMD (SSE) instructions. As implementing such code involves hundreds of assembly instructions, I won’t use intrinsics instead of true asm code even if I coded in C++. But I use Delphi and there is little chance that I will give up and migrate to C++. — Zoltán Bíró, Apr 28 '19 at 17:58
@ZoltánBíró Often, the compiler is able to generate better code than an assembly programmer, even when writing SSE and AVX code. Intrinsics are definitely the tool of choice for many situations as they delegate the task of register allocation and instruction scheduling to the compiler, something it is just much better at than the programmer in most cases. — fuz, Apr 28 '19 at 18:20
@ZoltánBíró Those applications are perfectly suited to using intrinsics in C/C++. Writing hand-crafted assembly that beats the code that a compiler can generate is much harder than you would think. You've apparently already decided that Delphi isn't suitable for whatever task you're trying to accomplish by using assembly, so your choice here isn't between C/C++ and Delphi, but C/C++ and assembly. Your inexperience with C/C++ will hurt you far less than your inexperience with assembly. — Ross Ridge, Apr 28 '19 at 18:21
_“Also note that the Windows x64 calling convention requires that "non-volatile" (callee-saved) registers must be saved only in the prologue.”_ OOOPS. You do say that it isn’t safe to do the following: `asm` `push r14` `push r15` _(... code …)_ `pop r15` `pop r14` `end` ? If not, why? (BTW I might have inexperience in x64 assembly but I have been working intensively in x86 asm for 15 years.) — Zoltán Bíró, Apr 28 '19 at 18:33
@fuz, this might be true for x64, but definitely not for x86. AFAIK none of the compilers use XMM registers for code generating in x86 mode. — Zoltán Bíró, Apr 28 '19 at 18:37
@ZoltánBíró I'm guessing you're trying to use inline assembly in Delphi. I don't have any experience with it, but according to the documentation you should use `.PUSHNV` to save non-volatile registers: http://docwiki.embarcadero.com/RADStudio/Rio/en/Assembly_Procedures_and_Functions Also compilers like GCC, Clang and Microsoft C++ are perfectly capable of generating code that uses XMM registers in 32-bit x86 code, whether through the use of intrinsics, auto-vectorization or simply just through the use of floating-point arithmetic. https://godbolt.org/z/dvCkx3 — Ross Ridge, Apr 28 '19 at 19:12
@ZoltánBíró gcc and clang definitely do if you allow them to. They can even automatically vectorise loops. And if you use intrinsics, they generate pretty much exactly the instructions you specify, though they may choose better instruction sequences than those you wrote (making intrinsics a very good choice!). — fuz, Apr 28 '19 at 19:18
@RossRidge, yes, `.PUSHNV` is OK, I have already used its sibling `.SAVENV` for XMM registers, but I still don’t understand why the above sequence of simply pushing and popping non-volatile registers wouldn’t work. I guess it might be about stack alignment, but in that case am I not allowed to use pop/push at all, even for 64-bit registers? Note that until now I made function calls from assembly (I mean calls where stack frame manipulations were involved) by disassembling a Delphi routine and copying the techniques from there. — Zoltán Bíró, Apr 28 '19 at 19:35
@ZoltánBíró It'll work until something throws an exception and the unwinder has to unwind the stack frame through your inline assembly function. Without the .PUSHNV and the unwind information it generates, the unwinder won't be able to restore the non-volatile registers. If you don't care what happens in that case then you don't need to use it, and all that matters is that your function returns with the same values in the non-volatile registers as they had when the function was entered. — Ross Ridge, Apr 28 '19 at 19:48
@RossRidge I’ve taken a look at many “why C++ generates better code than my assembly”-like questions and now I understand what you mean: I’ve even seen an example where the author divided by 2 using the `div` assembly instruction. OMG. Well, I assume that the programmer knows not just about `sar` but can even multiply by 17 with a shift and an addition, not to mention the elementary tricks of using `xor` for zeroing, `cmovcc` instead of jumping where possible etc. In addition, if you take care of interleaving long-latency instructions, you can write better asm code **when necessary**. — Zoltán Bíró, May 07 '19 at 09:34
