As Ross's answer explains, the standard widely-used way is to spill (and later reload) something else to free up a tmp reg.
You're shooting yourself in the foot by loading everything into registers first, instead of loading as needed. Sometimes you can even use an arg as a memory source operand without a separate mov load at all.
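For instance, a hypothetical sketch (NASM syntax; the stack-arg location `[rsp+8]` is a placeholder) of folding a load into an ALU instruction:

```asm
; use a stack arg directly as a memory source operand,
; instead of loading it into a register first
add  eax, [rsp+8]      ; eax += arg: no separate  mov ecx, [rsp+8] / add eax, ecx
```

This saves both an instruction and a register, which is exactly what you're short of.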
But to answer the title question:
Despite the question title, my answer on Swapping 2 registers in 8086 assembly language (16 bits) does exactly address swapping a register with memory efficiently, avoiding xchg because of the implicit lock prefix: spill (and later reload) a tmp reg, or in the worst case XOR-swap between reg and mem. That's horrible, and basically serves to illustrate why your whole approach will lead to an inefficient implementation.
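A sketch of the spill/reload approach (NASM syntax; `[mem]` and the choice of ECX as scratch are placeholders):

```asm
; swap EAX with [mem] without xchg's implicit lock
push rcx               ; spill something to free up a tmp reg
mov  ecx, [mem]        ; load the old memory value
mov  [mem], eax        ; store the register
mov  eax, ecx          ; register gets the old memory value
pop  rcx               ; reload the spilled reg
```

Five instructions instead of one, which is why you want to arrange your code so you never need this in the first place.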
(As Ross says, you probably aren't (yet) capable of writing asm more efficient than compilers will make. Once you understand how to create efficient asm (Agner Fog's optimization guide and microarch guide: https://agner.org/optimize/, and other links in https://stackoverflow.com/tags/x86/info) and can spot actual inefficiencies in optimized compiler output, then you could sometimes write better asm by hand if you wanted to (usually with compiler output as a starting point). But normally you'd just use that experience to tweak your C source to get better asm from your compiler when possible, because that's more useful/portable long-term. And it rarely matters enough to be worth hand-writing asm.
At this point you're more likely to learn techniques for making more efficient asm from looking at gcc -O3 output. But missed optimizations are not rare, and if you spot some you might report them on GCC's bugzilla.)
The implicit-lock semantics of xchg come from 386. The lock prefix had existed since 8086, for use with instructions like add/or/and/etc [mem], reg or immediate, including explicit lock xchg. Apparently xchg with memory didn't have implicit-lock behaviour (without the prefix) until 386, or maybe it just wasn't documented until then? IDK why Intel made that change. Perhaps for primitive SMP 386 systems.
The other instructions you mention were added later: bts/btr/btc in 386 (but they weren't intended only for shared memory, so an implicit lock wouldn't have made sense).
xadd in 486, and cmpxchg not until Pentium. (486 had an undocumented opcode for cmpxchg, see an old version of the NASM appendix A for commentary on it). These CPUs were designed later than 386, presumably after some initial experience with primitive SMP systems.
As you say, Intel wisely chose to not make lock implicit for those new instructions, even though the primary use-case was for atomic operations in multi-threaded code. SMP x86 machines started to become a thing with 486 and Pentium, but sync between threads on a UP machine didn't need lock. This is sort of the opposite question of Is x86 CMPXCHG atomic, if so why does it need LOCK?
8086 was a uniprocessor machine, so for synchronization between software threads, plain add [mem], reg is already atomic with respect to interrupts and thus to context switches. (And it's impossible for multiple threads to be executing at once.) The legacy #LOCK external signal the docs still mention only mattered wrt. DMA observers, or for MMIO to I/O registers on devices (rather than to plain DRAM).
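e.g. on a uniprocessor, a single memory-destination RMW instruction can't be torn apart by a context switch (sketch; `counter` is a placeholder label):

```asm
inc  word [counter]    ; atomic wrt. interrupts on a UP machine: an interrupt
                       ; is only taken before or after the whole instruction,
                       ; never between its load and its store
```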
(On modern CPUs, xchg [mem], reg on cacheable memory that isn't split across a cache-line boundary only takes a cache-lock, making sure the line stays in MESI Exclusive or Modified state from the load reading L1d to the store committing to L1d.)
I don't know why the 8086 architects (Stephen Morse was the primary designer of the instruction set) chose not to make non-atomic xchg with memory available. (Correction: I think he did, and it was only 386 that changed it; this answer was originally written before I knew it was a 386 change.) Maybe on 8086 it wasn't much slower to have the CPU assert #LOCK while doing the store + load transaction? But then we were stuck with those semantics for the rest of x86. x86 design has rarely been very forward-thinking, and if the main use-case for xchg was atomic I/O then making lock implicit saved code-size.
There is no way to disable the implicit lock in xchg [mem], reg
You need to use multiple different instructions. An xor-swap is possible but very inefficient. Still maybe not as bad as xchg, depending on the microarchitecture and the surrounding code (how much it sucks to wait for all previous stores to execute and commit to L1d cache before doing any later loads). e.g. some in-flight cache-miss stores could make it very expensive, vs. memory-destination xor which can leave data in the store buffer.
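A sketch of what that reg/mem XOR-swap looks like (NASM syntax; `[mem]` is a placeholder):

```asm
; XOR-swap EAX with [mem]: no scratch reg needed, but every step depends
; on the previous one, and two of the three are load+store RMWs
xor  eax, [mem]        ; eax = A^M
xor  [mem], eax        ; mem = M^(A^M) = A
xor  eax, [mem]        ; eax = (A^M)^A = M
```

Three serially-dependent memory accesses, so don't actually do this except as a puzzle.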
Compilers basically never use xchg even between registers (because it's not cheaper than 3 mov instructions on Intel, so it's not generally a useful peephole optimization to look for). They use it only to implement std::atomic stores with seq_cst memory order (because it's more efficient than mov + mfence on most uarches: Why does a std::atomic store with sequential consistency use XCHG?), and to implement std::atomic::exchange. But not std::swap with reg or memory.
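e.g. a sketch of the asm compilers typically emit on x86-64 for a seq_cst std::atomic<int> store of 1 (`x` is a placeholder symbol):

```asm
mov  eax, 1
xchg [x], eax          ; implicitly locked: the store plus a full memory barrier
; instead of the alternative, slower on most uarches:
;   mov  dword [x], 1
;   mfence
```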
It would occasionally be useful if x86 had a non-atomic 2 or 3-uop swap reg, mem, but there is no such instruction.
But especially with x86-64 having 16 registers, you're only having this problem because you created it for yourself. Leave yourself some scratch regs for computation.