64 bit assembly, when to use smaller size registers

Question

I understand in x86_64 assembly there is for example the (64 bit) rax register, but it can also be accessed as a 32 bit register, eax, 16 bit, ax, and 8 bit, al. In what situation would I not just use the full 64 bits, and why, what advantage would there be?

As an example, with this simple hello world program:

section .data
msg: db "Hello World!", 0x0a, 0x00
len: equ $-msg

section .text
global start

start:
mov rax, 0x2000004      ; System call write = 4
mov rdi, 1              ; Write to standard out = 1
mov rsi, msg            ; The address of hello_world string
mov rdx, len            ; The size to write
syscall                 ; Invoke the kernel
mov rax, 0x2000001      ; System call number for exit = 1
mov rdi, 0              ; Exit success = 0
syscall                 ; Invoke the kernel

rdi and rdx, at least, only need 8 bits and not 64, right? But if I change them to dil and dl, respectively (their lower 8-bit equivalents), the program assembles and links but doesn't output anything.

However, it still works if I use eax, edi and edx, so should I use those rather than the full 64-bits? Why or why not?

Actually in Linux (and probably everything else?) the parameters to a syscall are 32-bits wide, so you should use EDI and EDX. http://www.win.tue.nl/~aeb/linux/lk/lk-4.html#ss4.3 — Matty K, Jul 05 '11 at 04:01
what about rax, should that change to eax as well? I tried changing those 3 and it works, but what I want to know is why I should do this and what is the advantage. — mk12, Jul 05 '11 at 04:10
In the case of this program, the only appreciable difference is that the literal values (4, 1, 0, etc.) are twice as big when they're 64-bit, so your program will be a few bytes larger and, in theory, could take longer to load into the CPU from the disk/memory. — Matty K, Jul 06 '11 at 03:32
So there's no reason to use the full 64 bits when you don't need to, right? (I know there's also no reason to hand code assembly, but I just want to make sure..) — mk12, Jul 06 '11 at 21:03
@MattyK: `mov r64, sign-extended-imm32` is 7 bytes, vs. 5 for `mov r32, imm32`. In GAS, you can use `movabs` to request `mov r64, imm64`, but NASM/YASM only choose that encoding based on the size of the constant. (And in fact NASM optimizes small constants to `mov r32, imm32` when you write the destination as `rdi`. I'm not sure about symbol addresses; it might leave them as `imm64` in case you're not using the "small" code model and you have symbols with addresses above 32 bits. It won't optimize `mov rdi,0` to `xor edi,edi` though, because of the side-effect on flags.) — Peter Cordes, Aug 31 '17 at 04:45
related: [The advantages of using 32bit registers/instructions in x86-64](//stackoverflow.com/q/38303333). For putting constants in registers, only 32-bit zero-extends implicitly to 64-bit. For putting addresses in registers, 10-byte `mov r64, imm64` works but is terrible; use RIP-relative `lea rsi, [rel msg]`. MacOS uses 64-bit addresses unavoidably so you can't optimize with `mov esi, msg` like you can on Linux. — Peter Cordes, Oct 28 '19 at 03:24

score 6 · Answer 1 · answered Jul 06 '11 at 14:36

You are asking several questions here.

If you just load the low 8 bits of a register, the rest of the register will keep its previous value. That can explain why your system call got the wrong parameters.
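To make the point above concrete, here is a minimal C sketch that models what a narrow register write does to the full 64-bit register. The function names `write_low8` and `write_low16` are hypothetical helpers invented for this illustration, and the "stale" register value in the usage note is an assumed leftover, not something taken from the question.

```c
#include <stdint.h>

/* Models `mov dil, imm8`: only bits 0-7 change; bits 8-63 keep
   whatever the register held before. */
uint64_t write_low8(uint64_t reg, uint8_t value) {
    return (reg & ~UINT64_C(0xFF)) | value;
}

/* Models `mov di, imm16`: only bits 0-15 change. */
uint64_t write_low16(uint64_t reg, uint16_t value) {
    return (reg & ~UINT64_C(0xFFFF)) | value;
}
```

If the register happened to hold 0x00007ffee3b1c0a8 beforehand, `write_low8(reg, 1)` yields 0x00007ffee3b1c001, so a system call that reads the full register would see that value as the file descriptor rather than 1.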

One reason for using 32 bits when that is all you need is that many instructions using EAX or EBX are one byte shorter than those using RAX or RBX, because 64-bit operand size requires a REX prefix. Constants loaded into the register can also be shorter.

The instruction set has evolved over a long time and has quite a few quirks!

score 3 · Answer 2 · edited Jun 20 '20 at 09:12

If 32-bit registers are all you need, you can safely use them; this works fine in 64-bit mode. But if you only need 16-bit or 8-bit registers, try to avoid them, or always use movzx/movsx to overwrite the remaining bits. Under x86-64, writing a 32-bit register clears the upper 32 bits of the corresponding 64-bit register. The main purpose of this is to avoid false dependency chains.

Please refer to the relevant section - 3.4.1.1 - of The Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 1:

32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register

Breaking dependency chains allows instructions to execute in parallel and out of program order, using the out-of-order execution that Intel CPUs have implemented since the Pentium Pro in 1995.

A quote from the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.8:

Code sequences that modifies partial register can experience some delay in its dependency chain, but can be avoided by using dependency breaking idioms. In processors based on Intel Core micro-architecture, a number of instructions can help clear execution dependency when software uses these instruction to clear register content to zero. Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.

Assembly/Compiler Coding Rule 37. (M impact, MH generality): Break dependencies on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.

MOVZX and MOV with 32-bit operands are equivalent in this respect on x86-64: both break dependency chains.
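As a rough illustration of why these two forms break the dependency, the following C sketch models their architectural effect. The helper names `write32` and `movzx8` are hypothetical; the point is only that the result does not depend on the old register contents, which is what lets the CPU rename the register.

```c
#include <stdint.h>

/* Models `mov eax, r/m32`: the 32-bit result is zero-extended
   into all 64 bits of RAX. */
uint64_t write32(uint64_t old_rax, uint32_t value) {
    (void)old_rax;              /* old contents are never consulted */
    return (uint64_t)value;
}

/* Models `movzx eax, r/m8`: the byte is zero-extended to 32 bits,
   and the 32-bit write then zero-extends to 64 bits. */
uint64_t movzx8(uint64_t old_rax, uint8_t value) {
    (void)old_rax;              /* old contents are never consulted */
    return (uint64_t)value;
}
```

Compare this with an 8-bit or 16-bit write, where the new value has to be merged with the old upper bits and therefore cannot be computed until the old value is known.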

That's why your code can execute faster if you always clear the upper bits of the full register when using a smaller one. When those bits are always cleared, there is no dependency on the previous value of the register, and the CPU can rename the register internally.

Register renaming is a technique used internally by a CPU that eliminates the false data dependencies arising from the reuse of registers by successive instructions that do not have any real data dependencies between them.

score 3 · Accepted Answer · Matty K · edited Sep 03 '17 at 23:27

First and foremost would be when loading a smaller (e.g. 8-bit) value from memory (reading a char, working on a data structure, deserialising a network packet, etc.) into a register.

MOV AL, [0x1234]

versus

MOV RAX, [0x1234]
AND RAX, 0xFF
; assuming there are actually 8 accessible bytes at 0x1234.
; x86-64 is little-endian, so the byte at 0x1234 lands in the
; low 8 bits of RAX; SHR RAX, 56 would instead leave the byte
; from 0x1234+7.

Or, of course, writing said value back to memory.

(Edit, like 6 years later):

Since this keeps coming up:

MOV AL, [0x1234]

- only reads a single byte of memory at 0x1234 (the inverse would only overwrite a single byte of memory)
- keeps whatever was in the other 56 bits of RAX
  - This creates a dependency between the past and future values of RAX, so the CPU can't optimise the instruction using register renaming.

By contrast:

MOV RAX, [0x1234]

- reads 8 bytes of memory starting at 0x1234 (the inverse would overwrite 8 bytes of memory)
- overwrites all of RAX
- interprets those bytes as a little-endian integer, so the byte at 0x1234 ends up in AL (multi-byte big-endian fields, common in network packets, need a byte swap such as BSWAP afterwards)

Also important to note:

MOV EAX, [0x1234]

- reads 4 bytes of memory starting at 0x1234 (the inverse would overwrite 4 bytes of memory)
- overwrites all of RAX, but the high bits will all be zero
  - see: Why do most x64 instructions zero the upper part of a 32 bit register

Then, as mentioned in the comments, there is:

MOVZX EAX, byte [0x1234]

- only reads a single byte of memory at 0x1234
- extends the value to fill all of EAX (and thus RAX) with zeroes (eliminating the dependency and allowing register renaming optimisations).

In all of these cases, if you want to write from the 'A' register into memory you'd have to pick your width:

MOV [0x1234], AL   ; write a byte (8 bits)
MOV [0x1234], AX   ; write a word (16 bits)
MOV [0x1234], EAX  ; write a dword (32 bits)
MOV [0x1234], RAX  ; write a qword (64 bits)
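The load and store widths above can be modelled in portable C to check which byte ends up where. This is only a sketch: `load_qword`, `load_byte_zx` and `store_low` are hypothetical names, and the loops spell out little-endian byte order explicitly so the result does not depend on the machine that compiles the code.

```c
#include <stddef.h>
#include <stdint.h>

/* Models `mov rax, [p]` on little-endian x86-64: the byte at the
   lowest address becomes the least significant byte of RAX. */
uint64_t load_qword(const uint8_t *p) {
    uint64_t v = 0;
    for (size_t i = 0; i < 8; i++) {
        v |= (uint64_t)p[i] << (8 * i);
    }
    return v;
}

/* Models `movzx eax, byte [p]`: one byte read, upper bits zero. */
uint64_t load_byte_zx(const uint8_t *p) {
    return p[0];
}

/* Models `mov [p], al/ax/eax/rax` for width = 1, 2, 4 or 8:
   only the low `width` bytes of the register reach memory. */
void store_low(uint8_t *p, uint64_t reg, size_t width) {
    for (size_t i = 0; i < width && i < 8; i++) {
        p[i] = (uint8_t)(reg >> (8 * i));
    }
}
```

With memory holding the bytes 0x11, 0x22, ..., 0x88 at ascending addresses, `load_qword(mem) & 0xFF` is 0x11 (the same byte `load_byte_zx` returns), while `load_qword(mem) >> 56` is 0x88, the byte at offset 7.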

edited Sep 03 '17 at 23:27

answered Jul 05 '11 at 02:18

Matty K


2

Erm... x86_64 is _always_ little endian, so your examples will yield different results. – Ruslan Aug 12 '15 at 19:55
2

The best choice here is `movzx eax, byte [0x1234]`. – Peter Cordes Aug 31 '17 at 01:30
1

Peter Cordes is right. The "mov al" doesn't break the dependency chain. – Maxim Masiutin Aug 31 '17 at 22:52
1

Registers don't really have an endianness. Left-shift multiplies by powers of 2, right shift divides by powers of 2. (So the MSB is the left-most bit.) The concept only applies when you can load the bytes from separate addresses. (And in that case, your first example should be `movzx eax, byte [0x1234 + 7]` if you really want to invent some weird case for the first example of your answer. Hint: people learning assembly often have enough problems with endianness without an example that assumes big-endian data in memory when the question wasn't about that. I'd suggest just deleting that part.) – Peter Cordes Sep 01 '17 at 01:55
@PeterCordes what would be the best way to write "There is a single byte of data in memory, at 0x1234, which I want to read into a register. This is how I would read it, given one of: (a) I don't know what's adjacent to it in memory or if I even have access to read those locations; or (b) I do." ? – Matty K Sep 01 '17 at 02:02
@MattyK: In that case, `movzx eax, byte [0x1234]` to zero-extend it into a [64-bit register](https://stackoverflow.com/questions/11177137/why-do-most-x64-instructions-zero-the-upper-part-of-a-32-bit-register), avoiding any [partial-register penalties or false dependencies](https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to). On current Intel CPUs, it decodes to a single load uop (no ALU needed). `movsx eax, byte [0x1234]` would be good if you wanted to sign-extend it to a 32-bit signed integer. (Or into `rax`...) – Peter Cordes Sep 01 '17 at 02:06
@PeterCordes Cool thanks. And now what about if I've already read one byte (or part-byte) of the value into AH, which, because of awesome protocol standardisation, is located somewhere else in memory; now I need to read the final 8 bits into AL so I can operate on the whole value? (Just pointing out, not all dependencies are false ones.) – Matty K Sep 01 '17 at 02:12
On older Intel, it was usually a good idea to avoid any partial-register true dependencies, because merging caused stalls. AMD doesn't rename any partial registers separately from the full register, so merging happens during the write, not when you later read the full reg. You'd want to `movzx` the low byte first to clear the reg, then `mov ah, [high]`. That might also be optimal on Intel Haswell (see the link in my previous comment for my experimental tests of exactly how partial regs work on HSW/SKL, since Agner Fog doesn't describe it). On Core2, you might `movzx` / `shl 8` / `or`. – Peter Cordes Sep 01 '17 at 02:27
Of course if your bytes are contiguous but in the big-endian order, load them all and endian-swap with `bswap eax` or `rax`, or `ror ax,8`. (For PDP-endian, well that's more work :P Maybe best to actually do two loads in that case.) – Peter Cordes Sep 01 '17 at 02:29
Cool. So what should I change about the answer to clear the downvote? Nothing in it is untrue (?), and it does have a reference to the dependency issue. Do I need to come up with a more specific use-case example to validate ever wanting to read into a sub-register? – Matty K Sep 03 '17 at 22:53
I never saw your reply because you didn't @notify me. Found this question again while looking for a duplicate for another recent Q. The first part is still wrong: `mov rax, [mem]` / `shr rax, 56` is similar to a byte load from `[mem+7]`, not from `[mem]`. x86 is little-endian and registers don't have endianness. Also, byte loads are not a good example of when you should use narrow regs: `movzx eax, byte [mem]` is x86's "normal" byte-load like ARM `ldrb`. `mov al, [mem]` is a merge into the low byte of RAX which you rarely want. Narrow regs are useful for unpacking ints to bytes/words. – Peter Cordes Jul 23 '19 at 21:27
1

If I'm not mistaken, you can always `AND` the lower part of a value into the register containing the higher part, f.e. you have 0x1230 0000 and you want 123 to be in the lower part, you can `and rxx, 0x123`. Should be correct for high value into register containing lower part, too, with the difference you'd have a longer number.... Correct me if I'm wrong – clockw0rk Oct 02 '19 at 13:09

score 1 · Answer 4 · answered Jul 05 '11 at 02:19

If you want to work with only an 8-bit quantity, then you'd work with the AL register. Same for AX and EAX.

For example, you could have a 64-bit value that contains two 32-bit values. You can work on the low 32-bits by accessing the EAX register. When you want to work on the high 32-bits, you can swap the two 32-bit quantities (reverse the DWORDs in the register) so that the high bits are now in EAX.
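A small C sketch of the swap described above, assuming it is done with a 32-bit rotate (`rol rax, 32` or `ror rax, 32`, as suggested in the comments). The names `swap_halves` and `low32` are hypothetical helpers for this illustration.

```c
#include <stdint.h>

/* Models `rol rax, 32` (equivalently `ror rax, 32`): the two
   32-bit halves of the register trade places. */
uint64_t swap_halves(uint64_t rax) {
    return (rax << 32) | (rax >> 32);
}

/* The value visible when reading EAX: the low 32 bits of RAX. */
uint32_t low32(uint64_t rax) {
    return (uint32_t)rax;
}
```

For 0x1111222233334444, `low32` first sees 0x33334444, and after `swap_halves` it sees 0x11112222. As a comment in this thread points out, this only helps for reading the halves: writing a 32-bit register zeroes the upper 32 bits, so the other half would be lost.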

Jim Mischel

How would I go about swapping the 32 bit quantities? – mk12 Jul 05 '11 at 02:21
What would the actual instruction in nasm be though? I'm kind of new to this. – mk12 Jul 05 '11 at 02:31
1

ROL or ROR, for rotate left or right, respectively, with a count of 32 (e.g. `rol rax, 32`). In this case it doesn't matter which direction. There's also RCL and RCR for rotating through the carry flag, which are subtly different. – Matty K Jul 05 '11 at 03:58
2

This works as long as "work with" is read-only. Writing a 32-bit register zeros the upper 32. https://stackoverflow.com/questions/11177137/why-do-most-x64-instructions-zero-the-upper-part-of-a-32-bit-register. Since x86-64 includes SSE2 as baseline, if you want SIMD packing of multiple values per register, you can use the XMM registers. (SWAR does have its place, though, and doing a 64-bit load and unpacking with ALU shifts can be useful.) – Peter Cordes Aug 31 '17 at 01:33

score 1 · Answer 5 · answered Jul 05 '11 at 02:20

64 bits is the largest size you can work with as a single unit in a general-purpose register. That doesn't mean that's how much you need to use.

If you need 8 bits, use 8. If you need 16, use 16. If it doesn't matter how many bits, then it doesn't matter how many you use.

Admittedly, when on a 64-bit processor, there's very little overhead to use the full 64 bits. But if, for example, you are calculating a byte value, working with a byte will mean the result will already be the correct size.

Jonathan Wood

32-bit operand size is usually the best: smallest code-size (no REX or operand-size prefixes), and no partial-register merging / false dependencies. (8-bit also has no prefixes necessary, but can create slowdowns with partial-register stuff if you aren't careful to understand the situation for both AMD CPUs (no partial reg renaming) vs. P6 / early SnB-family vs. Haswell and later. [Why doesn't GCC use partial registers?](//stackoverflow.com/q/41573502)) – Peter Cordes Jul 23 '19 at 21:31
