If both vmovaps and vmovapd instructions copy a whole XMM register into another, why do both of them exist?

Question

In the book Computer Systems: A Programmer's Perspective (CS:APP), it's mentioned that the compiler generates assembly code that copies one XMM register to another using the instructions `vmovaps` and `vmovapd`, and that these instructions copy the whole source XMM register into the destination XMM register. My question is: why do both of these instructions exist when both of them just copy the source bit for bit to the destination? I don't see the significance of specifying whether we are copying singles or doubles.

The book includes a paragraph on this, from which I speculate that the two instructions exist because it is left microarchitecture-defined whether to copy only the low-order single or double, or to copy the whole register.

I don't know, but sometimes these sorts of redundant instructions exist in order to make the encoding consistent. There may be a whole bunch of opcodes that perform various operations on xmm registers, where one bit of the instruction specifies whether to operate on singles or doubles. One of those opcodes is `vmovap`, and it makes sense to have the same optional bit in the same place, to match the other instructions and probably make the decoding logic simpler. It's just that the bit has no effect. Likewise both versions are assigned mnemonics to be consistent. — Nate Eldredge, Jul 13 '20 at 03:27
I don't know of any way in which these instructions could copy only the low single/double; their definitions seem pretty clear that they move the whole register. It'd be a disaster for programmers if that behavior could mysteriously change. — Nate Eldredge, Jul 13 '20 at 03:29
@NateEldredge: Agreed, I think consistency of opcodes / decoding is the only reason for having `movapd` exist at all (until AVX-512 per-element masking), but that's only with an EVEX prefix, not the legacy SSE2 or AVX VEX-prefixed versions. Same with `orpd` vs. `orps` as per my answer [here](https://stackoverflow.com/questions/62111946/what-is-the-point-of-sse2-instructions-such-as-orpd), so I think this is an exact duplicate. — Peter Cordes, Jul 13 '20 at 03:55
@Tortellini: The *architectural* effect of those instructions is fully defined in all cases, only microarchitectural performance differences are allowed. The point the book is making is that you *could* use `movsd xmm1, xmm0` or `movaps xmm1, xmm0` for *scalar* code when you don't care about the high `double`. The 2nd one is faster and avoids an "output dependency" (because it's a merge into the destination, instead of just replacing the full destination), so it's better for OoO exec. — Peter Cordes, Jul 13 '20 at 03:58
Correction, `movsd xmm1, xmm0` is a merge; that's why it's slower than just a simple `movaps`. I mixed up my phrasing in the parenthetical in the last comment. — Peter Cordes, Jul 13 '20 at 17:11
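
To make the distinction in the comments above concrete, here is a minimal sketch in Python that models only the architectural effect on the 16 bytes of a register. It is not real SIMD code and says nothing about performance. The helper names (`movaps`, `movapd`, `movsd`, `xmm`) are hypothetical stand-ins for the legacy SSE register-to-register forms discussed above, and the old destination value is passed in explicitly so that the merge behaviour of `movsd` is visible.

```python
import struct

def movaps(dst: bytes, src: bytes) -> bytes:
    """Model of `movaps dst, src`: four 32-bit lanes, all copied; old dst is ignored."""
    lanes = struct.unpack("<4I", src)
    return struct.pack("<4I", *lanes)

def movapd(dst: bytes, src: bytes) -> bytes:
    """Model of `movapd dst, src`: two 64-bit lanes, all copied; old dst is ignored."""
    lanes = struct.unpack("<2Q", src)
    return struct.pack("<2Q", *lanes)

def movsd(dst: bytes, src: bytes) -> bytes:
    """Model of register-to-register `movsd dst, src`:
    low 8 bytes come from src, high 8 bytes are kept from the old dst (a merge)."""
    return src[:8] + dst[8:]

def xmm(low: float, high: float) -> bytes:
    """Hypothetical helper: build a 16-byte register image holding two doubles."""
    return struct.pack("<2d", low, high)

old_dst = xmm(-1.0, -2.0)
src = xmm(1.5, 2.5)

# Both packed moves produce the same 16 bytes as the source.
print(movaps(old_dst, src) == movapd(old_dst, src) == src)  # True

# The scalar move depends on the old destination: the high double survives.
print(struct.unpack("<2d", movsd(old_dst, src)))  # (1.5, -2.0)
```

In this model the two packed moves never read `dst`, whereas `movsd` does. That read of the old destination is the dependency on the destination register that the correction above refers to.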

