However, the value shown in ymm0 is 5.
The bit-pattern in ymm0 is 1084227584. The float interpretation of that number is 5.0.
But you can print /x $xmm0.v4_int32 to see a hex representation of the bits in xmm0.
What's so special about the XMM registers? Besides the SIMD instructions, are they the only registers that can store floating-point values?
No, in asm everything is just bytes.
Some compilers will use an integer register to copy a float or double from one memory location to another, if they're not doing any computation on it. (Integer instructions often have smaller encodings.) e.g. clang will do this: https://godbolt.org/z/76EWMY
void copy(float *d, float *s) { *d = *s; }
# clang8.0 -O3 targeting x86-64 System V
copy: # @copy
mov eax, dword ptr [rsi]
mov dword ptr [rdi], eax
ret
XMM/YMM/ZMM registers are special because they're the only registers that FP ALU instructions exist for (ignoring x87, which is only used for 80-bit long double in x86-64).
addsd xmm0, xmm1 (add scalar double) has no equivalent for integer registers.
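For example (assuming the x86-64 System V calling convention, where double args and the return value go in xmm0/xmm1), a trivial function compiles to exactly that instruction:

```c
// gcc/clang -O3 compile this to just:
//   addsd xmm0, xmm1    # a += b, scalar double, in XMM registers
//   ret
double double_add(double a, double b) {
    return a + b;
}
```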
Usually FP and integer data don't mingle very much, so providing a whole separate set of architectural registers leaves room for more data to be live in registers at once. (Given the same instruction-encoding constraints, it's a choice between 16 FP + 16 GP integer vs. 16 unified registers, not vs. 32 unified registers.)
Plus, a major microarchitectural benefit of a separate register file is that it can be physically close to the FP ALUs, while the integer register file can be physically close to the integer ALUs. For more, see Is there any architecture that uses the same register space for scalar integer and floating point operations?
Are float and double values always stored as floating point? Can we never store them as fixed point in C or assembly?
x86 compilers use float = IEEE754 binary32 https://en.wikipedia.org/wiki/Single-precision_floating-point_format. (And double = IEEE754 binary64). This is specified as part of the ABI.
Internally the as-if rule allows the compiler to do whatever it wants, as long as the final result is identical. (Or with -ffast-math, to pretend that FP math is associative, and assume NaN/Inf aren't possible.)
Compilers can't just randomly choose a different object representation for some float that other separately-compiled functions might look at.
There might be rare cases for locals that are never visible to other functions where a "human compiler" (hand-writing asm to implement C) could prove that fixed-point was safe. Or more likely, that the float values were exact integers small enough that double wouldn't round them, so your fixed-point could degenerate to integer (except maybe for a final step).
But it would be rare to know this much about possible values without just being able to do constant propagation and optimize everything away. That's why I say a human would have to be involved, to prove things the compiler wouldn't know to look for.
I think in theory you could have a C implementation that used a fixed-point float or double. ISO C places very few restrictions on what float and double actually are.
But float.h constants like FLT_RADIX and DBL_MAX_EXP have interactions that might not make sense for a fixed-point format, which has a constant distance between each representable value, instead of values being much closer together near 0 and much farther apart for large numbers. (Rounding error of 0.5 ulp is relative to the magnitude, instead of absolute.)
Still, most programs don't actually do things that would break if the "mantissa" and exponent limits didn't correspond to what you'd expect for DBL_MIN and DBL_MAX.
Another interesting possibility is to make float and double based on the Posit format (similar to traditional floating-point, but with a variable-length exponent encoding. https://www.johndcook.com/blog/2018/04/11/anatomy-of-a-posit-number/ https://posithub.org/index).
Modern hardware, especially Intel CPUs, has very good support for IEEE float/double, so fixed-point is often not a win. There are some nice SIMD instructions for 16-bit fixed-point, though, like high-half-only multiply, and even pmulhrsw which does fixed-point rounding.
But general 32-bit integer multiply has worse throughput than packed-float multiply. (Because the SIMD ALUs optimized for float/double only need 24x24-bit significand multipliers per 32 bits of vector element. Modern Intel CPUs run integer multiply and shift on the FMA execution units, with 2 uops per clock throughput.)