Visual Studio 2010 - 2015 does not use ymm* registers for AVX optimization

Question

My laptop CPU supports only AVX (advanced vector extension) but does not support AVX2. For AVX, the 128-bit xmm* registers have already been extended to the 256-bit ymm* registers for floating point arithmetic. However, I have tested that all versions of Visual Studio (from 2010 to 2015) do not use ymm* registers under /arch:AVX optimization, although they do so under /arch:AVX2 optimization.

The following shows the disassembly for a simple for loop. The program is compiled with /arch:AVX in release build, with all optimization options on.

    float a[10000], b[10000], c[10000];
    for (int x = 0; x < 10000; x++)
1000988F  xor         eax,eax  
10009891  mov         dword ptr [ebp-9C8Ch],ecx  
        c[x] = (a[x] + b[x])*b[x];
10009897  vmovups     xmm1,xmmword ptr c[eax]  
100098A0  vaddps      xmm0,xmm1,xmmword ptr c[eax]  
100098A9  vmulps      xmm0,xmm0,xmm1  
100098AD  vmovups     xmmword ptr c[eax],xmm0  
100098B6  vmovups     xmm1,xmmword ptr [ebp+eax-9C78h]  
100098BF  vaddps      xmm0,xmm1,xmmword ptr [ebp+eax-9C78h]  
100098C8  vmulps      xmm0,xmm0,xmm1  
100098CC  vmovups     xmmword ptr [ebp+eax-9C78h],xmm0  
100098D5  add         eax,20h  
100098D8  cmp         eax,9C40h  
100098DD  jl          ComputeTempo+67h (10009897h)  


    const int   winpts = (int)(window_size*sr+0.5);
100098DF  vxorps      xmm1,xmm1,xmm1  
100098E3  vcvtsi2ss   xmm1,xmm1,ecx

I have also tested that I can use ymm* registers to further speed up my program without crashing. I did that using IMM intrinsics, e.g. _mm256_mul_ps.

Can any Microsoft compiler developer give an explanation? Or maybe that is one of the reasons why Visual Studio gives slower codes than gcc/g++ compiler?

=============edited==============

The reason turns out to be that there exist some difference between running 32-bit OS on 32-bit machine and running 32-bit OS on 64-bit machine. In the latter case, some OS might not know the existence of ymm* registers and thus does not preserve the upper half registers properly during a context switch. Thus, if ymm* registers are used on 32-bit OS on 64-bit machine, if a context switch occurs, the upper half registers might get silently corrupted if another program is also using ymm* registers. Visual Studio is kind of conservative in this context.

Did you try a loop where the compiler knows the arrays are 32B-aligned? I notice it's using unaligned load/store instructions. Also, AMD CPUs do worse with 256b AVX code than with 128b AVX code, esp. Piledriver has huge issues with 256b stores. So if you didn't tell the compiler to optimize for a specific microarchitecture, 128b vectors are "safer". — Peter Cordes, Jan 14 '16 at 06:33
I tested void `void foo(float *a, float *b, float *c) { for(int i=0; i<10000; i++) c[i] = (a[i]+b[i])*b[i]; }` in MSVC 2015 with `cl /c /O2 /arch:AVX` and it uses `ymm`. I don't know what problem you are having. — Z boson, Jan 14 '16 at 08:13
@PeterCordes, there is no penalty to using unaligned load instructions with AVX. There is a penalty (but not a big one) for the memory not being 32B-aligned but both Clang and MSVC don't adjust for that (but GCC and ICC do). — Z boson, Jan 14 '16 at 08:19
Incidentally `__restrict` (`void foo(float * __restrict a, float * __restrict b, float * __restrict c) for (int i = 0; i < 10000; i++) c[i] = (a[i] + b[i])*b[i]; }`makes no difference in this case with MSVC but it makes a big difference with Clang and GCC. — Z boson, Jan 14 '16 at 08:23
I figured out the problem. You are compiling in 32-bit mode. Visual Studio defaults to 32-bit mode. — Z boson, Jan 14 '16 at 08:36

Z boson · Accepted Answer · 2016-01-14T08:49:30.490

4

I made a text file vec.cpp

//vec.cpp
void foo(float *a, float *b, float *c) {
    for (int i = 0; i < 10000; i++) c[i] = (a[i] + b[i])*b[i];
}

went to the command line with Visual Studio 2015 x86 x64 enabled and did

cl /c /O2 /arch:AVX /FA vec.cpp

looked at the file vec.asm and I see

$LL4@foo:
    vmovups ymm0, YMMWORD PTR [rax-32]
    lea rax, QWORD PTR [rax+64]
    vmovups ymm2, ymm0
    vaddps  ymm0, ymm0, YMMWORD PTR [rcx+rax-96]
    vmulps  ymm2, ymm0, ymm2
    vmovups YMMWORD PTR [r8+rax-96], ymm2
    vmovups ymm0, YMMWORD PTR [rax-64]
    vmovups ymm2, ymm0
    vaddps  ymm0, ymm0, YMMWORD PTR [rcx+rax-64]
    vmulps  ymm2, ymm0, ymm2
    vmovups YMMWORD PTR [r8+rax-64], ymm2
    sub rdx, 1
    jne SHORT $LL4@foo
    vzeroupper

The problem is that you are compiling in 32-bit mode. Using the same function above but compiling in 32-bit mode I get

$LL4@foo:
    lea eax, DWORD PTR [ebx+esi]
    lea ecx, DWORD PTR [ecx+32]
    lea esi, DWORD PTR [esi+32]
    vmovups xmm1, XMMWORD PTR [esi-48]
    vaddps  xmm0, xmm1, XMMWORD PTR [ecx-32]
    vmulps  xmm0, xmm0, xmm1
    vmovups XMMWORD PTR [edx+ecx-32], xmm0
    vmovups xmm1, XMMWORD PTR [esi-32]
    vaddps  xmm0, xmm1, XMMWORD PTR [eax]
    vmulps  xmm0, xmm0, xmm1
    vmovups XMMWORD PTR [eax+edx], xmm0
    sub edi, 1
    jne SHORT $LL4@foo

edited Jan 14 '16 at 08:49

answered Jan 14 '16 at 08:31

Z boson

32,619
11
123
226

1

Well, but why doesn't the compiler use the YMM registers when generating 32-bit code? YMM0-YMM7 are certainly available in x86-32 mode. – Cody Gray - on strike Jan 14 '16 at 10:20
@CodyGray, I have no idea. Another question: why does Visual Studio default to 32-bit mode even when the OS is 64-bit? It's annoying to have to go to configuration manager after creating a new project and tell it to use x64. I don't fine the GUI coding of Visual Studio that convenient so I mostly use the command line ([this helped](https://www.youtube.com/watch?v=Ee3EtYb8d1o)) anyway now. I guess the debugger of Visual Studio is good but I use printf and assembly to debug still anyway. – Z boson Jan 14 '16 at 10:33
Your second question is easy to answer---lots of applications are still compiled for 32-bit for maximum compatibility. Not everyone has 64-bit processors and/or is running 64-bit operating systems. Windows itself still has a 32-bit edition that won't support 64-bit apps. – Cody Gray - on strike Jan 14 '16 at 10:39
This doesn't explain *why* MSVC doesn't use ymm regs for 32bit targets, which is the main question (like @Cody said). Perhaps it doesn't want to assume OS support for AVX? Running a crusty old OS on new hardware would mean context switches would only save/restore the xmm regs, not ymm. (There are no CPUs that support AVX but not 64bit. IIRC, pre-silvermont Atom is 32bit-only. Some people unfortunately end up with 32bit windows on hardware that supports 64bit, though.) Anyway, so this might be an attempt at compatibility with OSes that don't know about AVX. – Peter Cordes Jan 14 '16 at 15:47
See also http://stackoverflow.com/q/34069054/224132: Apparently if you use ymm regs on a Windows version that doesn't know about AVX, the upper halves will be silently corrupted on context switches, rather than causing an illegal-insn exception. Also, I wondered if MSVC is maybe trying to avoid performance penalties from mixing SSE and AVX code without `VZEROUPPER`, but that doesn't make sense because it could just run `VZEROUPPER`. – Peter Cordes Jan 14 '16 at 15:48
1

@PeterCordes, I don't read in the question anything about 32bit targets. I think the best lesson for the OP is to use 64-bit mode from now on. But I agree my answer would be better if I could explain the issue with 32-bit mode. You make some good guesses but they are just speculation. Can you explain why MSVC would use xmm for AVX but ymm for AVX2 in 32-bit mode like the OP said (I have not tested this but I assume the OP used 32-bit mode for both)? Maybe we would have to ask MSVC why they did this? – Z boson Jan 14 '16 at 19:16
@PeterCordes, actually I just checked it to make sure. It uses xmm for `/arch:AVX` but ymm for `/arch:AVX2` in 32-bit mode. Why is this? – Z boson Jan 14 '16 at 19:22
1

The OP asks why MSVC doesn't use ymm registers in this case. The OP hadn't figured out that 32bit vs. 64bit was a factor, so good find with that. However, it's not a satisfactory explanation for *why* MSVC would be designed that way. I'm looking at the OP's question as "why does MSVC with these options not use ymm regs?", when one of the options is 32bit mode. The fact that it does use ymm regs in 64bit mode is **more weird**, not less. Also, I have no idea why AVX2 would make a difference, except that there's even more perf benefit from using 256b ops on newer CPUs like Haswell. – Peter Cordes Jan 14 '16 at 20:56
Oh, also, AVX2 excludes AMD Bulldozer-family CPUs. So perhaps this is a performance-tuning decision, to avoid potential performance pitfalls on Piledriver. (And the generally slower performance from decode bottlenecks on Bulldozer / Steamroller, since a series of 2 m-op instructions can only decode at one per clock.) The Piledriver 256b store perf bug could be worked around with `vextractf128 [mem+16], ymm, 1` / `vmovups [mem], xmm`, though. gcc does this for unaligned stores and loads when tuning for SnB/IvB (`-mavx256-split-unaligned-store`), but not for Haswell. – Peter Cordes Jan 14 '16 at 21:01
@PeterCordes, I don't think that using xmm registers with vex encoding only in 32-bit mode is a good solution for AMD processors. Especially since AMD created x86-64. So if I had to guess that's not why MSFT did this. – Z boson Jan 14 '16 at 21:13
@Zboson: VEX-encoded xmm instructions are probably the best bet for AMD Bulldozer-family CPUs, compared to VEX-encoded ymm instructions. 64bit vs. 32bit is an orthogonal choice here. Of course 64bit code with VEX-encoded xmm instructions is better than similar 32bit code. I'm speculating that MSVC is for some reason making different tuning decisions (caring more about perf downsides on Piledriver?) for 32bit vs. 64bit code. You're right that it doesn't seem to make sense, but maybe someone updated the tuning settings for one but not the other? – Peter Cordes Jan 14 '16 at 21:21
Pure speculation on my part, based on no information whatsoever. I don't use MSVC at all, and have mostly only used Windows to play video games. – Peter Cordes Jan 14 '16 at 21:23
When I try this I get SSE code not AVX: $LC8@foo: lea eax, DWORD PTR [eax+4] vmovss xmm1, DWORD PTR [eax-4] vaddss xmm0, xmm1, DWORD PTR [ebx+eax-4] vmulss xmm0, xmm0, xmm1 vmovss DWORD PTR [eax+ebp-4], xmm0 sub ecx, 1 jne SHORT $LC8@foo – Superfly Jon Jul 21 '17 at 13:32

xuancong84 · Answer 2 · 2016-01-18T04:52:19.127

Yes, it was 32-bit/64-bit problem. Compiling in x64 mode does not have the problem. However, for some reason, my program has to be compiled in 32-bit mode as it was a plugin of some sort where only 32-bit is supported. Nonetheless, it is still contradictory that even in 32-bit mode, setting /arch:AVX2 will allow the compiler to access ymm* registers.

From Intel specification, http://www.felixcloutier.com/x86/ADDPS.html, it says that "in 64-bit mode, using a REX prefix in the form of REX.R permits this instruction to access additional registers (XMM8-XMM15)." Also in http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html, it is stated that 32-bit programs can access ymm* registers in both 32-bit and 64-bit operation systems. The only restriction is that in 32-bit mode, you don't have access to xmm8-xmm15 nor ymm8-ymm15 because the instructions are shorter. That is why I am able to manually use intrinsic functions to access the ymm* registers without causing an illegal instruction crash.

So in conclusion, unless there exists some CPUs that support only AVX but not AVX2, will encounter some problems accessing ymm* registers in 32-bit mode, (which has already been proven not to be the case), the above-mentioned restriction is not necessary. And I still hope Visual C++ compiler can be improved to make this optimization option available since many computers support only AVX but not AVX2, and using ymm* registers can double the performance of floating point arithmetic.

The normal procedure when someone answers your question is to accept the answer if you think it answers your question. You don't seem to be aware of this because you did not do it for @PaulR 's answer [here](http://stackoverflow.com/a/34586817/2542702) either. — Z boson, Jan 15 '16 at 08:41
Sorry, I am quite new to stackoverflow, I just found out that the big tick below the rank-up/rank-down is also click-able. Ticked!-:) — xuancong84, Jan 18 '16 at 04:49
No problem. I did not really fully answer your question so it would be okay if you did not accept the answer. You can also change an accepted answer in case somebody later answers your question more to your satisfactory. — Z boson, Jan 18 '16 at 08:43

Visual Studio 2010 - 2015 does not use ymm* registers for AVX optimization

2 Answers2

Linked