My laptop CPU supports AVX (Advanced Vector Extensions) but not AVX2. With AVX, the 128-bit xmm* registers are already extended to the 256-bit ymm* registers for floating-point arithmetic. However, in my tests, none of the Visual Studio versions from 2010 to 2015 use ymm* registers under /arch:AVX optimization, although they do under /arch:AVX2 optimization.
The following shows the disassembly for a simple for loop. The program is compiled with /arch:AVX in a release build, with all optimization options on.
float a[10000], b[10000], c[10000];
for (int x = 0; x < 10000; x++)
1000988F xor eax,eax
10009891 mov dword ptr [ebp-9C8Ch],ecx
c[x] = (a[x] + b[x])*b[x];
10009897 vmovups xmm1,xmmword ptr c[eax]
100098A0 vaddps xmm0,xmm1,xmmword ptr c[eax]
100098A9 vmulps xmm0,xmm0,xmm1
100098AD vmovups xmmword ptr c[eax],xmm0
100098B6 vmovups xmm1,xmmword ptr [ebp+eax-9C78h]
100098BF vaddps xmm0,xmm1,xmmword ptr [ebp+eax-9C78h]
100098C8 vmulps xmm0,xmm0,xmm1
100098CC vmovups xmmword ptr [ebp+eax-9C78h],xmm0
100098D5 add eax,20h
100098D8 cmp eax,9C40h
100098DD jl ComputeTempo+67h (10009897h)
const int winpts = (int)(window_size*sr+0.5);
100098DF vxorps xmm1,xmm1,xmm1
100098E3 vcvtsi2ss xmm1,xmm1,ecx
I have also verified that I can use ymm* registers to speed up my program further without crashing. I did that using the AVX intrinsics from immintrin.h, e.g. _mm256_mul_ps.
Can any Microsoft compiler developer explain this? Or is this perhaps one of the reasons why Visual Studio generates slower code than gcc/g++?
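For reference, here is a minimal sketch of what such a hand-written version of the loop above could look like. The function name add_mul_avx and the TARGET_AVX macro are made up for illustration and are not from my actual program. It assumes the CPU and the OS both support AVX, and it handles lengths that are not a multiple of 8 with a scalar tail.

```cpp
#include <immintrin.h>  // AVX intrinsics (_mm256_*)
#include <cstddef>

#if defined(__GNUC__) || defined(__clang__)
// Lets GCC/Clang emit AVX code for this one function without -mavx.
#define TARGET_AVX __attribute__((target("avx")))
#else
// MSVC accepts AVX intrinsics without a per-function attribute.
#define TARGET_AVX
#endif

// Hypothetical helper: c[i] = (a[i] + b[i]) * b[i], eight floats per iteration.
TARGET_AVX
void add_mul_avx(const float* a, const float* b, float* c, std::size_t n)
{
    std::size_t i = 0;
    // Main loop: one 256-bit ymm register holds 8 single-precision floats.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  // unaligned load, like vmovups
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vr = _mm256_mul_ps(_mm256_add_ps(va, vb), vb);
        _mm256_storeu_ps(c + i, vr);
    }
    // Scalar tail for the remaining 0..7 elements.
    for (; i < n; ++i)
        c[i] = (a[i] + b[i]) * b[i];
}
```

Calling add_mul_avx(a, b, c, 10000) would replace the loop shown in the disassembly. The unaligned load/store variants are used because the arrays in the question are not declared with any particular alignment.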
=============edited==============
The reason turns out to be that running a 32-bit OS on a 32-bit machine differs from running a 32-bit OS on a 64-bit machine. In the latter case, some OSes might not be aware of the ymm* registers and thus do not preserve their upper halves properly across a context switch. So if ymm* registers are used under a 32-bit OS on a 64-bit machine and a context switch occurs, the upper halves might be silently corrupted if another program is also using ymm* registers. Visual Studio is being conservative here.
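Whether the OS actually saves and restores the ymm* state can be checked at run time before taking a 256-bit code path. The sketch below follows the detection sequence documented by Intel: CPUID leaf 1 must report both OSXSAVE (ECX bit 27) and AVX (ECX bit 28), and XGETBV must show that the OS has enabled the XMM and YMM state bits in XCR0. The function name os_supports_avx is made up for illustration, and I have not checked it against every compiler version mentioned above.

```cpp
#include <cstdint>

#if defined(_MSC_VER)
#include <intrin.h>     // __cpuid
#include <immintrin.h>  // _xgetbv
#else
#include <cpuid.h>      // __get_cpuid (GCC/Clang)
#endif

// Hypothetical helper: returns true only if the CPU has AVX *and* the OS
// has enabled saving/restoring of XMM and YMM state (XCR0 bits 1 and 2).
bool os_supports_avx()
{
    unsigned int ecx = 0;
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 1);                       // leaf 1: feature flags
    ecx = static_cast<unsigned int>(regs[2]);
#else
    unsigned int eax = 0, ebx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;                       // CPUID leaf 1 not available
#endif
    const bool osxsave = (ecx & (1u << 27)) != 0;  // OS uses XSAVE/XRSTOR
    const bool avx     = (ecx & (1u << 28)) != 0;  // CPU implements AVX
    if (!osxsave || !avx)
        return false;  // XGETBV must not be executed without OSXSAVE

#if defined(_MSC_VER)
    const std::uint64_t xcr0 = _xgetbv(0);
#else
    unsigned int lo = 0, hi = 0;
    __asm__ __volatile__("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0u));
    const std::uint64_t xcr0 = (static_cast<std::uint64_t>(hi) << 32) | lo;
#endif
    // Bit 1 = XMM state, bit 2 = YMM state; both must be enabled by the OS.
    return (xcr0 & 0x6) == 0x6;
}
```

A program could call os_supports_avx() once at startup and fall back to the 128-bit or scalar path when it returns false, so the upper halves of ymm* are never touched on an OS that does not preserve them.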