
I am trying to compare the methods mentioned by Peter Cordes in his answer to the question 'set all bits in CPU register to 1'.

Therefore, I wrote a benchmark that sets all 13 registers to all-ones, i.e. every general-purpose register except e/rsp, e/rbp, and e/rcx.

The code is shown below; `times 32 nop` is used to avoid DSB and LSD influence.

mov ecx, 100000000
align 32
.test3:
    times 32 nop
    mov rax,-1
    mov rbx,-1
    ;mov ecx,-1
    mov rdx,-1
    mov rdi,-1
    mov rsi,-1
    mov r8,-1
    mov r9,-1
    mov r10,-1
    mov r11,-1
    mov r12,-1
    mov r13,-1
    mov r14,-1
    mov r15,-1

    dec ecx
    jge .test3
    jmp .out

I tested the methods he mentioned, listed below (full code here):

mov e/rax, -1                   

xor eax, eax        
dec e/rax               

xor ecx, ecx        
lea e/rax, [rcx-1]  

or e/rax, -1            

To keep this question concise, I will use group1 a (g1a) to stand for `mov eax,-1`, and similarly for the other patterns, in the tables below.

| number | pattern | test number |
| --- | --- | --- |
| group1 a | `mov eax,-1` | test 7 |
| group1 b | `mov rax,-1` | test 3 |
| group2 a | `xor eax, eax` / `dec eax` | test 6 |
| group2 b | `xor eax, eax` / `dec rax` | test 2 |
| group3 a | `xor ecx, ecx` / `lea eax, [rcx-1]` | test 0 |
| group3 b | `xor ecx, ecx` / `lea rax, [rcx-1]` | test -1 (test00) |
| group4 a | `or eax,-1` | test 5 |
| group4 b | `or rax,-1` | test 1 |

The table below shows that, from group 1 to group 3, using 64-bit registers costs exactly one more cycle per loop iteration.

IDQ_UOPS_NOT_DELIVERED also increases, which may explain the growing cycle count. But can it explain the exact one extra cycle per loop?

| group | cycles | MITE cycles (r1002479) | MITE 4-uop cycles (r4002479) | IDQ_UOPS_NOT_DELIVERED (r19c) |
| --- | --- | --- | --- | --- |
| g1a | 1,300,903,705 | 1,300,104,496 | 800,055,137 | 601,487,115 |
| g1b | 1,400,852,931 | 1,400,092,325 | 800,049,313 | 1,001,524,712 |
| g2a | 1,600,920,156 | 1,600,113,480 | 1,300,061,359 | 501,522,554 |
| g2b | 1,700,834,769 | 1,700,108,688 | 1,300,057,576 | 901,467,008 |
| g3a | 1,701,971,425 | 1,700,093,298 | 1,300,111,482 | 902,327,493 |
| g3b | 1,800,891,861 | 1,800,110,096 | 1,300,059,338 | 1,301,497,001 |
| g4a | 1,201,164,208 | 1,200,122,275 | 1,100,049,081 | 201,592,292 |
| g4b | 1,200,553,577 | 1,200,074,422 | 1,100,031,729 | 200,772,985 |
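As a rough cross-check (mine, not from the original post): summing the NASM encoding lengths of each loop body and charging the predecoder 16 bytes per cycle reproduces the measured per-loop cycle counts for groups 1 and 4 (it underestimates groups 2 and 3, where other front-end effects apparently dominate). The byte lengths below are hand-counted assumptions; verify them with `ndisasm`.

```python
import math

# Hand-counted encoding lengths in bytes (an assumption; check with ndisasm).
# r8..r15 always need a REX prefix, even with a 32-bit operand size.
GROUPS = {
    # name: (bytes/insn for rax/rbx/rdx/rdi/rsi, bytes/insn for r8..r15)
    "g1a": (5, 6),  # mov eax,-1 = B8 imm32;        mov r8d,-1 = 41 B8 imm32
    "g1b": (7, 7),  # mov rax,-1 = 48 C7 C0 imm32;  mov r8,-1  = 49 C7 C0 imm32
    "g4a": (3, 4),  # or eax,-1  = 83 C8 FF;        or r8d,-1  = 41 83 C8 FF
    "g4b": (4, 4),  # or rax,-1  = 48 83 C8 FF;     or r8,-1   = 49 83 C8 FF
}
NOP_CYCLES = 8  # 32 one-byte nops: two 16-byte blocks at ceil(16/5) = 4 cycles each
TAIL = 2 + 2    # dec ecx (2 bytes) + short jge (2 bytes)

for name, (legacy, rex) in GROUPS.items():
    body = 5 * legacy + 8 * rex + TAIL   # 5 legacy regs + 8 REX regs + loop tail
    cycles = math.ceil(body / 16)        # 16 predecode bytes per cycle
    print(f"{name}: {body} bytes -> {NOP_CYCLES + cycles} cycles/loop")
# g1a: 77 bytes -> 13; g1b: 95 -> 14; g4a: 51 -> 12; g4b: 56 -> 12,
# matching the measured cycle counts divided by the 1e8 iterations.
```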

Besides, the port distribution differs between g2a and g2b, unlike g1a vs. g1b (which have the same port distribution) or g3a vs. g3b.

And if I comment out `times 32 nop`, this phenomenon disappears. Is it related to MITE?

| group | p0 | p1 | p2 | p3 | p4 | p5 | p6 | p7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| g1a | 299,868,019 | 300,014,657 | 5,925 | 7,794 | 16,589 | 300,279,232 | 499,885,294 | 7,242 |
| g1b | 299,935,968 | 300,085,089 | 6,622 | 8,758 | 18,842 | 299,935,445 | 500,426,436 | 7,336 |
| g2a | 299,800,192 | 299,758,460 | 7,461 | 9,635 | 20,622 | 399,836,486 | 400,312,354 | 8,446 |
| g2b | 200,047,079 | 200,203,026 | 7,899 | 9,967 | 21,539 | 500,542,313 | 500,296,034 | 9,635 |
| g3a | 36,568 | 550,860,773 | 7,784 | 10,147 | 22,538 | 749,063,082 | 99,856,623 | 9,767 |
| g3b | 36,858 | 599,960,197 | 8,232 | 10,763 | 23,086 | 700,499,893 | 100,078,368 | 9,513 |
| g4a | 200,142,036 | 300,600,535 | 5,383 | 6,705 | 15,344 | 400,045,302 | 500,364,377 | 6,802 |
| g4b | 200,224,703 | 300,284,609 | 5,464 | 7,031 | 15,817 | 400,047,050 | 499,467,546 | 6,746 |

Environment: Intel i7-10700, Ubuntu 20.04, and NASM 2.14.02.

It is a little bit hard for me to explain this in English. Please comment if the description is unclear.

moep0
  • what is the question? are you trying to measure the difference between shorter and longer instructions? – old_timer Nov 27 '21 at 03:41
  • *`times 32 nop` is used to avoid DSB and LSD influence.* - and mean that you're benchmarking the legacy decoders (MITE), because this bottlenecks on the front-end. Especially with long instructions like 7-byte `mov rdx,-1` or 5-byte `mov edx,-1`. You tagged [intel], but what specific CPU did you use? Skylake-derived? I'm guessing not an E-core on Alder Lake; they have wider decode and mark instruction boundaries in L1I cache, while SnB-family CPUs fetch in 16-byte blocks for legacy-decode. See Agner's microarch pdf on https://agner.org/optimize/ – Peter Cordes Nov 27 '21 at 03:59
  • The general title is mostly a duplicate of [The advantages of using 32bit registers/instructions in x86-64](https://stackoverflow.com/q/38303333). IDK how specific an answer you're looking for about exactly what decode bottlenecks you've created with longer or shorter instructions, but pretty obviously using longer instructions will cost throughput when the average length is >= 4 or so, although SKL and later having 5 decoders can make up for that some thanks to buffering between decode and issue/rename. (Build up some cushion decoding 5 nops / clock, then eat into it when producing less) – Peter Cordes Nov 27 '21 at 05:09
  • @PeterCordes Environment: Intel i7-10700, Ubuntu 20.04, and NASM 2.14.02. I know that longer instructions cost throughput, but is the exact one extra cycle per loop a coincidence? Besides, group 2 shows a different port distribution; I am wondering what causes it. – moep0 Nov 27 '21 at 06:07
  • @old_timer The first question is whether the exact one extra cycle per loop between 32-bit and 64-bit registers is a coincidence. The second question is whether 32-bit vs. 64-bit registers can make the port distribution differ. – moep0 Nov 27 '21 at 06:09
  • https://github.com/moep0/relativeCode/tree/main/2021/1126 is a broken link. Is the repo private, perhaps? – Peter Cordes Nov 27 '21 at 06:11
  • @PeterCordes Sorry I forgot. make it public now. – moep0 Nov 27 '21 at 06:13
  • Your loop doesn't do any memory access, so any counts for p2/3 are not from your code. Presumably from interrupt handlers; use `perf stat --all-user` (or with older perf, use `-e cycles:u,uops_dispatched_port.port_0:u` etc.) But this test is simple enough that the noise from interrupt handlers isn't significant, no need to re-run your tests if you don't have another reason to do so. – Peter Cordes Nov 27 '21 at 06:22
  • Ok, from your github link, your actual group2 code is repeating the xor every time as well. So the table that says it's just `dec` is wrong / misleading. Also, more than half the registers used are r8d..r15d, so both instructions need REX prefixes regardless of operand-size. (3 bytes per instruction). (Leaving RBP unused, not just RSP). xor-zeroing is dep-breaking so you could repeat a few registers, although that would make it different from OR. Also in test0 / test00, you redundantly zero registers each time before LEA unlike the table shows, instead of zeroing once outside the loop. – Peter Cordes Nov 27 '21 at 06:30
  • @PeterCordes In fact I have added `--all-user`. Maybe it is because I use `sudo`. And I think this doesn't affect the previous conclusion. And yes, the pattern in the table is an abbreviation for the xor/dec. – moep0 Nov 27 '21 at 06:31
  • MITE decode of `dec` is probably slower because the decoders don't like to end with a potentially macro-fusable uop in the last decoder. They keep it around until the next cycle in case the next instruction is a `jcc`. Instruction-lengths may affect where decode boundaries fall. IDK exactly why this would have an impact on which port they get scheduled to, but maybe the back-end is fully draining more often with lower throughput, and maybe issue/rename/alloc prefers to send dec uops to p5/p6, leaving p0/p1 free for FP / multiply uops? – Peter Cordes Nov 27 '21 at 06:34
  • If there are still uops in the back-end when more are scheduled, [How are x86 uops scheduled, exactly?](https://stackoverflow.com/q/40681331) explains Intel's algorithm of counters that schedule to the port with fewest waiting uops. – Peter Cordes Nov 27 '21 at 06:37
  • @PeterCordes Oh thank you. The link seems to partially solve the second question. There only remains the question that why the 32bit registers cost one cycle less (per loop) than 64bit registers. – moep0 Nov 27 '21 at 06:50
  • `sudo` shouldn't be creating extra counts for `uops_dispatched_port.port_7`. I tried it with/without sudo on my machine, and it was still non-zero surprisingly, but only 196 to 260 per run, not thousands. (i7-6700k Skylake, Arch Linux, kernel 5.12, perf 5.13.) I set `perf_event_paranoid` = 0 on my machine so I can use perf without sudo, higher numbers still work if you only care about counting in user-space. – Peter Cordes Nov 27 '21 at 07:03
  • I also changed the command to use names instead of numbers, I think I got the right ones: `taskset -c 1 perf stat --all-user -e task-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,idq.all_mite_cycles_4_uops,idq_uops_not_delivered.core,idq.mite_cycles` – Peter Cordes Nov 27 '21 at 07:03
  • (I didn't bother changing the runtime branching to select a test; you could have used `-Dtarget=.test$number` to build, and `jmp target` since you're making it a compile-time constant. It's text macro stuff like CPP, not just a numeric constant.) – Peter Cordes Nov 27 '21 at 07:10
  • Anyway, I can reproduce your cycle counts for each test. Interesting that they're an integer multiple. I guess that tells us something about where the bottleneck is, maybe in the pre-decode stage of contiguous groups of instructions, not in the decoders themselves when machine-code bytes after the taken branch could decode in the same cycle. The MITE perf events are about how it delivers uops (or not), not about its internals like ILD; only `ild_stall.lcp` says anything about that. – Peter Cordes Nov 27 '21 at 07:11
  • @PeterCordes Cool tips that I will use next time. I also guess the MITE leads to such a difference. But I don't know which part of MITE lead to such a difference because there is no PMU that can directly observe the decoders' workload or fetch window. – moep0 Nov 27 '21 at 07:24
  • Right, to understand more about MITE, read Agner Fog's microarch PDF, especially the Sandybridge section has details about how the decoders work. SKL didn't change things much. https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Front-end does cover it specifically. (And probably mostly accurately, although I expect there are some details in that wikichip article that aren't precisely right.) – Peter Cordes Nov 27 '21 at 07:33
  • @PeterCordes Sorry I just don't understand the reasoning process of the conclusion that the bottleneck is in the pre-decode stage of contiguous groups of instructions. Is it a conjecture? – moep0 Nov 27 '21 at 07:42
  • Oh, I see. Pre-decode is limited to looking at 16 bytes per cycle, and perhaps only from contiguous fetch blocks. (Or maybe fetch itself is a bottleneck, but there's a queue between it and pre-decode, so the NOPs should give it some time to catch up.) Branch prediction may let the CPU paste together parts of different fetch blocks into one 16-byte pre-decode group. But the actual decoders themselves can, I think, look at more total bytes if there are enough in the queue. With large average instruction lengths, it's often pre-decode that's the problem. – Peter Cordes Nov 27 '21 at 08:03
  • Also, if there are more than 6 instructions in a 16-byte block, like will happen with 2-byte instructions that don't macro-fuse like `xor`/`dec` without REX prefixes, the left-over instructions are decoded alone, not as the start of a new 16-byte window. So you'd get a 6 / 2 pattern in a block of 2-byte instructions. That's 4/clock average, so it's ok but doesn't help "get ahead" or catch up by filling the instruction queue faster than the decoders can drain it. (OTOH, the 32x 1-byte NOPs do build up some buffer, pre-decoding in 6/6/4 patterns, averaging 5.33 instructions / clock.) – Peter Cordes Nov 27 '21 at 08:10
  • @PeterCordes OK, I see. Thanks for your detailed explanation. – moep0 Nov 27 '21 at 08:31
  • @PeterCordes Skylake has 4 decoders (that can deliver up to 5 uops per cycle to the IDQ), and it can predecode at most 5 instructions per cycle. – Andreas Abel Nov 27 '21 at 19:55

1 Answer


The bottleneck in all of your examples is the predecoder.

I analyzed your examples with my simulator uiCA (https://uica.uops.info/, https://github.com/andreas-abel/uiCA). It predicts throughputs that closely match your measurements.

The trace table that uiCA generates provides some insight into how the code is executed. For g1a, for example, it generates the following trace (image: trace table for g1a).

You can see that for the 32 nops, the predecoder requires 8 cycles, and for the remaining instructions, it requires 5 cycles, which together corresponds to the 13 cycles that you measured.

You may notice that in some cycles, only a small number of instructions are predecoded; for example, in the fourth cycle, only one instruction is predecoded. This is because the predecoder works on aligned 16-byte blocks, and it can handle at most five instructions per cycle (note that some sources incorrectly claim that it can handle 6 instructions per cycle). You can find more details on the predecoder, for example how it handles instructions that cross a 16-byte boundary, in this paper.
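This block-based limit can be sketched in a few lines. The toy model below is my simplification of the behavior described above (not uiCA itself): each instruction is charged to the aligned 16-byte block holding its last byte, and each block takes ceil(n/5) predecode cycles. Under this model, the 32 one-byte nops cost two blocks of 16 instructions at 4 cycles each, i.e. the 8 cycles seen in the trace, and the hand-counted g1a/g1b bodies come out at 5 and 6 cycles.

```python
import math

def predecode_cycles(lengths, start=0):
    """Toy predecode model: each instruction is charged to the aligned
    16-byte block holding its last byte; at most 5 instructions are
    predecoded per cycle, and blocks are handled one after another."""
    per_block = {}
    pos = start
    for n in lengths:
        block = (pos + n - 1) // 16        # block in which the insn ends
        per_block[block] = per_block.get(block, 0) + 1
        pos += n
    return sum(math.ceil(c / 5) for c in per_block.values())

print(predecode_cycles([1] * 32))              # 32 nops -> 8 cycles
# Loop bodies start 32-byte aligned, right after the nops (hand-counted lengths):
g1a_body = [5] * 5 + [6] * 8 + [2, 2]          # mov r32,-1 x13; dec ecx; jge
print(predecode_cycles(g1a_body, start=32))    # -> 5 cycles
g1b_body = [7] * 13 + [2, 2]                   # mov r64,-1 x13; dec ecx; jge
print(predecode_cycles(g1b_body, start=32))    # -> 6 cycles
```

Real hardware is messier (boundary-crossing handling, length-changing prefixes), so treat this only as an illustration of why the longer 64-bit encodings push the loop body from 5 to 6 predecode cycles.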

If you compare this trace with the trace for g1b (image: trace table for g1b), you can see that the instructions after the nops now require 6 instead of 5 cycles to be predecoded, because several of the instructions in g1b are longer than the corresponding ones in g1a.

Andreas Abel
  • Great explanation and cool simulator! In the results you link, g2a and g2b actually choose different ports. How do you simulate that? (I haven't read your paper yet; maybe later.) – moep0 Nov 29 '21 at 01:32
  • I have read section 2.12 of your paper. Can it explain why `dec edi` goes to port 1 but `dec rdi` goes to port 0? – moep0 Nov 29 '21 at 01:43
  • @moep0 Yes, `dec edi` uses issue slot 0, whereas `dec rdi` uses issue slot 1, which explains the different port usage. I'm not sure about g2a and g2b, I would need to look into this. – Andreas Abel Nov 29 '21 at 15:16