There's this assembly excerpt:

movl    (%ecx,%edx,4), %eax
...
incl    %eax
movl    %eax, (%ecx,%edx,4)

Which translates into C:

a[i] += 1;

With:

a -> ecx and i -> edx

My question is: what's the point of using eax as a middleman? Is it faster, or is it impossible to increment the memory value directly?

  • Possibly you are generating unoptimized code with the compiler (GCC?). Try compiling with `-O3` – Michael Petch Jun 03 '17 at 14:39
  • It is very likely faster to do it this way, "using eax as a middleman". Breaking up complex CISC-style instructions (that would decompose to a series of load and store µops) into multiple, simpler RISC-style instructions has long been a well-known optimization technique for x86 processors. That said, it is possible to increment memory directly: `incl (%ecx,%edx,4)`, and either you or the compiler could have generated that code. So this really just depends on your compiler's code-gen strategy and any optimization switches you may have set. – Cody Gray - on strike Jun 03 '17 at 14:43
  • [Here is an older article that delves into the relevance and merits of this optimization technique](http://www.emulators.com/docs/nx06_rmw.htm). Not everything he says there is 100% true, and today, you can pretty much ignore the difference between INC and ADD, as the Pentium 4 is obsolete. But still, you can see that the choice here is a rather complicated and debatable issue. Not being sure of your skill level, it would be very hard to write a good answer to this question. Maybe you could elaborate a bit on your motivations for asking? For now, I've voted to close as "too broad". – Cody Gray - on strike Jun 03 '17 at 14:53
  • In both cases, for Skylake, the latencies are the same: 5 cycles. However, the single-instruction approach takes 4 uops while the three-instruction one takes 5 uops (and it is longer). The single instruction puts less pressure on port 4 (the only one doing stores in Skylake). All the mainstream compilers [seem to prefer the one instruction approach](https://godbolt.org/g/hgpCbB). – Margaret Bloom Jun 03 '17 at 16:57
  • Without context (how that assembly excerpt was created), it is impossible to answer this. It's probably slower than incrementing the value directly. Then again, if you are after performance, it would very likely be possible to create better code than incrementing a single array element by one, so it's quite pointless to reason about the performance of 3 assembly instructions. If the algorithm is hopeless, or the data structures are not optimal, 3 vs 1 instructions is a marginal difference. – Ped7g Jun 04 '17 at 16:09
  • I already voted to close this as "too broad", so I can't close it as a duplicate of [this question](https://stackoverflow.com/questions/38034498/are-rmw-instructions-considered-harmful-on-modern-x86), which is really more accurate. If someone else comes by who hasn't cast a vote, they could mark it as a duplicate. – Cody Gray - on strike Jun 05 '17 at 09:20

1 Answer

Thanks for the answers. From what I've gathered, both ways are valid, and the choice is up to the compiler and the optimization level.
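To see this for yourself, here is a minimal sketch that wraps the statement from the question in a function. The function name `increment_element` and the file name `inc.c` are my own placeholders, not anything from the original code, and I'm assuming GCC with 32-bit support installed.

```c
/* inc.c (hypothetical file name)
 *
 * Compare the generated assembly at two optimization levels, e.g.:
 *   gcc -m32 -O0 -S inc.c -o inc_O0.s
 *   gcc -m32 -O3 -S inc.c -o inc_O3.s
 *
 * Assumption: the exact instructions depend on the compiler, its
 * version, and the target, so the registers and instruction choice
 * may not match the excerpt in the question.
 */

/* Hypothetical wrapper around the statement from the question. */
void increment_element(int *a, int i)
{
    a[i] += 1;  /* read-modify-write of one array element */
}
```

Unoptimized output tends to go through a register (load, increment, store), while optimized builds may use a single memory-destination instruction such as the `incl (%ecx,%edx,4)` mentioned in the comments. Either form computes the same result.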

I'm writing this to mark the question answered.