
I noticed in my JIT-compiled program that there are a lot of cases where a register is tested against zero (for example: equal to, not equal to). The code generation would emit something like this (disassembly, with the machine-code bytes below each instruction):

 cmp    $0x0,%r14
    49 83 fe 00 
 je     0x000000000000307a
    0f 84 89 05 00 00 

For example, if I know that a certain register is always unused in my program, would it make sense to use that as a zero-register (initialise once) and do the compare against it in order to shrink the overall program size?

 cmp    %rdx,%rsi
    48 39 d6 

Or, from a modern x86 CPU-internals perspective, would using a register instead actually turn out slower because the CPU does some internal magic with the 0 immediate in the comparison? Are there other, more efficient approaches I've been missing, in terms of both encoding size and cycles?
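
For reference, here is how the two options I'm weighing would encode (AT&T syntax; I'm just picking %r15 as the would-be zero register for illustration):

 cmp    $0x0,%r14          # immediate form: 4 bytes
    49 83 fe 00
 cmp    %r15,%r14          # against a register pinned to zero: 3 bytes
    4d 39 fe

The `je` that follows encodes the same either way, so the saving would be one byte per compare.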

  • My advice: don't be concerned with such minute details up front. Leave it to the compiler/jit'er to make reasonable decisions. Later on, you can profile your code and look at making local changes to improve performance where needed. – 500 - Internal Server Error Sep 25 '19 at 08:53
  • I'm mainly asking because I'm currently writing a JIT compiler (and I'm curious about the CPU internals on that matter). – jblerks Sep 25 '19 at 08:57
  • Still, pick the most obvious implementation first, then measure. Swapping out a compare against literal zero for a compare against a zero register (if you have one) should be trivial enough for you to set up a speed comparison once you get to it. – 500 - Internal Server Error Sep 25 '19 at 09:07
  • Also, even if comparing against a register turns out to be faster, registers are a scarce resource on x86, so permanently setting one aside to hold a zero for comparisons is likely to create pressure elsewhere, where you no longer have that register available for other uses. – 500 - Internal Server Error Sep 25 '19 at 09:09
  • Fair enough, my assumption was/is that this is a commonly known problem among compiler folks, similar to how other optimisations like `xor eax,eax` can be handled at the register-rename stage without needing an execution unit. Therefore I was mainly wondering whether there are similar tricks compilers do in this area. – jblerks Sep 25 '19 at 09:13
  • The language being JITed has fewer registers than x86, so the ones left over can be used as temp/scratch registers from the JIT side. – jblerks Sep 25 '19 at 09:15
  • Note that to compare against zero, the best way is to use the `test` instruction, and x86 is far too register-starved to waste a whole register on a zero. – fuz Sep 25 '19 at 09:18
  • Thanks for the hint, makes sense! Thanks fuz! – jblerks Sep 25 '19 at 09:38
  • On Nehalem, compare against a constant register could actually be worse because of register-file read stalls (search for "register read stalls" in Agner Fog's microarch guide). Otherwise, if you had to pick one of those two, `cmp` against a register is better because of code size. But fortunately you don't have to pick; just use `test same,same` to set flags the same way as `cmp $0, reg`. The actual compare itself is always going to be a single-cycle ALU instruction, and either way it can macro-fuse with a following JCC. – Peter Cordes Sep 25 '19 at 10:42
  • If you're JITing a language that has fewer registers, ideally you'd be using those regs to optimize away stores/reloads in the source language, not just naively transliterating everything. But if your JIT compiler is currently not really optimizing, then sure, globally keep some handy constants in x86 registers, like `0` and maybe `4` for pointer increments (so `add $4, %reg` becomes a shorter reg-reg `add`); see the sketch after this comment thread. – Peter Cordes Sep 25 '19 at 10:47
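
A minimal sketch of what the comments suggest (GAS syntax; the %r15/%r13 register choices and the label name are placeholders of mine, not anything from the thread):

    # One-time setup in the JIT'ed code's prologue:
    xor    %r15d,%r15d        # %r15 = 0; xor-zeroing is handled at register rename, no execution unit
    mov    $4,%r13d           # %r13 = 4, if increments by 4 are common

    # Compare against zero: no pinned register needed at all
    test   %r14,%r14          # 4d 85 f6, 3 bytes; sets flags like cmp $0x0,%r14
    je     taken              # macro-fuses with the test on modern CPUs

    # Pointer increment using the pinned constant
    add    %r13,%rdi          # 4c 01 ef, 3 bytes vs. 48 83 c7 04 (4 bytes) for add $4,%rdi

Since `test`/`jcc` already matches the `cmp $0` sequence in fused-uop count and beats it on size without reserving anything, the pinned-constant trick only buys something for cases like the `add $4` one.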

0 Answers