You seem to have the misconception that every used register gets saved somewhere. This is not the case. The very names "caller saved" and "callee saved" are terrible and misleading: they're based on a braindead model of code-generation, they don't sound very different from each other, and they're hard to think about. See that link for more, but the key point is that call-clobbered aka volatile registers can just "die" without being saved/restored when the value isn't needed after the call (e.g. it was only computed as a function arg). There'd be no point to the caller actually storing it to memory and reloading it afterwards.
Most functions don't need 31 values live in registers at all times, so it's fine to let some of them die across function calls.
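As a tiny made-up example of the kind of value that's allowed to die across a call (f and g are hypothetical names, not from the Godbolt link):

int g(int);                      // some out-of-line function

int f(int x) {
    // x*2 exists only to be passed as an argument.  Neither it nor x is
    // needed after the call, so whichever registers held them can simply
    // be clobbered by g(); nothing gets stored to memory or reloaded.
    return g(x * 2) + 1;
}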
Having some call-preserved registers saves significant static code-size because you don't have to write store/load instructions before / after every function call: the save/restore happens at most once, inside the callee, for the whole function. Most functions are called from multiple call-sites; that's why they're functions instead of just getting inlined.
(A smart compiler doing link-time optimization will do this inlining for you if there was only one call site, so high-level software-engineering / maintenance reasons for having separate functions are mostly irrelevant when we're talking about asm for modern systems.)
Most non-leaf functions make multiple function calls, so saving/restoring a couple call-preserved registers around the whole function lets you keep values in them across each of the calls your function makes. So you get more bang for your buck in terms of total instructions executed.
Also, in a loop calling a leaf function (one that makes no calls) that's fairly simple (i.e. it doesn't need to touch any of the call-preserved registers to get enough scratch registers for its own purposes), neither the loop nor the callee needs to do any spills / reloads. On an ISA with plenty of registers like RISC-V, a leaf function can do quite a bit with the generous number of scratch registers that exist. (So it can be big enough to justify not inlining even if it doesn't need any register save/restore.) Of course virtual functions and other cases of indirection can also prevent inlining, leading to calls to smaller leaf functions.
Related re: efficiency of a calling convention, and the tradeoff between more vs. fewer scratch vs. call-preserved regs:
Examples:
From RISC-V clang 10.0 on the Godbolt compiler explorer, with -O3 full optimization. (Without optimization, compilers always keep variables in memory which would totally defeat the point.)
int bar(int x) { return x + (x<<1) - 2; }
bar(int):
        addi    a1, zero, 3        # note use of a1 as a scratch reg that wasn't an input
        mul     a0, a0, a1         # apparently clang tunes for very efficient mul
        addi    a0, a0, -2         # retval in a0
        ret
If we'd had to save/restore a1 just to get some scratch space to compute a simple expression, that would have taken several extra instructions to move the stack pointer and store/reload. And assuming our caller didn't have anything it cared about in a1, it wouldn't have bothered saving/restoring it either.
int foo(int orig) {
    int t = bar(10);
    t = bar(t + orig);
    return bar(t + orig);
}
foo(int):
        addi    sp, sp, -16
        sw      ra, 12(sp)         # save link register
        sw      s0, 8(sp)          # save a call-preserved reg
        add     s0, zero, a0       # and copy orig into it
        addi    a0, zero, 10
        call    bar(int)           # t = bar(10) in a0
        add     a0, a0, s0         # a0 += orig
        call    bar(int)           # t = bar(t + orig) in a0
        add     a0, a0, s0         # a0 += orig
        lw      s0, 8(sp)
        lw      ra, 12(sp)         # restore s0 and ra
        addi    sp, sp, 16         # dealloc stack space
        tail    bar(int)           # tail-call jump to bar(t + orig)
Notice that the t + orig temporary value "dies" at each function call. It's not available after the call because the caller doesn't need it afterwards, so it doesn't save it anywhere. In this case it was computed in a0, so it simply gets overwritten by bar's return value. If I'd used a more complicated expression, that might have involved leaving other intermediate values in a1, a2, or other registers that the calling convention also clobbers.
Even named C variables can be allowed to "die" if their value isn't needed later. Like if I'd done int t2 = bar(t + orig); and used that later, the value of t isn't needed so the code-gen could be identical. Modern compilers like clang/LLVM optimize by transforming your source into SSA form where there's basically no difference between overwriting an old variable or initializing a new variable. (Except in debug builds.)
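For instance, a variant like this (foo2 is just an illustrative name; it uses the same bar as above) could plausibly compile to exactly the same asm as foo, because t is dead as soon as t2 exists:

int foo2(int orig) {
    int t = bar(10);
    int t2 = bar(t + orig);      // t is never read again after this line,
    return bar(t2 + orig);       // so t2 can simply reuse whatever register held t
}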
This is fully compatible with the definition of bar above: foo was generated by the same compiler, for the same calling convention.
(Despite the fact that they're in the same file, so the compiler could see both, it isn't bending the calling convention into a custom convention for these two simple functions. If it were doing that instead of inlining, it would pass args to bar in different registers than the incoming arg to foo, so foo wouldn't have to save / restore s0. Maybe it would even use a different return-address register so foo could avoid reserving any stack space: RISC-V's call is just an alias for jal with ra getting the return address. Of course for a simple function like this it's obviously better to just inline it, but I used __attribute__((noinline)) to force clang not to do that.)
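In source, that looks roughly like the following (a reconstruction; the attribute is the only point, the body is the same bar as before):

__attribute__((noinline))        // force clang to emit a real call to bar instead of inlining it into foo
int bar(int x) { return x + (x<<1) - 2; }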
Also included in the Godbolt link is a loop that does arr[i] = func(i);. (That func could be simple like bar(), only using scratch regs.) As you can see, it saves some call-preserved registers at the top of the looping function so it can keep its loop variables in registers across the call inside the loop.
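The C source for that loop isn't shown here, but reconstructed from the asm below it would look something like this (the exact signature of test2 and the name extfunc are assumptions read off the asm):

int extfunc(int);

void test2(int *arr, int n) {
    for (int i = 0; i < n; i++)
        arr[i] = extfunc(i);     // i, the array pointer, and n stay in s0, s1, s2 across each call
}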
test2:
        # ... save regs and set up s0 = i = 0
        #     s1 = pointer into array
        #     s2 = n
.LBB2_2:                           # do {
        add     a0, zero, s0
        call    extfunc(int)
        sw      a0, 0(s1)          # *p = retval
        addi    s0, s0, 1          # i++
        addi    s1, s1, 4          # p++
        bne     s2, s0, .LBB2_2    # }while(i != n)
        # then some cleanup
So it takes a bunch of instructions before/after the loop, but those run once per function invocation. The loop body runs n times, so minimizing the instructions in it is approximately n times more valuable for performance. (Potentially more than n if store/reload would have created a store-forwarding latency bottleneck.)
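To put illustrative numbers on it (these aren't measured from the Godbolt output): if saving/restoring ra and three s-registers costs about 10 one-time instructions in the prologue/epilogue, but keeping i, the pointer, and n in call-preserved registers avoids about 3 store/reload pairs (6 instructions) around the call every iteration, the one-time cost is paid back within the first couple of iterations, and for large n the per-iteration saving completely dominates.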