You seem to have the misconception that every used register gets saved somewhere. This is not the case. The very names "caller saved" and "callee saved" are terrible and misleading: they're based on a braindead model of code-generation, they don't sound very different from each other, and they're hard to think about. See that link for more, but the key point is that call-clobbered aka volatile registers can just "die" without being saved/restored when the value isn't needed after the call (e.g. it was only computed as a function arg). There'd be no point to the caller actually storing it to memory and reloading it afterwards.
Most functions don't need 31 values live in registers at all times, so it's fine to let some of them die across function calls.
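As a tiny made-up example of the kind of value that's allowed to die across a call (f and g are hypothetical names, not from the Godbolt link):

int g(int);                      // some out-of-line function

int f(int x) {
    // x*2 exists only to be passed as an argument.  Neither it nor x is
    // needed after the call, so whichever registers held them can simply
    // be clobbered by g(); nothing gets stored to memory or reloaded.
    return g(x * 2) + 1;
}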
Having some call-preserved registers saves significant static code-size because you don't have to write store/load instructions before / after every function call: the save/restore happens at most once, inside the callee, for the whole function. Most functions are called from multiple call-sites; that's why they're functions instead of just getting inlined.
(A smart compiler doing link-time optimization will do this inlining for you if there was only one call site, so high-level software-engineering / maintenance reasons for having separate functions are mostly irrelevant when we're talking about asm for modern systems.)
Most non-leaf functions make multiple function calls, so saving/restoring a couple call-preserved registers around the whole function lets you keep values in them across each of the calls your function makes. So you get more bang for your buck in terms of total instructions executed.
Also, in a loop calling a leaf function (one that makes no calls) that's fairly simple (i.e. it doesn't need to touch any of the call-preserved registers to get enough scratch registers for its own purposes), neither the loop nor the callee needs to do any spills / reloads. On an ISA with plenty of registers like RISC-V, a leaf function can do quite a bit with the generous number of scratch registers that exist. (So it can be big enough to justify not inlining even if it doesn't need any register save/restore.) Of course virtual functions and other cases of indirection can also prevent inlining, leading to calls to smaller leaf functions.
Related re: efficiency of a calling convention, and the tradeoff between more vs. fewer scratch vs. call-preserved regs:
Examples:
From RISC-V clang 10.0 on the Godbolt compiler explorer, with -O3 full optimization. (Without optimization, compilers always keep variables in memory which would totally defeat the point.)
int bar(int x) { return x + (x<<1) - 2; }
bar(int):
        addi    a1, zero, 3        # note use of a1 as a scratch reg that wasn't an input
        mul     a0, a0, a1         # apparently clang tunes for very efficient mul
        addi    a0, a0, -2         # retval in a0
        ret
If we'd had to save/restore a1 just to get some scratch space to compute a simple expression, that would have taken several extra instructions to move the stack pointer and store/reload. And assuming our caller didn't have anything it cared about in a1, it wouldn't have bothered saving/restoring it either.
int foo(int orig) {
    int t = bar(10);
    t = bar(t + orig);
    return bar(t + orig);
}
foo(int):
        addi    sp, sp, -16
        sw      ra, 12(sp)         # save link register
        sw      s0, 8(sp)          # save a call-preserved reg
        add     s0, zero, a0       # and copy orig into it
        addi    a0, zero, 10
        call    bar(int)           # t = bar(10) in a0
        add     a0, a0, s0         # a0 += orig
        call    bar(int)           # t = bar(t + orig) in a0
        add     a0, a0, s0         # a0 += orig
        lw      s0, 8(sp)
        lw      ra, 12(sp)         # restore s0 and ra
        addi    sp, sp, 16         # dealloc stack space
        tail    bar(int)           # tail-call jump to bar(t + orig)
Notice that the t + orig temporary value "dies" at each function call. It's not available after the call because the caller doesn't need it afterwards, so it doesn't save it anywhere. In this case it was computed in a0, so it simply gets overwritten by bar's return value. If I'd used a more complicated expression, that might have involved leaving other intermediate values in a1, a2, or other registers that the calling convention also clobbers.
Even named C variables can be allowed to "die" if their value isn't needed later. Like if I'd done int t2 = bar(t + orig); and used that later, the value of t isn't needed so the code-gen could be identical. Modern compilers like clang/LLVM optimize by transforming your source into SSA form where there's basically no difference between overwriting an old variable or initializing a new variable. (Except in debug builds.)
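For instance, a variant like this (foo2 is just an illustrative name; it uses the same bar as above) could plausibly compile to exactly the same asm as foo, because t is dead as soon as t2 exists:

int foo2(int orig) {
    int t = bar(10);
    int t2 = bar(t + orig);      // t is never read again after this line,
    return bar(t2 + orig);       // so t2 can simply reuse whatever register held t
}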
This is fully compatible with the definition of bar above: foo was generated by the same compiler, for the same calling convention.
(Despite the fact that they're in the same file, so the compiler could see both, it isn't bending the calling convention into a custom convention for these two simple functions. If it were doing that instead of inlining, it would pass args to bar in different registers than the incoming arg to foo, so foo wouldn't have to save / restore s0. Maybe it would even use a different return-address register so foo could avoid reserving any stack space: RISC-V's call is just an alias for jal with ra getting the return address. Of course for a simple function like this it's obviously better to just inline it, but I used __attribute__((noinline)) to force clang not to do that.)
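In source, that looks roughly like the following (a reconstruction; the attribute is the only point, the body is the same bar as before):

__attribute__((noinline))        // force clang to emit a real call to bar instead of inlining it into foo
int bar(int x) { return x + (x<<1) - 2; }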
Also included in the Godbolt link is a loop that does arr[i] = func(i);. (That func could be simple like bar(), only using scratch regs.) As you can see, it saves some call-preserved registers at the top of the looping function so it can keep its loop variables in registers across the call inside the loop.
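The C source for that loop isn't shown here, but reconstructed from the asm below it would look something like this (the exact signature of test2 and the name extfunc are assumptions read off the asm):

int extfunc(int);

void test2(int *arr, int n) {
    for (int i = 0; i < n; i++)
        arr[i] = extfunc(i);     // i, the array pointer, and n stay in s0, s1, s2 across each call
}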
test2:
        # ... save regs and set up s0 = i = 0
        #     s1 = pointer into array
        #     s2 = n
.LBB2_2:                           # do {
        add     a0, zero, s0
        call    extfunc(int)
        sw      a0, 0(s1)          # *p = retval
        addi    s0, s0, 1          # i++
        addi    s1, s1, 4          # p++
        bne     s2, s0, .LBB2_2    # }while(i != n)
        # then some cleanup
So it takes a bunch of instructions before/after the loop, but those run once per function invocation. The loop body runs n times, so minimizing the instructions in it is approximately n times more valuable for performance. (Potentially more than n if store/reload would have created a store-forwarding latency bottleneck.)
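To put illustrative numbers on it (these aren't measured from the Godbolt output): if saving/restoring ra and three s-registers costs about 10 one-time instructions in the prologue/epilogue, but keeping i, the pointer, and n in call-preserved registers avoids about 3 store/reload pairs (6 instructions) around the call every iteration, the one-time cost is paid back within the first couple of iterations, and for large n the per-iteration saving completely dominates.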