Those instructions aren't equivalent. For LEA, rax is a pure output. For your hoped-for add, it's rax += rsi so the compiler would have to mov %rdi, %rax first. That's less efficient so it doesn't do that.
lea is a totally normal way for compilers to implement dst = src1 + src2, saving a mov instruction. In general, don't expect C operators to compile to instructions named after them, especially small left-shifts and add, or multiply by 3, 5, or 9, because those are prime targets for optimization with LEA. e.g. lea (%rsi, %rsi, 2), %rax implements result = y*3. See Using LEA on values that aren't addresses / pointers? for more. LEA is also useful to avoid destroying either of its inputs, if they're both needed later.
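For example, source like this (hypothetical stand-alone functions, compiled with optimization enabled) is the kind of thing that typically turns into a single LEA rather than an imul or a mov+add:

    // Hypothetical examples: multiply-by-3 and a non-destructive add are prime LEA fodder.
    long scale3(long y) {
        return y * 3;        // typically  lea (%rdi,%rdi,2), %rax   -- y + y*2, no imul
    }

    long sum_both_live(long a, long b) {
        return a + b;        // typically  lea (%rdi,%rsi), %rax     -- neither input clobbered
    }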
Assuming you meant t3 to be the same variable as temp3, clang does compile it the way you were expecting: it does a better job of register allocation, so it can use a shorter, more efficient add as the last instruction instead of needing lea, without any extra mov instructions (Godbolt). That saves code-size (LEA with an indexed addressing mode needs a SIB byte), and add has slightly better throughput than LEA on most CPUs, like 4/clock instead of 2/clock.
Clang also optimized the shift pair into andl $1, %eax / negq %rax to create the 0 or -1 result of that shift-left / arithmetic-shift-right bit-broadcast. And it used 32-bit operand-size for the first few steps, because the shifts throw away all but the low bit of temp1 anyway.
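For reference, here's my guess at the C source, reconstructed from the asm below; I'm assuming it was roughly this (names taken from your temp1 / temp2 / temp3):

    // Reconstructed guess at the source, inferred from GCC's asm below.
    long someFunc(long x, long y, long z)
    {
        long temp1 = (x - z) * x;
        long temp2 = (temp1 << 63) >> 63;   // broadcast the low bit to every bit: 0 or -1
                                            // (shifting a signed long into the sign bit is
                                            //  technically UB in ISO C, but this is what the
                                            //  sal/sar pair implements)
        long temp3 = temp2 ^ x;             // x or ~x
        return temp3 + y;                   // the final add / lea
    }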
# side by side comparison, like the Godbolt diff pane
clang:                    |   gcc:
    movl    %edi, %eax    |       movq    %rdi, %rax
    subl    %edx, %eax    |       subq    %rdx, %rdi
    imull   %edi, %eax    |       imulq   %rax, %rdi           # temp1
    andl    $1, %eax      |       salq    $63, %rdi
    negq    %rax          |       sarq    $63, %rdi            # temp2
    xorq    %rdi, %rax    |       xorq    %rax, %rdi           # temp3
    addq    %rsi, %rax    |       leaq    (%rdi,%rsi), %rax    # answer
    retq                  |       ret
Notice that clang chose imul %edi, %eax (into RAX) but GCC chose to multiply into RDI. That's the difference in register allocation that leads to GCC needing an lea at the end instead of an add.
Compilers sometimes even get stuck with an extra mov instruction at the end of a small function when they make poor choices like this, if the last operation wasn't something like addition that can be done with lea as a non-destructive op-and-copy. These are missed-optimization bugs; you can report them on GCC's bugzilla.
Other missed optimizations
GCC and clang could have optimized the imul into an and: the low bit of a product is set only if both inputs are odd, and only the low bit of temp1 survives the shifts, so and gives the same final result.
Also, since only the low bit of the sub output matters, XOR (add without carry) would have worked, or even addition! (odd +- even = odd, even +- even = even, odd +- odd = even.) That would have allowed an lea instead of mov/sub as the first instruction.
    lea    (%rdi,%rdx), %eax     # x+z: same low bit as x-z
    and    %edi, %eax            # x & (x+z): same low bit as (x-z)*x
    andl   $1, %eax              # keep only the low bit
    negq   %rax                  # temp2
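At the C level, the combined rewrite would look something like this sketch (hypothetical function name; this is what compilers could legally have produced from your source, not what they actually emit):

    // Sketch (hypothetical name), assuming the reconstructed source above:
    // only the low bit of temp1 ever matters.
    long someFunc_lowbit(long x, long y, long z)
    {
        long bit   = (x + z) & x & 1;   // same low bit as (x - z) * x
        long temp2 = -bit;              // 0 or -1, replacing the sal/sar pair
        return y + (temp2 ^ x);         // y + x  or  y + ~x
    }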
Let's make a truth table for the low bits of x and z to see how this shakes out if we want to optimize more / differently:
    # truth table for the low bit: the input to the shifts that broadcast it to all bits
     x&1 | z&1 | x-z = x^z | x*(x-z) = x & (x-z)
      0  |  0  |     0     |        0
      0  |  1  |     1     |        0
      1  |  0  |     1     |        1
      1  |  1  |     0     |        0
                                  = x & (~z)   = BMI1 andn
So temp2 = (x^z) & x & 1 ? -1 : 0. But also temp2 = -((x & ~z) & 1).
We can rearrange that to -((x&1) & ~z), which lets us start with not z and and $1, x in parallel, for better ILP. Or if z might be ready first, we could do the early operations on it and shorten the critical path from x -> answer, at the expense of the z -> answer latency.
Or with a BMI1 andn instruction which does (~z) & x, we can do this in one instruction. (Plus another to isolate the low bit)
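If you don't trust the truth table, a quick brute-force check of the temp2 identity over small inputs confirms it (a throwaway test harness, not part of the function):

    // Brute-force check that -(((x - z) * x) & 1)  ==  -((x & ~z) & 1),
    // over a small range so the multiply can't overflow (which would be UB).
    #include <assert.h>

    int main(void)
    {
        for (long x = -8; x <= 8; x++) {
            for (long z = -8; z <= 8; z++) {
                long temp2_orig    = -(((x - z) * x) & 1);   // 0 or -1, like the shift broadcast
                long temp2_rewrite = -((x & ~z) & 1);
                assert(temp2_orig == temp2_rewrite);
            }
        }
        return 0;
    }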
I think this function has the same behaviour for every possible input, so compilers could have emitted it from your source code. This is one version you could wish your compiler had emitted:
    # hand-optimized
    # long someFunc(long x, long y, long z)
    someFunc:
        not    %edx                  # ~z
        and    $1, %edx              # ~z & 1
        and    %edi, %edx            # (x&1) & ~z  = low bit of temp1
        neg    %rdx                  # temp2 = 0 or -1
        xor    %rdi, %rdx            # temp3 = x or ~x
        lea    (%rsi, %rdx), %rax    # answer = y + temp3
        ret
So there's still no ILP, unless z is ready before x and/or y. Using an extra mov instruction, we could do x&1 in parallel with not z.
Possibly you could do something with test/setz or cmov, but IDK if that would beat lea/and (temp1) + and/neg (temp2) + xor + add.
I haven't looked into optimizing the final xor and add, but note that temp3 is basically a conditional NOT of x. You could maybe improve latency at the expense of throughput by calculating both ways at once and selecting between them with cmov, possibly using the 2's complement identity that -x - 1 = ~x. Maybe we could improve ILP / latency by doing x+y first and then correcting it with something that depends on the x and z condition? But since we can't subtract with LEA, it seems best to just NOT and ADD.
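In C, the compute-both-and-select idea looks roughly like this (hypothetical name; whether the ternary actually becomes a cmov is up to the compiler):

    // Compute both candidates, then select branchlessly; hopefully lea/add + cmov.
    // (Hypothetical name, same signature as the reconstructed source above.)
    long someFunc_select(long x, long y, long z)
    {
        long plain   = y + x;    // result when temp2 == 0
        long flipped = y + ~x;   // result when temp2 == -1  (~x == -x - 1)
        return ((x & ~z) & 1) ? flipped : plain;
    }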
    # return y + x  or  y + (~x)  according to the condition on x and z
    someFunc:
        lea     (%rsi, %rdi), %rax   # y + x
        andn    %edi, %edx, %ecx     # ecx = x & (~z)
        not     %rdi                 # ~x
        add     %rsi, %rdi           # y + (~x)
        test    $1, %cl              # check the low bit of x & (~z)
        cmovnz  %rdi, %rax           # select between y+x and y+~x
        retq
This has more ILP, but needs BMI1 andn to still be only 6 (single-uop) instructions. Broadwell and later have single-uop CMOV; on earlier Intel it's 2 uops.
The other version (the hand-optimized asm above) could be 5 uops using BMI1 andn.
In this version, the first 3 instructions can all run in the first cycle, assuming x, y, and z are all ready. Then in the 2nd cycle, ADD and TEST can both run. In the 3rd cycle, CMOV can run, taking its integer inputs from LEA and ADD, and its flag input from TEST. So the total latency from x -> answer, y -> answer, or z -> answer is 3 cycles in this version (assuming single-uop / single-cycle cmov). Great if it's on the critical path, not very relevant if it's part of an independent dep chain where only throughput matters.
vs. 5 (andn) or 6 cycles (without) for the previous attempt. Or even worse for the compiler output using imul instead of and (3 cycle latency just for that instruction).