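foo:
        addi $t0,$zero,1        # swapped = 1, forces the first pass
        addi $v0,$zero,0        # comparison counter = 0
outer:
        beq  $t0,$zero,exitout  # leave when the last pass made no swap
        sll  $t0,$t0,0          # delay slot (nop)
        addi $t1,$zero,0        # i = 0
        add  $t2,$zero,$a0      # p = lst
        addi $t0,$zero,0        # swapped = 0
inner:  addi $t8,$a1,-1         # n - 1
        slt  $t9,$t1,$t8        # i < n - 1 ?
        beq  $t9,$zero,outer    # no: pass finished, back to outer
        sll  $t0,$t0,0          # delay slot (nop)
        addi $v0,$v0,1          # count this comparison
        lw   $t8,0($t2)         # p[0]
        lw   $t9,4($t2)         # p[1]
        slt  $t7,$t9,$t8        # p[1] < p[0] ?
        beq  $t7,$zero,skip     # already ordered: no swap
        sll  $t0,$t0,0          # delay slot (nop)
        lw   $t8,0($t2)
        lw   $t9,4($t2)
        sw   $t9,0($t2)         # swap p[0] and p[1]
        sw   $t8,4($t2)
        addi $t0,$zero,1        # swapped = 1
skip:   addi $t2,$t2,4          # p++
        j    inner
        addi $t1,$t1,1          # delay slot: i++
exitout:
        jr   $ra                # return
        sll  $t0,$t0,0          # delay slot (nop)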
Question
What is the number of clock cycles per instruction (CPI) when foo is executed with the call foo(lst,1), that is, with the second argument of the C code below changed from 3 to 1? Do not include the time it takes to call the function, but you must include the clock cycles for returning. That is, count the clock cycles up until just before the next instruction is fetched after returning from foo. Include the branch delay slots in the instruction count. The first instruction is located at address 0x40003300.
int lst[] ={100,23,8};
int r = foo(lst,3);
Here is what I don't understand: Don't we only execute 11 instructions, from the 1st to the 10th and then the 4th again? Or do we also include the branch penalty instructions, in which case we would have 15? Also, how can 4 misses give 40 cycles? And we never get far enough to execute 3 branches. Can someone help me?
Answers
By following the program, we can see that no lw or sw instruction is executed, so there are no data memory accesses. Hence, we do not have any data cache misses. Only 4 instruction cache blocks are touched. Hence, we have 4 instruction cache misses, which imply a cost of 40 clock cycles. If we count the number of executed instructions, we get 12 instructions without penalties for hazards. We do not have any stalls due to data hazards, but we have 3 branch instructions that give a penalty of 3 + 3 + 1 = 7 clock cycles. Hence, we have in total 40 + 12 + 7 = 59 clock cycles. If we count the executed instructions including the delay slot instructions, we get 15. Hence, the CPI is 59/15 ≈ 3.93.
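The counts in the answer can be checked with a short Python sketch. It is only a rough check under stated assumptions: the instruction cache is assumed to start empty and to use 16-byte (4-word) blocks, and the 10-cycle miss penalty is inferred from "4 misses, 40 cycles". None of these values are given in the post. The 12 base cycles and the 3 + 3 + 1 branch penalty are copied from the answer, not derived here. The trace is written out by hand from the listing for a1 = 1.

```python
from fractions import Fraction

BASE = 0x40003300    # address of the first instruction (from the question)
BLOCK_BYTES = 16     # ASSUMPTION: 4-word instruction cache blocks
MISS_PENALTY = 10    # ASSUMPTION: inferred from "4 misses give 40 cycles"

# Hand-written dynamic trace for foo(lst, 1), as 0-based instruction
# indices into the listing (labels take no address). Delay slots included.
TRACE = [
    0, 1,      # addi $t0 ; addi $v0
    2, 3,      # beq at outer (not taken, $t0 = 1) ; delay slot
    4, 5, 6,   # addi $t1 ; add $t2 ; addi $t0
    7, 8,      # inner: addi $t8 ; slt $t9 (0 < 0 is false)
    9, 10,     # beq to outer (taken) ; delay slot
    2, 3,      # beq at outer (taken, $t0 = 0) ; delay slot
    25, 26,    # jr $ra ; delay slot
]

addresses = [BASE + 4 * i for i in TRACE]
instruction_count = len(addresses)

# Every distinct block is one miss, assuming a cold cache.
blocks_touched = {a // BLOCK_BYTES for a in addresses}
miss_cycles = len(blocks_touched) * MISS_PENALTY

BASE_CYCLES = 12            # copied from the answer
BRANCH_PENALTY = 3 + 3 + 1  # copied from the answer

total_cycles = miss_cycles + BASE_CYCLES + BRANCH_PENALTY
cpi = Fraction(total_cycles, instruction_count)

print("instructions:", instruction_count)
print("blocks touched:", len(blocks_touched))
print("total cycles:", total_cycles)
print("CPI:", cpi, "=", float(cpi))
```

Under these assumptions the trace has 11 instructions plus 4 delay slots, which is where the 15 comes from, and it touches the blocks starting at offsets 0x00, 0x10, 0x20 and 0x60 from the function start. The beq at outer is executed twice and the beq at inner once, followed by the jr.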