I was working through this example (from the subheading "Simple Multiple-Issue Code Scheduling" in Section 4.10 of Computer Organization and Design, RISC-V Edition, by Patterson and Hennessy), which is related to VLIWs:
How would this loop be scheduled on a static two-issue pipeline for RISC-V?
Loop: ldx31, 0(x20) // x31=array element
addx31, x31, x21 // add scalar in x21
sdx31, 0(x20) // store result
addix20, x20, -8 // decrement pointer
bltx22, x20, Loop // compare to loop limit,
// branch if x20 > x22

Reorder the instructions to avoid as many pipeline stalls as possible. Assume branches are predicted, so that control hazards are handled by the hardware.
(The book does not have a space between the instruction mnemonics and rd in the example, as you can see above. I assume this is a typo, but I wanted to quote the book exactly.)
It is worth mentioning that, according to the example, slot 1 in the processor is for ALU or branch instructions and slot 2 is for data transfer instructions.
The answer is that the instructions can be ordered as follows (Fig. 4.67 of the book mentioned above):
| ALU or branch instruction | Data transfer instruction | Clock cycle |
|---|---|---|
| | ld x31, 0(x20) | 1 |
| addi x20, x20, -8 | | 2 |
| add x31, x31, x21 | | 3 |
| blt x22, x20, Loop | sd x31, 8(x20) | 4 |
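One detail of the schedule that is easy to miss: because the `addi` now runs before the `sd`, the store offset changes from `0(x20)` to `8(x20)`. The following is a minimal Python sketch of that equivalence. It assumes a toy model in which memory is a dictionary keyed by byte address and each loop body runs in the listed order; it ignores pipeline timing entirely. The function names `run_original` and `run_scheduled` and the sample values are hypothetical, chosen only for illustration.

```python
def run_original(mem, x20, x21, x22):
    """Program order from the question: ld, add, sd, addi, blt."""
    mem = dict(mem)                 # work on a copy of the toy memory
    while True:
        x31 = mem[x20]              # ld   x31, 0(x20)
        x31 = x31 + x21             # add  x31, x31, x21
        mem[x20] = x31              # sd   x31, 0(x20)
        x20 = x20 - 8               # addi x20, x20, -8
        if not (x22 < x20):         # blt  x22, x20, Loop
            return mem


def run_scheduled(mem, x20, x21, x22):
    """Order from the schedule table: ld, addi, add, then blt with sd."""
    mem = dict(mem)
    while True:
        x31 = mem[x20]              # cycle 1: ld   x31, 0(x20)
        x20 = x20 - 8               # cycle 2: addi x20, x20, -8
        x31 = x31 + x21             # cycle 3: add  x31, x31, x21
        mem[x20 + 8] = x31          # cycle 4: sd   x31, 8(x20)
        if not (x22 < x20):         # cycle 4: blt  x22, x20, Loop
            return mem


# Hypothetical data: five doublewords at byte addresses 8, 16, ..., 40.
memory = {addr: addr // 8 for addr in range(8, 48, 8)}

original = run_original(memory, x20=40, x21=100, x22=0)
scheduled = run_scheduled(memory, x20=40, x21=100, x22=0)
print(original == scheduled)
```

In this model the `sd` is always performed for the current iteration, whether or not the branch is taken, which is the behavior the original program order requires.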
Here comes my question: if we look at clock cycle 4, the blt and sd instructions execute in parallel. I am a little confused here, because I would assume that when the blt condition evaluates to true, it updates the PC to jump to Loop:. From what I am used to, this happens in the Execute stage or in the Decode stage. Will this not flush the sd instruction, so that the data is never stored? Or do we assume that the processor handles this?