Question 5 Deep Pipe Branching [35 pts] You are designing the branch control logic for your company's flagship RISC-V pr
Posted: Tue May 24, 2022 8:46 am
Question 5 Deep Pipe Branching [35 pts] You are designing the branch control logic for your company's flagship RISC-V processor A. Processor A has 15 pipeline stages, the first few stages and their descriptions are shown below: Stage Name Description 1 IF1 Begin instruction fetch 2 IF2 Complete instruction fetch Begin instruction decode 3 ID1 4 ID2 Complete instruction decode; PC- relative branch/jump target ready in this stage. 5 REG1 Begin register file access 6 REG2 Complete register file access; Register-relative branch/jump target ready in this stage. 7 EX1 Begin ALU execution 8 EX2 ALU execution; Branch direction ready in this stage. 9 EX3 Complete ALU execution ⠀ ...
You are examining the performance of the following particularly important benchmark pro- gram for your company: Ox0AFFFFD8| # 1 addi x1, zero, 100 5 OxOAFFFFDCI addi x2, zero, # 2 OXOAFFFFEO | # 3 addi x3, zero, OxA00 addi x4, zero, 0xB00 Ox0AFFFFE8 loop: sub x3, x3, 4 OXOAFFFFE41 # 4 # 5 OXOAFFFFECI sub x4, x4, 4 # 6 OXOAFFFFF0| 1w x5, 0(x3) # 7 OxOAFFFFF41 lw x6, 0(x4) # 8 OXOAFFFFF4| add x7, x7, x5 # 9 OxOAFFFFF8| add x7, x7, x6 # 10 OxOAFFFFFCI sub x1, x1, x2 # 11 0x0B0000001 bne x1, zero, loop # 12 0x0B0000041 sll x1, x7, 2 # 13 0x0B0000081 add x1, x1, x7 # 14 Ox0B00000C1 sll x5, x5, 3 # 15 0x0B0000101 SW x5, 0(x3) # 16 Ox0B0000141 sll x6, x6, 3 # 17 0x0B0000181 SW x6, 0(x4) # 18 0x0B00001CI SW x7, 4(x3) # 19 0x0B0000201 SW x1, 4(x4) # 20
Part(a) [5 pts] Consider the branch instruction for this part. Assume processor A relies on software to fill instructions in the branch delay slots. How many delay slots are needed given the above pipeline configuration? Explain your answer by showing the sequence of instructions in the above program that will be fetched in the pipeline between the branch instruction at Ox0B000000 to the branch target instruction at OxOAFFFFE8.
Part (b) [5 pts] Rearrange the above code to fill in the necessary delay slots without alternating the program's original function. Put the label "loop" at the appropriate instruction as needed. Write NOP for slots that you cannot fill with instruction. To save time, you do not need to copy the entire instruction. Simply refer to an instruction by its instruction number shown in the comment on the right. You may not need to fill in all rows. You can assume full forwarding is available in the processor. Address Label Instruction Number OxOAFFFFD8 1 OxOAFFFFDC OXOAFFFFEO OXOAFFFFE4 OxOAFFFFE8 OxOAFFFFEC OXOAFFFFFO OxOAFFFFF4 OXOAFFFFF4
OxOAFFFFF4 OxOAFFFFF4 OxOAFFFFF8 OxOAFFFFFC Ox0B000000 0x0B000004 0x0B000008 Ox0B00000C OxOB000010 OxOB000014 OxOB000018 OxOB00001C Ox0B000020 Ox0B000024 Ox0B000028 Ox0B00002C Ox0B000030
Part (c) [5 pts] After you have fabricated Processor A hardware, you found that the jal and jalr instructions are not behaving correctly: you observed that some instructions that should be skipped were executed as normal. In this buggy hardware, there is no special hardware to detect and handle these jump instructions. Both jal and jalr are executed correctly according to the pipeline definition above. What is likely the cause of the problem with jal and jalr? Since the hardware is already fixed, can a software re-compilation produce code that work works correctly on this buggy Processor A to support both jal and jalr? Explain what the software compiler must do. If there is no way to support jal and jalr in the current processor with only software technique, explain why that is the case.
Part (d) [5 pts] Your company decides to implement a branch prediction scheme to improve branch performance. Let h be the portion of branch instructions that the predictor can correctly predict the branch direction, where 0 ≤h≤1. Assume all non-branch instructions have CPI= 1 in processor A. A program P contains 15% branch instructions. What is the average CPI of program P when runs on processor A in terms of h?
Part(e) [5 pts] Let f be the portion of delay slots that a typical compiler can fill with the pipeline design of processor A, where 0≤ f≤1. What percentage of branch prediction hit (h) is needed for the branch prediction scheme to result in faster program execution of P than the original scheme that uses delay slot?
Part (f) [10 pts] Based on your answers above, or otherwise, explain how the depth of pipeline may affect the effectiveness of using branch delay slots and the effectiveness of using a branch predictor. Pay particular attention to the stage at which branch directions are resolved.
You are examining the performance of the following particularly important benchmark pro- gram for your company: Ox0AFFFFD8| # 1 addi x1, zero, 100 5 OxOAFFFFDCI addi x2, zero, # 2 OXOAFFFFEO | # 3 addi x3, zero, OxA00 addi x4, zero, 0xB00 Ox0AFFFFE8 loop: sub x3, x3, 4 OXOAFFFFE41 # 4 # 5 OXOAFFFFECI sub x4, x4, 4 # 6 OXOAFFFFF0| 1w x5, 0(x3) # 7 OxOAFFFFF41 lw x6, 0(x4) # 8 OXOAFFFFF4| add x7, x7, x5 # 9 OxOAFFFFF8| add x7, x7, x6 # 10 OxOAFFFFFCI sub x1, x1, x2 # 11 0x0B0000001 bne x1, zero, loop # 12 0x0B0000041 sll x1, x7, 2 # 13 0x0B0000081 add x1, x1, x7 # 14 Ox0B00000C1 sll x5, x5, 3 # 15 0x0B0000101 SW x5, 0(x3) # 16 Ox0B0000141 sll x6, x6, 3 # 17 0x0B0000181 SW x6, 0(x4) # 18 0x0B00001CI SW x7, 4(x3) # 19 0x0B0000201 SW x1, 4(x4) # 20
Part(a) [5 pts] Consider the branch instruction for this part. Assume processor A relies on software to fill instructions in the branch delay slots. How many delay slots are needed given the above pipeline configuration? Explain your answer by showing the sequence of instructions in the above program that will be fetched in the pipeline between the branch instruction at Ox0B000000 to the branch target instruction at OxOAFFFFE8.
Part (b) [5 pts] Rearrange the above code to fill in the necessary delay slots without alternating the program's original function. Put the label "loop" at the appropriate instruction as needed. Write NOP for slots that you cannot fill with instruction. To save time, you do not need to copy the entire instruction. Simply refer to an instruction by its instruction number shown in the comment on the right. You may not need to fill in all rows. You can assume full forwarding is available in the processor. Address Label Instruction Number OxOAFFFFD8 1 OxOAFFFFDC OXOAFFFFEO OXOAFFFFE4 OxOAFFFFE8 OxOAFFFFEC OXOAFFFFFO OxOAFFFFF4 OXOAFFFFF4
OxOAFFFFF4 OxOAFFFFF4 OxOAFFFFF8 OxOAFFFFFC Ox0B000000 0x0B000004 0x0B000008 Ox0B00000C OxOB000010 OxOB000014 OxOB000018 OxOB00001C Ox0B000020 Ox0B000024 Ox0B000028 Ox0B00002C Ox0B000030
Part (c) [5 pts] After you have fabricated Processor A hardware, you found that the jal and jalr instructions are not behaving correctly: you observed that some instructions that should be skipped were executed as normal. In this buggy hardware, there is no special hardware to detect and handle these jump instructions. Both jal and jalr are executed correctly according to the pipeline definition above. What is likely the cause of the problem with jal and jalr? Since the hardware is already fixed, can a software re-compilation produce code that work works correctly on this buggy Processor A to support both jal and jalr? Explain what the software compiler must do. If there is no way to support jal and jalr in the current processor with only software technique, explain why that is the case.
Part (d) [5 pts] Your company decides to implement a branch prediction scheme to improve branch performance. Let h be the portion of branch instructions that the predictor can correctly predict the branch direction, where 0 ≤h≤1. Assume all non-branch instructions have CPI= 1 in processor A. A program P contains 15% branch instructions. What is the average CPI of program P when runs on processor A in terms of h?
Part(e) [5 pts] Let f be the portion of delay slots that a typical compiler can fill with the pipeline design of processor A, where 0≤ f≤1. What percentage of branch prediction hit (h) is needed for the branch prediction scheme to result in faster program execution of P than the original scheme that uses delay slot?
Part (f) [10 pts] Based on your answers above, or otherwise, explain how the depth of pipeline may affect the effectiveness of using branch delay slots and the effectiveness of using a branch predictor. Pay particular attention to the stage at which branch directions are resolved.