A.3 Vector Extension You are working with a new processor with a vector extension unit. The vector extension unit provid

Post by **answerhappygod** » Mon May 02, 2022 12:02 pm


Please answer A.3.3. I will give a thumbs-up if the answer is helpful.
A.3 Vector Extension

You are working with a new processor with a vector extension unit. The vector extension unit provides SIMD support for vector operations on a set of dedicated 512-bit registers:

• 16 SIMD registers, 512 bits wide, named xr0 to xr15
• Support for SIMD operations on 16 single-precision floating-point numbers in each cycle
• Special instructions to load/store data between main memory and the SIMD registers
• Special instructions to pass 32-bit values to/from the general-purpose register file

A.3.1 The following code forms the performance bottleneck of your application:

    for (i = 0; i < N; i++) {
        A[i] = A[i] * B[i] + B[i] * C[i];
    }

    # x5 = N
    loop: fld   f1, 0(x3)      # load B
          fld   f2, 0(x4)      # load C
          fmult f2, f1, f2     # B * C
          fld   f0, 0(x2)      # load A
          fmult f0, f1, f0     # A * B
          fadd  f0, f0, f2
          fsd   f0, 0(x2)
          addi  x1, x1, 1      # incr i
          addi  x2, x2, 4
          addi  x3, x3, 4
          addi  x4, x4, 4
          bne   x1, x5, loop

Ignore stalls due to data dependencies and the cache for this part. Assume fmult takes 20 cycles, fadd takes 3 cycles, and all the remaining instructions take 1 cycle. What are the values of (i) the average CPI of the above code segment; (ii) the total run time in terms of N, in cycles; and (iii) the number of floating-point operations (add or multiply) per cycle?

Average CPI: ____
Total run time: ____ cycles
Number of floating-point ops per cycle: ____
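
The arithmetic for A.3.1 can be sketched as a small cycle model. This is only an illustration under the question's own simplification (no stalls, so the cycle count is the sum of the per-instruction latencies); the helper name `loop_metrics` and the instruction list are assumptions made for this sketch, transcribed from the loop above.

```python
# Cycle model for the scalar loop in A.3.1 (stalls ignored, as the question states).
LATENCY = {"fmult": 20, "fadd": 3}  # every other instruction is taken as 1 cycle

# One loop iteration, in program order.
SCALAR_BODY = ["fld", "fld", "fmult", "fld", "fmult", "fadd", "fsd",
               "addi", "addi", "addi", "addi", "bne"]


def loop_metrics(body, latency, flops_per_iter, elems_per_iter=1):
    """Return (CPI, cycles per array element, FLOPs per cycle) for one loop body."""
    cycles = sum(latency.get(op, 1) for op in body)
    cpi = cycles / len(body)
    return cpi, cycles / elems_per_iter, flops_per_iter / cycles


# Each iteration does 2 multiplies and 1 add on one element.
cpi, cycles_per_elem, flops_per_cycle = loop_metrics(SCALAR_BODY, LATENCY, flops_per_iter=3)
print(f"CPI = {cpi:.3f}")
print(f"run time = {cycles_per_elem:g} * N cycles")
print(f"FLOPs/cycle = {flops_per_cycle:.4f}")
```

Under these assumptions one iteration costs 52 cycles over 12 instructions, so the model gives a CPI of about 4.33, a run time of 52N cycles, and 3/52 ≈ 0.058 floating-point operations per cycle.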

A.3.2 The vector extension provides the following instructions:

| Instruction | Description | Latency |
|---|---|---|
| vxlf vxD, offset(xs1) | Load 16 single-precision floating-point values from memory location offset + (xs1) into vector extension register vxD | 16 |
| vxsf vxS, offset(xs1) | Store 16 single-precision floating-point values from vector extension register vxS to memory location offset + (xs1) | 16 |
| vxadd vxD, vxS1, vxS2 | Add the vectors in vxS1 and vxS2 and store the results in destination vxD | 5 |
| vxmult vxD, vxS1, vxS2 | Multiply each element of vxS1 and vxS2 and store the results in destination vxD | 10 |

Assume N is a multiple of 16. Complete the following code using the new vector extension instructions:

    # x5 = N
    vxloop: vxlf vx0, 0(x3)      # load B[i] to B[i+15]
            # (code to be completed)
            bne  x1, x5, vxloop
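
One plausible way to fill in the loop can be checked with a small functional model. The sketch below is not the official solution: the register allocation (vx0 to vx2), the instruction order, and the Python helpers that stand in for the four vector instructions are all assumptions, and list indices stand in for byte addresses (one index step corresponds to 4 bytes). Each Python statement carries the assembly line it models as a comment.

```python
VLEN = 16  # single-precision lanes in one 512-bit register


def vxlf(mem, addr):            # models: vxlf vxD, offset(xs1)
    return mem[addr:addr + VLEN]


def vxsf(mem, addr, reg):       # models: vxsf vxS, offset(xs1)
    mem[addr:addr + VLEN] = reg


def vxadd(a, b):                # models: vxadd vxD, vxS1, vxS2
    return [x + y for x, y in zip(a, b)]


def vxmult(a, b):               # models: vxmult vxD, vxS1, vxS2
    return [x * y for x, y in zip(a, b)]


def vector_loop(A, B, C):
    """Compute A[i] = A[i]*B[i] + B[i]*C[i] in place, 16 elements per iteration."""
    n = len(A)
    assert n > 0 and n % VLEN == 0, "N must be a positive multiple of 16"
    i = 0
    while True:                        # vxloop:
        vx0 = vxlf(B, i)               #   vxlf   vx0, 0(x3)     load B[i..i+15]
        vx1 = vxlf(C, i)               #   vxlf   vx1, 0(x4)     load C[i..i+15]
        vx1 = vxmult(vx0, vx1)         #   vxmult vx1, vx0, vx1  B * C
        vx2 = vxlf(A, i)               #   vxlf   vx2, 0(x2)     load A[i..i+15]
        vx2 = vxmult(vx0, vx2)         #   vxmult vx2, vx0, vx2  A * B
        vx2 = vxadd(vx2, vx1)          #   vxadd  vx2, vx2, vx1
        vxsf(A, i, vx2)                #   vxsf   vx2, 0(x2)
        i += VLEN                      #   addi x1, x1, 16; addi x2/x3/x4 by 64 bytes
        if i == n:                     #   bne  x1, x5, vxloop
            break
    return A


A = [float(k) for k in range(32)]
B = [2.0] * 32
C = [3.0] * 32
result = vector_loop(A[:], B, C)
```

Note the changed strides in this version: the counter advances by 16 elements and each pointer by 64 bytes (16 values of 4 bytes each), which is why N must be a multiple of 16.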

A.3.3 Based on your code from the previous part, using the new vector extension, what are the values of (i) the average CPI of the above code segment; (ii) the total run time in terms of N, in cycles; and (iii) the number of floating-point operations (add or multiply) per cycle?

Average CPI: ____
Total run time: ____ cycles
Number of floating-point ops per cycle: ____
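
A hedged sketch of how A.3.3 could be worked out, assuming a 12-instruction loop body (three vxlf, two vxmult, one vxadd, one vxsf, four addi, one bne). That instruction mix is an assumption about the completed code, not something the question confirms. The latencies are read from the A.3.2 table as vxlf 16, vxsf 16, vxadd 5, vxmult 10, with the scalar instructions taken as 1 cycle and stalls ignored as in A.3.1.

```python
VLEN = 16  # elements processed per loop iteration

# Latencies as read from the A.3.2 table; scalar addi/bne assumed to take 1 cycle.
VEC_LATENCY = {"vxlf": 16, "vxsf": 16, "vxadd": 5, "vxmult": 10}

# Assumed loop body for the vectorized code (hypothetical completion of A.3.2).
VECTOR_BODY = ["vxlf", "vxlf", "vxmult", "vxlf", "vxmult", "vxadd", "vxsf",
               "addi", "addi", "addi", "addi", "bne"]

cycles_per_iter = sum(VEC_LATENCY.get(op, 1) for op in VECTOR_BODY)
vec_cpi = cycles_per_iter / len(VECTOR_BODY)

# The loop runs N/16 times, so the run time is (cycles_per_iter / 16) * N cycles.
vec_cycles_per_elem = cycles_per_iter / VLEN

# Each iteration performs 2 multiplies and 1 add on each of 16 elements.
vec_flops_per_cycle = (3 * VLEN) / cycles_per_iter

print(f"cycles per iteration = {cycles_per_iter}")
print(f"CPI = {vec_cpi:.3f}")
print(f"run time = {vec_cycles_per_elem:g} * N cycles")
print(f"FLOPs/cycle = {vec_flops_per_cycle:.4f}")
```

With this instruction mix, one iteration costs 94 cycles, giving a CPI of about 7.83, a run time of 94N/16 = 5.875N cycles, and 48/94 ≈ 0.51 floating-point operations per cycle. A different instruction count or latency assignment changes all three numbers, so the figures should be recomputed against the actual A.3.2 answer.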