A.3 Vector Extension You are working with a new processor with a vector extension unit. The vector extension unit provid
Posted: Mon May 02, 2022 12:02 pm
Please answer A.3.4 to A.3.6. I will thumb up if the answer is
helpful.
A.3 Vector Extension

You are working with a new processor with a vector extension unit. The vector extension unit provides SIMD support for vector operations on a set of dedicated 512-bit registers:

• 16 SIMD registers, 512 bits wide, named xr0 to xr15
• Support for SIMD operations on 16 single-precision floating-point numbers in each cycle
• Special instructions to load/store data between main memory and the SIMD registers
• Special instruction to pass 32-bit values to/from the general-purpose register file
A.3.1 The following code forms the performance bottleneck of your application:

    for (i = 0; i < N; i++) {
        A[i] = A[i] * B[i] + B[i] * C[i];
    }

            # x5 = N
    loop:   fld   f1, 0(x3)      # load B
            fld   f2, 0(x4)      # load C
            fmult f2, f1, f2     # B * C
            fld   f0, 0(x2)      # load A
            fmult f0, f1, f0     # A * B
            fadd  f0, f0, f2
            fsd   f0, 0(x2)
            addi  x1, x1, 1      # incr i
            addi  x2, x2, 4
            addi  x3, x3, 4
            addi  x4, x4, 4
            bne   x1, x5, loop

Ignore stalls due to data dependency and cache for this part. Assume fmult takes 20 cycles, fadd takes 3 cycles, and all the remaining instructions take 1 cycle. What are the values of (i) the average CPI of the above code segment; (ii) the total run time in terms of N and cycles; and (iii) the number of floating point operations (add or multiply) per cycle?

Average CPI:
Total run time: ___ cycles
Number of floating point ops per cycle:
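One way to set up the arithmetic for A.3.1, offered as a sketch rather than a model answer. It assumes that the instruction whose opcode was lost in the post is an fld, that every instruction in the loop body executes once per iteration, and that, with stalls ignored, the cycles per iteration are simply the sum of the stated latencies. The names LATENCY, LOOP_BODY, and loop_stats are illustrative only.

```python
from fractions import Fraction

# Latencies stated in A.3.1; any opcode not listed takes 1 cycle.
LATENCY = {"fmult": 20, "fadd": 3}

# Opcodes of the 12-instruction loop body, in program order.
# Assumption: the fourth instruction (opcode missing in the post) is fld.
LOOP_BODY = ["fld", "fld", "fmult", "fld", "fmult", "fadd", "fsd",
             "addi", "addi", "addi", "addi", "bne"]

def loop_stats(body=LOOP_BODY):
    """Return (cycles per iteration, average CPI, FP ops per cycle)."""
    cycles = sum(LATENCY.get(op, 1) for op in body)
    flops = sum(op in ("fmult", "fadd") for op in body)
    return cycles, Fraction(cycles, len(body)), Fraction(flops, cycles)

cycles, cpi, flops_per_cycle = loop_stats()
print(f"cycles per iteration: {cycles} (total run time: {cycles} * N)")
print(f"average CPI: {float(cpi):.2f}")
print(f"FP ops per cycle: {float(flops_per_cycle):.4f}")
```

Under these assumptions the loop costs 52 cycles per iteration, so the total is 52N cycles, the average CPI is 52/12, and the loop performs 3 floating point operations every 52 cycles.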
A.3.2 The vector extension provides the following instructions:

Instruction               Latency  Description
vxlf vxD, offset(xs1)     16       Load 16 single-precision floating point values from memory location offset + (xs1) into vector extension register vxD
vxsf vxS, offset(xs1)     16       Store 16 single-precision floating point values from vector extension register vxS to memory location offset + (xs1)
vxadd vxD, vxS1, vxS2     5        Add the vectors in vxS1 and vxS2 and store the results in destination vxD
vxmult vxD, vxS1, vxS2    10       Multiply each element of vxS1 and vxS2 and store the results in destination vxD
Assume N is a multiple of 16. Complete the following code using the new vector extension instructions:

            # x5 = N
    vxloop: vxlf vx0, 0(x3)      # load B[i] to B[i+15]
            ...
            bne  x1, x5, vxloop
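To make the 16-elements-per-iteration structure concrete, here is a small Python emulation. It is not the assembly answer itself. It assumes the loop computes A[i] = A[i] * B[i] + B[i] * C[i] as in A.3.1, and the helpers vxlf, vxsf, vxmult, and vxadd are illustrative stand-ins for the instructions of the same names, operating on Python lists instead of registers and memory.

```python
VLEN = 16  # single-precision elements per 512-bit register

def vxlf(mem, i):
    """Stand-in for vxlf: load 16 consecutive elements starting at index i."""
    return mem[i:i + VLEN]

def vxsf(mem, i, reg):
    """Stand-in for vxsf: store 16 elements back starting at index i."""
    mem[i:i + VLEN] = reg

def vxmult(a, b):
    """Stand-in for vxmult: element-wise product."""
    return [x * y for x, y in zip(a, b)]

def vxadd(a, b):
    """Stand-in for vxadd: element-wise sum."""
    return [x + y for x, y in zip(a, b)]

def vector_loop(A, B, C):
    """Process 16 elements per iteration; len(A) must be a multiple of 16."""
    n = len(A)
    assert n % VLEN == 0
    for i in range(0, n, VLEN):
        vb = vxlf(B, i)
        vc = vxlf(C, i)
        va = vxlf(A, i)
        vxsf(A, i, vxadd(vxmult(va, vb), vxmult(vb, vc)))

def scalar_loop(A, B, C):
    """Reference: the original one-element-per-iteration loop."""
    for i in range(len(A)):
        A[i] = A[i] * B[i] + B[i] * C[i]
```

In this sketch the vector loop body runs N/16 times and issues three loads and one store per iteration, where the scalar loop runs N times with the same number of memory instructions per iteration.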
A.3.4 Your processor has a data cache with the following characteristics:

• Capacity: 1 MiB
• Physically tagged
• 4-way set associative, true LRU
• 8-word line size

The 3 arrays are located at the following physical addresses:

Array   Address
A[]     0x10000000
B[]     0x10000000 + 4N
C[]     0x10000000 + 8N

A vxlf instruction is equivalent to loading 16 consecutive words from memory into the vector extension register with 1 instruction. In contrast, the original code loads the data 1 word per iteration. Comparing the 2 code segments, after executing the entire loop, which code segment is likely to spend more time loading data due to the effect of the cache? Explain your answer in terms of spatial and temporal locality, and use a concrete example to support your answer. Consider 2 cases: N = 1024 and N = 2^20.

A.3.5 If the processor cache is changed to a 4 MiB direct-mapped cache, how would it change the data access behavior of the vectorized and original code?

A.3.6 If instead the arrays are allocated at the following locations:

Array   Address
A[]     0x10000000
B[]     0x10000010 + 4N
C[]     0x10000020 + 8N

How would that affect the data access behavior of the vectorized and original code?
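A starting point for A.3.4 to A.3.6 is to work out which cache set each array maps to. The sketch below does that arithmetic under several assumptions that are not stated in the question: words are 4 bytes (so a line is 32 bytes), the cache is indexed with the physical addresses given, the line size stays the same for the 4 MiB direct-mapped cache, and the second case is N = 2^20. The helper names (num_sets, set_index, array_bases, lines_touched) are illustrative only.

```python
WORD = 4            # bytes per single-precision word (assumed)
LINE = 8 * WORD     # 8-word line = 32 bytes
BASE = 0x10000000   # address of A[0]
MIB = 1 << 20

def num_sets(capacity, ways):
    """Number of sets in a cache of the given capacity and associativity."""
    return capacity // (LINE * ways)

def set_index(addr, capacity, ways):
    """Set that the line containing addr maps to."""
    return (addr // LINE) % num_sets(capacity, ways)

def array_bases(n, skew=(0, 0, 0)):
    """Addresses of A[0], B[0], C[0]; skew models the extra A.3.6 offsets."""
    return [BASE + k * 4 * n + s for k, s in enumerate(skew)]

def lines_touched(addr, nbytes):
    """Number of cache lines covered by an access of nbytes at addr."""
    return (addr + nbytes - 1) // LINE - addr // LINE + 1

for n in (1024, 1 << 20):
    bases = array_bases(n)
    four_way = [set_index(b, 1 * MIB, 4) for b in bases]
    direct = [set_index(b, 4 * MIB, 1) for b in bases]
    print(f"N={n}: 4-way sets {four_way}, direct-mapped sets {direct}")

# A.3.6 layout: how many lines does one 64-byte vxlf touch per array?
skewed = array_bases(1 << 20, skew=(0, 0x10, 0x20))
print([lines_touched(b, 16 * WORD) for b in skewed])
```

With these assumptions, for N = 1024 the three arrays start in different sets, while for N = 2^20 element i of A, B, and C falls in the same set under both configurations. A 4-way set can hold the three competing lines at once; a direct-mapped set can hold only one, which is where conflict misses would come from. In the A.3.6 layout, B is no longer line-aligned, so a 16-word vector load of B spans three lines instead of two.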