Question 2 Dot Products [35 pts]

Dot products of two vectors form the basis of many important computer programs. Let a and x be vectors of length N, and let a_k and x_k, where k = 0, 1, ..., N-1, be the elements of the two vectors respectively. The dot product then computes:

b = a . x = sum_{k=0}^{N-1} a_k * x_k    (1)

The following is the cache design of your 32-bit processor:

Capacity: 1 MiB
Organization: Direct mapped
Line width: 4 words
Write policy: Write back; write allocate
Hit time: 1 cycle
Miss penalty: 200 cycles
General latency reading from memory: 196 + b cycles, where b is the consecutive reading block size in words

You can ignore effects of the pipeline and assume a base CPI = 1 for all instructions in this processor, except multiplication, which has CPI = 30.
Part (a) [7 pts] A simple implementation of dot products is shown below:

float dotproduct(float a[N], float x[N]) {
    int k;
    float v;
    v = 0;
    for (k = 0; k < N; k++) {
        v = v + a[k] * x[k];
    }
    return v;
}

A float is 32 bits wide. Let N = 1024, the base address of a be 0xA000000, and the base address of x be 0xB000000. Describe the pattern of data cache hits/misses that the above code produces. What is the miss rate? Assume the data cache is initially empty.
Part (b) [14 pts] Assuming the array addresses of a and x cannot be changed, rearrange or rewrite the code such that the cache performance can be improved. (i) Show your newly arranged code; (ii) Explain the memory access pattern and the cache miss rate. (iii) Estimate the performance improvement.
Part (c) [14 pts] Your project team discovered experimental code using an undocumented vector ISA extension in your processor. Your task is to evaluate its performance. The following shows C-like pseudocode for this implementation with the vector extension. The datatype vector<float> refers to a 16-element vector of floats stored in a dedicated vector register file with 8 vector registers.

float dotproduct_v(float a[N], float x[N]) {
    vector<float> vtmp;  // A vector of 16 floats
    float result = 0.0;
    for (int i = 0; i < N; i += 16) {
        vtmp = vmult(a, i, x, i);
        result = result + vreduce(vtmp);
    }
    return result;
}

vector<float> vmult(float a[N], int aoff, float x[N], int xoff) {
    vector<float> res, tmp0, tmp1;
    tmp0 <- Read 16 floats from memory location (a + 4 * aoff)
    tmp1 <- Read 16 floats from memory location (x + 4 * xoff)
    res <- Perform 16 multiplications in parallel on each element pair of tmp0 and tmp1.
    return res;
}

float vreduce(vector<float> vtmp) {
    float res;
    // res = add all elements in vtmp; latency around 20 cycles.
    return res;
}
Estimate the performance of this vectorized dot product computation compared to the original baseline implementation in Part (a). Quantify your performance comparison. List any assumptions you have made about the architecture. Hint: Focus on cache performance and memory access time in both cases, and estimate the time needed for computation.