Assignment 7


Please submit individual source files for coding exercises (see naming conventions below) and a single solution document for non-coding exercises (.txt or .pdf only). Your code and answers need to be documented to the point that the graders can understand your thought process. Full credit will not be awarded if sufficient work is not shown.


 

Suppose we’ve got a procedure that computes the inner product of two vectors u and v.

 

Consider the following C code:

 

void inner(float *u, float *v, int length, float *dest) {
    int i;
    float sum = 0.0f;

    for (i = 0; i < length; ++i) {
        sum += u[i] * v[i];
    }

    *dest = sum;
}
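As a quick sanity check of what inner computes, the following is a minimal sketch that calls it on two three-element vectors. The helper name inner_demo and the sample values are assumptions made for illustration; they are not part of the assignment.

```c
/* inner as given in the assignment text. */
void inner(float *u, float *v, int length, float *dest) {
    int i;
    float sum = 0.0f;

    for (i = 0; i < length; ++i) {
        sum += u[i] * v[i];
    }

    *dest = sum;
}

/* Hypothetical helper (not part of the assignment): computes the
 * inner product of (1, 2, 3) and (4, 5, 6), i.e. 1*4 + 2*5 + 3*6. */
float inner_demo(void) {
    float u[] = {1.0f, 2.0f, 3.0f};
    float v[] = {4.0f, 5.0f, 6.0f};
    float result;

    inner(u, v, 3, &result);
    return result;
}
```

Calling inner_demo() should return 32.0f, since every product and partial sum here is a small integer that a float represents exactly.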

 

The x86-64 assembly code for the inner loop is as follows:

 

# u in %rbx, v in %rax, length in %rcx, i in %rdx, sum in %xmm1

.L87:
    movss (%rbx, %rdx, 4), %xmm0    # Get u[i]
    mulss (%rax, %rdx, 4), %xmm0    # Multiply by v[i]
    addss %xmm0, %xmm1              # Add to sum
    addq $1, %rdx                   # Increment i
    cmpq %rcx, %rdx                 # Compare i to length
    jl .L87                         # If <, keep looping

  1. [20] Diagram how this instruction sequence would be decoded into operations and show the data dependencies between them. Use Figure 5.14 as a guide. Include your diagram in your solutions document.

  2. [20] Which operation(s) in the loop can NOT be pipelined? Why? What are the latencies of these operations? Based on this, what is the lower latency bound (in terms of CPE) of the procedure? Assume that float addition has a CPE of 3, float multiplication has a CPE of 5, and all integer operations have a CPI of 1. Write your answers in your solutions document.

  3. [40] Implement a procedure inner2 that is functionally equivalent to inner but uses four-way loop unrolling with four parallel accumulators. Also implement a main function to test your procedure. Name your source file 7-1.c.

  4. [20] Using your code from part 3, collect data on the execution times of inner and inner2 with varying vector lengths. Summarize your findings and argue whether inner or inner2 is more efficient than the other (or not). Create a graph using appropriate data points to support your argument. Include your summary and graph in your solutions document.
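The four-way unrolling with four parallel accumulators described in the exercises, and the timing comparison that follows it, might be sketched roughly as below. This is one possible shape under stated assumptions, not a reference solution: the accumulator names, the cleanup loop, the inner_fn typedef, and the seconds_per_call helper are all illustrative choices, and timing with clock() is only one of several reasonable measurement methods.

```c
#include <time.h>

/* inner as given in the assignment text. */
void inner(float *u, float *v, int length, float *dest) {
    int i;
    float sum = 0.0f;

    for (i = 0; i < length; ++i) {
        sum += u[i] * v[i];
    }

    *dest = sum;
}

/* Sketch of four-way unrolling with four independent accumulators.
 * Each accumulator forms its own dependency chain, so consecutive
 * float additions no longer have to wait on one another. */
void inner2(float *u, float *v, int length, float *dest) {
    int i;
    int limit = length - 3;   /* last index at which a full group of 4 fits */
    float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;

    for (i = 0; i < limit; i += 4) {
        sum0 += u[i]     * v[i];
        sum1 += u[i + 1] * v[i + 1];
        sum2 += u[i + 2] * v[i + 2];
        sum3 += u[i + 3] * v[i + 3];
    }

    /* Cleanup loop for the 0 to 3 elements left over. */
    for (; i < length; ++i) {
        sum0 += u[i] * v[i];
    }

    *dest = (sum0 + sum1) + (sum2 + sum3);
}

/* Hypothetical timing helper: average CPU seconds per call over reps calls. */
typedef void (*inner_fn)(float *, float *, int, float *);

double seconds_per_call(inner_fn fn, float *u, float *v, int length, int reps) {
    float result = 0.0f;
    volatile float sink = 0.0f;   /* keeps the calls from being optimized away */
    clock_t start, end;
    int r;

    start = clock();
    for (r = 0; r < reps; ++r) {
        fn(u, v, length, &result);
        sink += result;
    }
    end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

Because inner2 adds the products in a different order than inner, the two results can differ slightly in the low-order bits for general float data, so a test in main may want to compare within a small tolerance rather than for exact equality. For timing, calling seconds_per_call with both inner and inner2 over a range of lengths yields the data points for the graph.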

 

 

Zip the source files and solution document (if applicable), name the .zip file <Your Full Name>Assignment7.zip (e.g., EricWillsAssignment7.zip), and upload the .zip file to Canvas (see Assignments section for submission link).