$30.00
Description
Part I: PairDistribution Computation with CUDA
In this part, you will write a CUDA program to compute a histogram nhist of atomic pair distances in molecular dynamics simulation:
for all histogram bins i 

nhist[i] = 0 

for all atomic pairs (i,j) 

is the 
Δ 
û] 

++nhist[ë 
“# 

Here, 
_{“}_{#} 
distance between atomic pair (i, j) and Dr is the histogram bin size. The maximum atomicpair 

distance 
with the periodic boundary condition is the diagonal of half the simulation box, 

&'( ^{=} 
*+[] ^{/} 
, 

01,3,4 
/ 
and with N_{hbin} bins for the histogram, the bin size is Dr = R_{max}/N_{hbin}. Here, al[a] is the simulation box size in the ath direction. With the minimalimage convention, however, the maximum distance, for which the histogram is meaningful, is half the simulation box length, min_{– ∈{1,3,4} }/2 .
After computing the pairdistance histogram nhist, the pair distribution function (PDF) at distance r_{i} = (i+1/2)Dr is defined as _{ “ } = ℎ ( ) 2 _{“}^{/}Δ , where r is the number density of atoms and N is the total number of atoms.
(Assignment)

Modify the sequential PDF computation program pdf0.c to a CUDA program, following the lecture note on “Pair distribution computation on GPU”. Submit your code.

Run your program by reading the atomic configuration pos.d (both pdf0.c and pos.d are available at the class homepage). Plot the resulting pair distribution function, using N_{hbin} = 2000. Submit your plot.
Part II: MPI+OpenMP+CUDA Computation of p
In this part, you will write a tripledecker MPI+OpenMP+CUDA program to compute the value of p, by modifying the doubledecker MPI+CUDA program, hypi_setdevice.cu, described in the lecture note on “Hybrid MPI+OpenMP+CUDA Programming”.
Your implementation should utilize two CPU cores and two GPU devices on each compute node. This is achieved by launching one MPI rank per node, where each rank spawns two OpenMP threads that run on different CPU cores and use different GPU devices as shown in the left figure on the next page. You can employ spatial decomposition in the MPI+OpenMP layer as follows (for the CUDA layer, leave the interleaved assignment of quadrature points to CUDA threads in hypi_setdevice.cu as it is); see the right figure on the next page.

#define NUM_DEVICE
2 // # of GPU devices
= # of OpenMP threads
…
// In main()
// My MPI rank
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Comm_size(MPI_COMM_WORLD,&nproc);
// #
of MPI processes
omp_set_num_threads(NUM_DEVICE);
// One OpenMP thread per GPU device
nbin = NBIN/(nproc*NUM_DEVICE);
// #
of bins per OpenMP thread
step = 1.0/(float)(nbin*nproc*NUM_DEVICE);
…
#pragma omp parallel private(list the variables that need private copies)
{
…
mpid = omp_get_thread_num();
offset = (NUM_DEVICE*myid+mpid)*step*nbin; // Quadraturepoint offset
cudaSetDevice(mpid%2);
…
}
1
Make sure to list all variables that need private copies in the private clause for the omp parallel directive.
The above OpenMP multithreading will introduce a race condition for variable pi. This can be circumvented by data privatization, i.e., by defining float pid[NUM_DEVICE] and using the array elements as dedicated accumulators for the OepnMP threads (or GPU devices).
To report which of the two GPUs have been used for the run, insert the following lines within the OpenMP parallel block:
cudaGetDevice(&dev_used);
printf(“myid = %d; mpid = %d: device used = %d; partial pi = %f\n”, myid,mpid,dev_used,pid[mpid]);
where int dev_used is the ID of the GPU device (0 or 1) that was used, myid is the MPI rank, mpid is the OpenMP thread ID, pid[mpid] is a partial sum per OpenMP thread.
(Assignment)

Submit your MPI+OpenMP+CUDA code.

Run your code on 2 nodes, requesting 2 cores and 2 GPUs per node. Submit your output, which should look like the following:
myid = 0; mpid = 0: device used = 0; partial pi = 0.979926
myid = 0; mpid = 1: device used = 1; partial pi = 0.874671
myid = 1; mpid = 0: device used = 0; partial pi = 0.719409
myid = 1; mpid = 1: device used = 1; partial pi = 0.567582 PI = 3.141588
2