Assignment 6 Hybrid MPI+OpenMP+CUDA Programming Solution



Part I: Pair-Distribution Computation with CUDA

In this part, you will write a CUDA program to compute a histogram nhist of atomic pair distances in molecular dynamics simulation:

for all histogram bins i

nhist[i] = 0

for all atomic pairs (i,j)

is the







distance between atomic pair (i, j) and Dr is the histogram bin size. The maximum atomic-pair


with the periodic boundary condition is the diagonal of half the simulation box,

&'( =

*+[-] /




and with Nhbin bins for the histogram, the bin size is Dr = Rmax/Nhbin. Here, al[a] is the simulation box size in the a-th direction. With the minimal-image convention, however, the maximum distance, for which the histogram is meaningful, is half the simulation box length, min– ∈{1,3,4} /2 .

After computing the pair-distance histogram nhist, the pair distribution function (PDF) at distance ri = (i+1/2)Dr is defined as = ℎ ( ) 2 /Δ , where r is the number density of atoms and N is the total number of atoms.


  1. Modify the sequential PDF computation program pdf0.c to a CUDA program, following the lecture note on “Pair distribution computation on GPU”. Submit your code.

  2. Run your program by reading the atomic configuration pos.d (both pdf0.c and pos.d are available at the class homepage). Plot the resulting pair distribution function, using Nhbin = 2000. Submit your plot.

Part II: MPI+OpenMP+CUDA Computation of p

In this part, you will write a triple-decker MPI+OpenMP+CUDA program to compute the value of p, by modifying the double-decker MPI+CUDA program,, described in the lecture note on “Hybrid MPI+OpenMP+CUDA Programming”.

Your implementation should utilize two CPU cores and two GPU devices on each compute node. This is achieved by launching one MPI rank per node, where each rank spawns two OpenMP threads that run on different CPU cores and use different GPU devices as shown in the left figure on the next page. You can employ spatial decomposition in the MPI+OpenMP layer as follows (for the CUDA layer, leave the interleaved assignment of quadrature points to CUDA threads in as it is); see the right figure on the next page.

#define NUM_DEVICE

2 // # of GPU devices

= # of OpenMP threads

// In main()

// My MPI rank



// #

of MPI processes


// One OpenMP thread per GPU device

nbin = NBIN/(nproc*NUM_DEVICE);

// #

of bins per OpenMP thread

step = 1.0/(float)(nbin*nproc*NUM_DEVICE);

#pragma omp parallel private(list the variables that need private copies)


mpid = omp_get_thread_num();

offset = (NUM_DEVICE*myid+mpid)*step*nbin; // Quadrature-point offset




Make sure to list all variables that need private copies in the private clause for the omp parallel directive.

The above OpenMP multithreading will introduce a race condition for variable pi. This can be circumvented by data privatization, i.e., by defining float pid[NUM_DEVICE] and using the array elements as dedicated accumulators for the OepnMP threads (or GPU devices).

To report which of the two GPUs have been used for the run, insert the following lines within the OpenMP parallel block:


printf(“myid = %d; mpid = %d: device used = %d; partial pi = %f\n”, myid,mpid,dev_used,pid[mpid]);

where int dev_used is the ID of the GPU device (0 or 1) that was used, myid is the MPI rank, mpid is the OpenMP thread ID, pid[mpid] is a partial sum per OpenMP thread.


  1. Submit your MPI+OpenMP+CUDA code.

  1. Run your code on 2 nodes, requesting 2 cores and 2 GPUs per node. Submit your output, which should look like the following:

myid = 0; mpid = 0: device used = 0; partial pi = 0.979926

myid = 0; mpid = 1: device used = 1; partial pi = 0.874671

myid = 1; mpid = 0: device used = 0; partial pi = 0.719409

myid = 1; mpid = 1: device used = 1; partial pi = 0.567582 PI = 3.141588