Part I: Pair-Distribution Computation with CUDA
In this part, you will write a CUDA program to compute a histogram nhist of atomic pair distances in a molecular dynamics simulation:
for all histogram bins i
    nhist[i] = 0
for all atomic pairs (i,j)
    ++nhist[rij/Δr]
Here, rij is the distance between atomic pair (i,j) and Δr is the histogram bin size. The maximum atomic-pair distance with the periodic boundary condition is the diagonal of half the simulation box,

Rmax = sqrt(Σ_{α=x,y,z} (al[α]/2)^2),

and with Nhbin bins for the histogram, the bin size is Δr = Rmax/Nhbin. Here, al[α] is the simulation box size in the α-th direction. With the minimal-image convention, however, the maximum distance for which the histogram is meaningful is half the simulation box length, min_{α∈{x,y,z}} al[α]/2.
After computing the pair-distance histogram nhist, the pair distribution function (PDF) at distance ri = (i+1/2)Δr is defined as g(ri) = nhist[i]/(2π ri^2 Δr ρ N), where ρ is the number density of atoms and N is the total number of atoms.
(Assignment)
Part II: MPI+OpenMP+CUDA Computation of π
In this part, you will write a triple-decker MPI+OpenMP+CUDA program to compute the value of π, by modifying the double-decker MPI+CUDA program, hypi_setdevice.cu, described in the lecture note on “Hybrid MPI+OpenMP+CUDA Programming”.
Your implementation should utilize two CPU cores and two GPU devices on each compute node. This is achieved by launching one MPI rank per node, where each rank spawns two OpenMP threads that run on different CPU cores and use different GPU devices as shown in the left figure on the next page. You can employ spatial decomposition in the MPI+OpenMP layer as follows (for the CUDA layer, leave the interleaved assignment of quadrature points to CUDA threads in hypi_setdevice.cu as it is); see the right figure on the next page.
#define NUM_DEVICE 2                        // # of GPU devices = # of OpenMP threads
...
// In main()
MPI_Comm_rank(MPI_COMM_WORLD,&myid);        // My MPI rank
MPI_Comm_size(MPI_COMM_WORLD,&nproc);       // # of MPI processes
omp_set_num_threads(NUM_DEVICE);            // One OpenMP thread per GPU device
nbin = NBIN/(nproc*NUM_DEVICE);             // # of bins per OpenMP thread
step = 1.0/(float)(nbin*nproc*NUM_DEVICE);
…
#pragma omp parallel private(list the variables that need private copies)
{
…
mpid = omp_get_thread_num();
offset = (NUM_DEVICE*myid+mpid)*step*nbin; // Quadrature-point offset
cudaSetDevice(mpid%2);
…
}
Make sure to list all variables that need private copies in the private clause for the omp parallel directive.
The above OpenMP multithreading will introduce a race condition for variable pi. This can be circumvented by data privatization, i.e., by defining float pid[NUM_DEVICE] and using the array elements as dedicated accumulators for the OpenMP threads (or GPU devices).
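As an illustration of this privatization pattern, the following serial C sketch emulates the decomposition with two ranks and two threads per rank (NPROC and the loops over myid/mpid are stand-ins for the actual MPI ranks and OpenMP threads, which run concurrently in the real program; the integrand 4/(1+x^2) is the standard quadrature for π):

```c
#include <math.h>

#define NBIN 1000000
#define NPROC 2       /* stand-in for the # of MPI ranks */
#define NUM_DEVICE 2  /* # of OpenMP threads (= GPU devices) per rank */

double compute_pi(void) {
  int nbin = NBIN/(NPROC*NUM_DEVICE);            /* bins per thread */
  double step = 1.0/(double)(nbin*NPROC*NUM_DEVICE);
  double pi = 0.0;
  for (int myid = 0; myid < NPROC; myid++) {     /* "MPI ranks" */
    double pid[NUM_DEVICE] = {0.0};              /* privatized accumulators */
    for (int mpid = 0; mpid < NUM_DEVICE; mpid++) {  /* "OpenMP threads" */
      double offset = (NUM_DEVICE*myid+mpid)*step*nbin;
      for (int i = 0; i < nbin; i++) {
        double x = offset + (i+0.5)*step;        /* midpoint quadrature */
        pid[mpid] += 4.0/(1.0+x*x)*step;
      }
    }
    /* reduce the per-thread partial sums; in the real program this is an
       OpenMP reduction over pid[] followed by MPI_Allreduce */
    for (int mpid = 0; mpid < NUM_DEVICE; mpid++) pi += pid[mpid];
  }
  return pi;
}
```

Because each "thread" writes only its own pid[mpid] element, there is no shared accumulator and hence no race; the partial sums are combined only after the parallel work finishes.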
To report which of the two GPUs have been used for the run, insert the following lines within the OpenMP parallel block:
cudaGetDevice(&dev_used);
printf("myid = %d; mpid = %d: device used = %d; partial pi = %f\n", myid,mpid,dev_used,pid[mpid]);
where int dev_used is the ID of the GPU device (0 or 1) that was used, myid is the MPI rank, mpid is the OpenMP thread ID, and pid[mpid] is the partial sum per OpenMP thread.
(Assignment)
myid = 0; mpid = 0: device used = 0; partial pi = 0.979926
myid = 0; mpid = 1: device used = 1; partial pi = 0.874671
myid = 1; mpid = 0: device used = 0; partial pi = 0.719409
myid = 1; mpid = 1: device used = 1; partial pi = 0.567582
PI = 3.141588
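Note that the four partial sums add up to the reported total, as expected from the privatized accumulation: 0.979926 + 0.874671 + 0.719409 + 0.567582 = 3.141588.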