Distributed-memory computing and GPU computing are two different parallel programming models. In this section, you will learn how to combine these two models, which can reduce your running time. As always, we will look at the Hello World program, this time in a hybrid CUDA and MPI environment. To combine CUDA and MPI, we compile the two source files separately and then link the resulting object files into a single executable. Let’s look at the Hello World program below.
CUDA program
#include <stdio.h>
#include <cuda.h>
/* kernel function for GPU */
__global__ void kernel(void) {
}
extern "C" void hello(void) {
kernel<<<1, 1>>>();
printf("Hello World !\n");
}
MPI program integrated with CUDA
#include <mpi.h>
#include <stdio.h>
#define MAX 80 /* maximum characters for naming the node */
/* Declaring the CUDA function */
void hello();
int main(int argc, char *argv[]) {
int rank, size, len;
char name[MAX]; /* char array for storing the name of each node */
/* Initializing the MPI execution environment */
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Get_processor_name(name, &len);
/* Call CUDA function */
hello();
/* Print the rank, size, and name of each node */
printf("I am %d of %d on %s\n", rank, size, name);
/*Terminating the MPI environment*/
MPI_Finalize();
return 0;
}
The most common way of compiling a heterogeneous MPI and CUDA program is:
Make a CUDA object from the CUDA program. This can be done by entering this command on the terminal:
nvcc -c cuda.cu -o cuda.o
Make an MPI object from the MPI program. This can be done by entering this command on the terminal:
mpicc -c mpi.c -o mpi.o
Make an executable file from both objects. This can be done by entering this command on the terminal:
mpicc -o cudampi mpi.o cuda.o -L/usr/local/cuda/lib64 -lcudart
To execute the executable file, cudampi, we can enter the following command on the terminal:
mpirun -machinefile machines -x LD_LIBRARY_PATH -np #processes ./cudampi
We use -x to export the environment variables to the remote nodes before executing the program.
- In order to time a heterogeneous CUDA and MPI program, you just need to use the MPI_Wtime() function, exactly as in a pure MPI program.
- We need to keep in mind that a heterogeneous CUDA and MPI program theoretically has a lower running time than a pure MPI program does; however, the actual running time also depends on each node’s properties, such as memory. Copying data between the host (CPU) and the device (GPU) can take a long time, which may make the heterogeneous program run much longer overall. Therefore, you do not always get a benefit from the heterogeneous programming model.
In this activity, we are going to compute vector addition by using the hybrid CUDA and MPI programming model. Vector addition is very simple: suppose we have vector A and vector B, both of the same length. To add A and B, we just add the corresponding elements of A and B, which results in a new vector of the same length.