Friday, September 9, 2011

MPI: Sendrecv

MPI provides a function MPI_Sendrecv that can be used to shift data along a group of processes. Within ESPResSo, this function has not been used so far. Instead, shifting data is done manually via calls to MPI_Send and MPI_Recv, which have to be interleaved correctly to avoid deadlocks.

Of course, doing it manually is far more complex and error-prone, but I also had the suspicion that it might be significantly slower, as there are plenty of possibilities to optimize code like that on the side of the MPI implementation.

Test program

To test whether there are differences in speed, I have written a little C program that shifts a float value along a chain of processes, and does so a million times. The following is an excerpt from the test program that shows the core routines:

/* Shift a float value along the chain with a single MPI_Sendrecv call. */
void do_sendrecv() {
  float send_data = rank;
  float recv_data;
  for (int i = 0; i < N; i++) {
    MPI_Sendrecv(&send_data, 1, MPI_FLOAT, next, 42,
                 &recv_data, 1, MPI_FLOAT, prev, 42,
                 comm, MPI_STATUS_IGNORE);
  }
}

/* Same shift, but send and receive through a single buffer. */
void do_sendrecv_replace() {
  float data = rank;
  for (int i = 0; i < N; i++) {
    MPI_Sendrecv_replace(&data, 1, MPI_FLOAT,
                         next, 42, prev, 42,
                         comm, MPI_STATUS_IGNORE);
  }
}

/* Manual shift: odd ranks send first, even ranks receive first,
   so that the blocking calls pair up and cannot deadlock. */
void do_manual() {
  float data = rank;
  for (int i = 0; i < N; i++) {
    if (rank % 2) {
      MPI_Send(&data, 1, MPI_FLOAT, next, 42, comm);
      MPI_Recv(&data, 1, MPI_FLOAT, prev, 42, comm, MPI_STATUS_IGNORE);
    } else {
      MPI_Recv(&data, 1, MPI_FLOAT, prev, 42, comm, MPI_STATUS_IGNORE);
      MPI_Send(&data, 1, MPI_FLOAT, next, 42, comm);
    }
  }
}
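
The routines above rely on a few globals (rank, size, next, prev, N, and comm) and on a timing harness that are defined elsewhere in the test program. The following is only a minimal sketch of how that surrounding code might look; the variable names follow the excerpt, but the value of N, the time_it helper, and the ring setup are my assumptions, not the original program:

#include <mpi.h>
#include <stdio.h>

#define N 1000000             /* number of shift iterations (assumed) */

MPI_Comm comm;                /* communicator used by the routines */
int rank, size;               /* own rank and number of processes */
int next, prev;               /* neighbours to send to / receive from */

/* ... do_sendrecv, do_sendrecv_replace, do_manual as shown above ... */

/* Hypothetical helper: time one routine, synchronized with barriers. */
double time_it(void (*routine)()) {
  MPI_Barrier(comm);
  double start = MPI_Wtime();
  routine();
  MPI_Barrier(comm);
  return MPI_Wtime() - start;
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  comm = MPI_COMM_WORLD;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  next = (rank + 1) % size;         /* right neighbour in the ring */
  prev = (rank + size - 1) % size;  /* left neighbour in the ring */

  double t = time_it(do_sendrecv);
  if (rank == 0)
    printf("sendrecv: %f s\n", t);
  /* ... repeat for do_sendrecv_replace and do_manual ... */

  MPI_Finalize();
  return 0;
}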

Results

4 Tasks, OpenMPI 1.4.3 with gcc 4.5.0 (-O3) on an Intel Core2 Quad CPU (2.8 GHz):
sendrecv:         1.487827 s (1.000000)
sendrecv_replace: 1.490810 s (1.002005)
manual:           2.062941 s (1.386547)
That is a difference of almost 40%! From these results it seems obvious to me that it is a very good idea to use MPI_Sendrecv whenever it is applicable, instead of interleaving the sends and receives manually!
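
For completeness, a benchmark like this can be compiled and run with OpenMPI roughly as follows (the file name sendrecv_test.c is my placeholder; only -O3 and the four tasks come from the setup above):

mpicc -O3 -std=gnu99 sendrecv_test.c -o sendrecv_test
mpirun -np 4 ./sendrecv_test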