 6.7.3 An improved replacement for MPI_Alltoall
 ----------------------------------------------
 
We close this section by noting that FFTW's MPI transpose routines can
 be thought of as a generalization of the 'MPI_Alltoall' function
 (albeit only for floating-point types), and in some circumstances they
 can serve as an improved replacement.
 
    'MPI_Alltoall' is defined by the MPI standard as:
 
      int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                       void *recvbuf, int recvcnt, MPI_Datatype recvtype,
                       MPI_Comm comm);
 
    In particular, for 'double*' arrays 'in' and 'out', consider the
 call:
 
      MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm);
 
    This is completely equivalent to:
 
      MPI_Comm_size(comm, &P);
      plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE);
      fftw_execute(plan);
      fftw_destroy_plan(plan);
 
    That is, computing a P x P transpose on 'P' processes, with a block
 size of 1, is just a standard all-to-all communication.
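    As a rough end-to-end illustration, the following sketch embeds the
 snippet above in a minimal MPI program.  The value of 'howmany', the
 buffer names, and the use of 'MPI_COMM_WORLD' are illustrative
 assumptions, and all error checking is omitted:

      #include <mpi.h>
      #include <fftw3-mpi.h>

      int main(int argc, char **argv)
      {
          const ptrdiff_t howmany = 128;  /* doubles exchanged with each process (assumed) */
          int P;

          MPI_Init(&argc, &argv);
          fftw_mpi_init();
          MPI_Comm_size(MPI_COMM_WORLD, &P);

          /* with block size 1, each of the P processes owns one row of the
             P x P matrix, i.e. P blocks of 'howmany' doubles */
          double *in  = fftw_alloc_real((size_t) P * howmany);
          double *out = fftw_alloc_real((size_t) P * howmany);

          /* ... fill 'in' with the data destined for each process ... */
          /* (FFTW_ESTIMATE does not overwrite the arrays during planning) */

          fftw_plan plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1,
                                                        in, out, MPI_COMM_WORLD,
                                                        FFTW_ESTIMATE);
          fftw_execute(plan);  /* equivalent to the MPI_Alltoall call above */
          fftw_destroy_plan(plan);

          fftw_free(in);
          fftw_free(out);
          fftw_mpi_cleanup();
          MPI_Finalize();
          return 0;
      }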
 
    However, using the FFTW routine instead of 'MPI_Alltoall' may have
 certain advantages.  First of all, FFTW's routine can operate in-place
 ('in == out') whereas 'MPI_Alltoall' can only operate out-of-place.
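    For example (a sketch, with the buffer name and size chosen for
 illustration, following the running example above), an in-place
 exchange is requested simply by passing the same pointer for both
 arrays:

      double *buf = fftw_alloc_real((size_t) P * howmany);
      /* ... fill 'buf' with the data destined for each process ... */

      /* in == out requests an in-place transpose, which has no
         counterpart in MPI_Alltoall */
      plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1,
                                          buf, buf, comm, FFTW_ESTIMATE);
      fftw_execute(plan);
      fftw_destroy_plan(plan);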
 
    Second, even for out-of-place plans, FFTW's routine may be faster,
 especially if you need to perform the all-to-all communication many
 times and can afford to use 'FFTW_MEASURE' or 'FFTW_PATIENT'.  It should
 certainly be no slower, not including the time to create the plan, since
 one of the possible algorithms that FFTW uses for an out-of-place
 transpose _is_ simply to call 'MPI_Alltoall'.  However, FFTW also
 considers several other possible algorithms that, depending on your MPI
 implementation and your hardware, may be faster.
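    As a sketch of that usage pattern (again reusing the names from the
 running example; 'nsteps' is an assumed loop bound), the plan is
 created once with 'FFTW_MEASURE' and then reused for every exchange:

      /* FFTW_MEASURE (and FFTW_PATIENT) time several candidate
         algorithms, overwriting 'in' and 'out' in the process, so plan
         before filling the arrays with real data */
      plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1,
                                          in, out, comm, FFTW_MEASURE);

      for (int step = 0; step < nsteps; ++step) {
          /* ... fill 'in' with this step's data ... */
          fftw_execute(plan);  /* one all-to-all exchange per execution */
          /* ... use the exchanged data in 'out' ... */
      }

      fftw_destroy_plan(plan);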