6.7.3 An improved replacement for MPI_Alltoall
----------------------------------------------
We close this section by noting that FFTW's MPI transpose routines can
be thought of as a generalization of the 'MPI_Alltoall' function
(albeit only for floating-point types), and in some circumstances can
serve as an improved replacement.
'MPI_Alltoall' is defined by the MPI standard as:
int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 MPI_Comm comm);
In particular, for 'double*' arrays 'in' and 'out', consider the
call:
MPI_Alltoall(in, howmany, MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm);
This is completely equivalent to:
MPI_Comm_size(comm, &P);
plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1, in, out, comm, FFTW_ESTIMATE);
fftw_execute(plan);
fftw_destroy_plan(plan);
That is, computing a P x P transpose on 'P' processes, with a block
size of 1, is just a standard all-to-all communication.
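For concreteness, the following self-contained program (an
illustrative sketch, not taken from the FFTW distribution; the value
of 'howmany' and the test data are arbitrary placeholders) performs
the exchange with FFTW.  It can be built with something like 'mpicc
alltoall.c -lfftw3_mpi -lfftw3 -lm'.  Note that the plan is created
before the input is initialized, since planning with flags other than
'FFTW_ESTIMATE' can overwrite the arrays:

#include <mpi.h>
#include <fftw3-mpi.h>

int main(int argc, char **argv)
{
    const ptrdiff_t howmany = 4; /* doubles per process pair (placeholder) */
    int P, rank;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* With a P x P transpose and block size 1, each process owns one
       row: P contiguous chunks of 'howmany' doubles, one per process. */
    double *in = fftw_alloc_real(P * howmany);
    double *out = fftw_alloc_real(P * howmany);

    /* Plan first: planning with flags other than FFTW_ESTIMATE may
       overwrite the arrays while timing candidate algorithms. */
    fftw_plan plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1,
                                                  in, out, MPI_COMM_WORLD,
                                                  FFTW_ESTIMATE);

    for (ptrdiff_t i = 0; i < P * howmany; ++i)
        in[i] = rank * 1000.0 + i; /* arbitrary test data */

    fftw_execute(plan); /* same data movement as MPI_Alltoall(in, howmany,
                           MPI_DOUBLE, out, howmany, MPI_DOUBLE, comm) */

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    fftw_mpi_cleanup();
    MPI_Finalize();
    return 0;
}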
However, using the FFTW routine instead of 'MPI_Alltoall' may have
certain advantages. First of all, FFTW's routine can operate in-place
('in == out') whereas 'MPI_Alltoall' can only operate out-of-place.
Second, even for out-of-place plans, FFTW's routine may be faster,
especially if you need to perform the all-to-all communication many
times and can afford to use 'FFTW_MEASURE' or 'FFTW_PATIENT'.  It
should certainly be no slower (not counting the time to create the
plan), since one of the possible algorithms that FFTW uses for an
out-of-place transpose _is_ simply to call 'MPI_Alltoall'.  However,
FFTW also considers several other possible algorithms that, depending
on your MPI implementation and your hardware, may be faster.
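To make these advantages concrete, here is a short sketch (again
illustrative rather than normative; the loop count and data handling
are placeholders, and 'P' and 'howmany' are assumed to be defined as
in the previous example) of the in-place variant, where a single
buffer serves as both source and destination and a plan created once
with 'FFTW_MEASURE' is amortized over many exchanges:

/* Assumes MPI and FFTW's MPI interface are already initialized, and
   that P and howmany are defined as in the example above. */
double *buf = fftw_alloc_real(P * howmany);

/* Plan once, up front; FFTW_MEASURE may overwrite buf while timing
   candidate algorithms, so fill in the data only afterwards. */
fftw_plan plan = fftw_mpi_plan_many_transpose(P, P, howmany, 1, 1,
                                              buf, buf, /* in == out */
                                              MPI_COMM_WORLD, FFTW_MEASURE);

for (int iter = 0; iter < 1000; ++iter) { /* placeholder loop count */
    /* ... fill buf with this iteration's outgoing data ... */
    fftw_execute(plan); /* in-place all-to-all exchange */
    /* ... use the received data now in buf ... */
}

fftw_destroy_plan(plan);
fftw_free(buf);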