 6.4 MPI Data Distribution
 =========================
 
 The most important concept to understand in using FFTW's MPI interface
 is the data distribution.  With a serial or multithreaded FFT, all of
 the inputs and outputs are stored as a single contiguous chunk of
 memory.  With a distributed-memory FFT, the inputs and outputs are
 broken into disjoint blocks, one per process.
 
    In particular, FFTW uses a _1d block distribution_ of the data,
 distributed along the _first dimension_.  For example, if you want to
 perform a 100 x 200 complex DFT, distributed over 4 processes, each
 process will get a 25 x 200 slice of the data.  That is, process 0 will
 get rows 0 through 24, process 1 will get rows 25 through 49, process 2
 will get rows 50 through 74, and process 3 will get rows 75 through 99.
 If you instead distribute the same array over 3 processes, the 100 rows
 are not evenly divisible, so the processes receive unequal chunks.
 FFTW's default choice in this case is to assign 34 rows to processes 0
 and 1, and 32 rows to process 2.
 
    FFTW provides several 'fftw_mpi_local_size' routines that you can
 call to find out what portion of an array is stored on the current
 process.  In most cases, you should use the default block sizes picked
 by FFTW, but it is also possible to specify your own block size.  For
 example, with a 100 x 200 array on three processes, you can tell FFTW to
 use a block size of 40, which would assign 40 rows to processes 0 and 1,
 and 20 rows to process 2.  FFTW's default is to divide the data equally
 among the processes if possible, and as best it can otherwise.  The rows
 are always assigned in "rank order," i.e.  process 0 gets the first
 block of rows, then process 1, and so on.  (You can change this by using
 'MPI_Comm_split' to create a new communicator with re-ordered
 processes.)  However, you should always call the 'fftw_mpi_local_size'
 routines, if possible, rather than trying to predict FFTW's distribution
 choices.
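 
    For example, a minimal sketch (not taken from the manual itself) that
 queries the distribution of the 100 x 200 transform above with
 'fftw_mpi_local_size_2d' might look like this:
 
      #include <stddef.h>
      #include <stdio.h>
      #include <mpi.h>
      #include <fftw3-mpi.h>
 
      int main(int argc, char **argv)
      {
          const ptrdiff_t N0 = 100, N1 = 200;
          ptrdiff_t alloc_local, local_n0, local_0_start;
          int rank;
 
          MPI_Init(&argc, &argv);
          fftw_mpi_init();
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 
          /* which rows of the N0 x N1 array does this process own, and
             how many complex values must it allocate? */
          alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                               &local_n0, &local_0_start);
 
          printf("process %d: rows %td..%td, allocate %td complex numbers\n",
                 rank, local_0_start, local_0_start + local_n0 - 1,
                 alloc_local);
 
          fftw_mpi_cleanup();
          MPI_Finalize();
          return 0;
      }
 
 On 4 processes this should report 25 rows per process (rows 0-24, 25-49,
 50-74, and 75-99), matching the example above.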
 
    In particular, it is critical that you allocate the storage size that
 is returned by 'fftw_mpi_local_size', which is _not_ necessarily the
 size of the local slice of the array.  The reason is that intermediate
 steps of FFTW's algorithms involve transposing the array and
 redistributing the data, so at these intermediate steps FFTW may require
 more local storage space (albeit always proportional to the total size
 divided by the number of processes).  The 'fftw_mpi_local_size'
 functions know how much storage is required for these intermediate steps
 and tell you the correct amount to allocate.
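 
    Concretely, continuing the sketch above (and reusing its 'N0', 'N1',
 and MPI/FFTW initialization), the allocation should use the return value
 of 'fftw_mpi_local_size_2d' rather than the size of the local slice:
 
      ptrdiff_t alloc_local, local_n0, local_0_start;
      fftw_complex *data;
 
      alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
                                           &local_n0, &local_0_start);
 
      /* allocate alloc_local complex numbers, NOT local_n0 * N1, which
         may be smaller than what FFTW needs for intermediate steps */
      data = fftw_alloc_complex(alloc_local);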
 
