fftw3: FFTW MPI Performance Tips

 
 6.10 FFTW MPI Performance Tips
 ==============================
 
 In this section, we collect a few tips on getting the best performance
 out of FFTW's MPI transforms.
 
    First, because of the 1d block distribution, FFTW's parallelization
 is currently limited by the size of the first dimension.
 (Multidimensional block distributions may be supported by a future
 version.)  More generally, you should ideally arrange the dimensions so
 that FFTW can divide them equally among the processes.  See Load
 balancing.
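
    To make the load-balancing point concrete, the following sketch
 estimates how many rows of the first dimension each process would own.
 It assumes a default block size of ceil(n0 / P) for P processes, which
 is our reading of the Load balancing section; the helper names
 'default_block' and 'local_rows' are hypothetical and not part of FFTW.
 In real code, the authoritative values come from the
 'fftw_mpi_local_size' family of functions.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: the default block size is ceil(n0 / nprocs).
   These helpers are illustrative only; they are not FFTW functions. */
ptrdiff_t default_block(ptrdiff_t n0, int nprocs)
{
    assert(n0 >= 0 && nprocs > 0);
    return (n0 + nprocs - 1) / nprocs;   /* integer ceiling */
}

/* Number of first-dimension rows owned by process `rank`
   under the assumed 1d block distribution. */
ptrdiff_t local_rows(ptrdiff_t n0, int nprocs, int rank)
{
    ptrdiff_t block = default_block(n0, nprocs);
    ptrdiff_t start = block * rank;      /* first row owned by this rank */
    if (start >= n0)
        return 0;                        /* this process receives no data */
    ptrdiff_t rest = n0 - start;
    return rest < block ? rest : block;  /* the last non-empty block may be short */
}
```

    Under this assumption, n0 = 100 on 10 processes gives every process
 10 rows, whereas n0 = 100 on 16 processes gives a block size of 7, so
 process 14 holds only 2 rows and process 15 holds none.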
 
    Second, if it is not too inconvenient, you should consider working
 with transposed output for multidimensional plans, as this saves a
 considerable amount of communication.  See Transposed
 distributions.
 
    Third, the fastest choices are generally either an in-place transform
 or an out-of-place transform with the 'FFTW_DESTROY_INPUT' flag (which
 allows the input array to be used as scratch space).  In-place is
 especially beneficial if the amount of data per process is large.
 
    Fourth, if you have multiple arrays to transform at once, rather than
 calling FFTW's MPI transforms several times, it is usually faster to
 interleave the data and use the advanced interface.  (This
 groups the communications together instead of requiring separate
 messages for each transform.)
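
    As a rough illustration of the interleaved layout, the sketch below
 packs 'howmany' separate arrays into one buffer in which the
 corresponding elements of all arrays are adjacent.  The helper names
 'interleaved_index' and 'interleave' are hypothetical, and real-valued
 data is used only for brevity.  A buffer laid out this way is what you
 would then hand to an advanced-interface planner such as
 'fftw_mpi_plan_many_dft' together with the same 'howmany' value.

```c
#include <assert.h>
#include <stddef.h>

/* Position of element `i` (row-major flattened index) of array `k`
   inside the interleaved buffer.  Illustrative helper, not an FFTW call. */
size_t interleaved_index(size_t i, size_t k, size_t howmany)
{
    assert(k < howmany);
    return i * howmany + k;
}

/* Pack `howmany` arrays of `n` elements each into `out`, which must
   have room for n * howmany elements. */
void interleave(const double *const *arrays, size_t howmany, size_t n,
                double *out)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t k = 0; k < howmany; ++k)
            out[interleaved_index(i, k, howmany)] = arrays[k][i];
}
```

    For two arrays {1, 2, 3} and {10, 20, 30}, the packed buffer is
 {1, 10, 2, 20, 3, 30}.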