Info: (fftw3) FFTW MPI Performance Tips

Info Catalog

fftw3: Avoiding MPI Deadlocks

fftw3: Distributed-memory FFTW with MPI

fftw3: Combining MPI and Threads

fftw3: FFTW MPI Performance Tips

 
 6.10 FFTW MPI Performance Tips
 ==============================
 
 In this section, we collect a few tips on getting the best performance
 out of FFTW's MPI transforms.
 
    First, because of the 1d block distribution, FFTW's parallelization
 is currently limited by the size of the first dimension: processes
 beyond the number of rows in that dimension receive no data.
 (Multidimensional block distributions may be supported by a future
 version.)  More generally, you should ideally arrange the dimensions so
 that FFTW can divide them equally among the processes.  See Load
 balancing.
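
    As a rough illustration of why equal divisibility matters, the
 following sketch counts the first-dimension rows each process would own.
 It assumes a default block size of ceil(n0/P) for P processes; the
 function name 'local_rows' is hypothetical and not part of FFTW.  In
 real code, rely on the values returned by FFTW's local-size routines
 rather than on this arithmetic.

```c
#include <stddef.h>

/* Rows of the first dimension owned by `rank` when n0 rows are split
 * into contiguous blocks of ceil(n0 / nprocs) rows.  The block size is
 * an assumption about the default distribution, used for illustration. */
ptrdiff_t local_rows(ptrdiff_t n0, int nprocs, int rank)
{
    ptrdiff_t block = (n0 + nprocs - 1) / nprocs;  /* ceil(n0 / nprocs) */
    ptrdiff_t start = block * rank;                /* first row owned */

    if (start >= n0)
        return 0;                                  /* process holds no data */
    return (n0 - start < block) ? n0 - start : block;
}
```

    Under that assumption, 100 rows on 16 processes give 7 rows to each
 of ranks 0 through 13, 2 rows to rank 14, and none to rank 15, whereas
 128 rows give every rank exactly 8.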
 
    Second, if it is not too inconvenient, you should consider working
 with transposed output for multidimensional plans, as this saves a
 considerable amount of communication.  See Transposed
 distributions.
 
    Third, the fastest choices are generally either an in-place transform
 or an out-of-place transform with the 'FFTW_DESTROY_INPUT' flag (which
 allows the input array to be used as scratch space).  In-place is
 especially beneficial if the amount of data per process is large.
 
    Fourth, if you have multiple arrays to transform at once, then rather
 than calling FFTW's MPI transforms several times, it is usually faster
 to interleave the data and use the advanced interface.  (This groups the
 communications together instead of requiring separate messages for each
 transform.)
