fftw3: Advanced distributed-transpose interface

 
 6.7.2 Advanced distributed-transpose interface
 ----------------------------------------------
 
 The above routines are for a transpose of a matrix of numbers (of type
 'double'), using FFTW's default block sizes.  More generally, one can
 perform transposes of _tuples_ of numbers, with user-specified block
 sizes for the input and output:
 
      fftw_plan fftw_mpi_plan_many_transpose
                      (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
                       ptrdiff_t block0, ptrdiff_t block1,
                       double *in, double *out, MPI_Comm comm, unsigned flags);
 
    In this case, one is transposing an 'n0' by 'n1' matrix of
 'howmany'-tuples (e.g.  'howmany = 2' for complex numbers).  The input
 is distributed along the 'n0' dimension with block size 'block0', and
 the 'n1' by 'n0' output is distributed along the 'n1' dimension with
 block size 'block1'.  If 'FFTW_MPI_DEFAULT_BLOCK' (0) is passed for a
 block size, then FFTW uses its default block size.  To get the local
 size of the data on each process, call
 'fftw_mpi_local_size_many_transposed' with the same 'howmany' and block
 sizes.
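
    As a rough illustration of the layout described above, the following
 serial C sketch shows (a) which rows a process would own under a plain
 block distribution and (b) what transposing a matrix of
 'howmany'-tuples means for the element ordering.  Both 'block_range'
 and 'transpose_tuples' are hypothetical helpers written for this
 example; they are not part of FFTW.  The sketch assumes that blocks are
 assigned to ranks in increasing order and that trailing ranks may own
 nothing.  In a real program the authoritative local sizes come from
 'fftw_mpi_local_size_many_transposed', not from this arithmetic.

```c
#include <stddef.h>

/* Hypothetical helper (not an FFTW routine): given n rows dealt out in
   contiguous chunks of `block` rows, lowest rank first, report how many
   rows `rank` owns and where they start.  Assumes block > 0. */
void block_range(ptrdiff_t n, ptrdiff_t block, int rank,
                 ptrdiff_t *local_n, ptrdiff_t *local_start)
{
    ptrdiff_t start = (ptrdiff_t)rank * block;

    if (start >= n) {
        /* Ranks past the end of the data own no rows; the start value
           of 0 is an arbitrary choice made by this sketch. */
        *local_n = 0;
        *local_start = 0;
        return;
    }
    *local_start = start;
    /* The last non-empty rank may get a short block. */
    *local_n = (n - start < block) ? (n - start) : block;
}

/* Hypothetical serial reference for the data movement: `in` is an
   n0 x n1 row-major matrix of howmany-tuples, `out` is the n1 x n0
   transpose.  Each tuple is moved as a unit, so the order of the
   howmany values inside a tuple is preserved. */
void transpose_tuples(ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
                      const double *in, double *out)
{
    for (ptrdiff_t i = 0; i < n0; ++i)
        for (ptrdiff_t j = 0; j < n1; ++j)
            for (ptrdiff_t k = 0; k < howmany; ++k)
                out[(j * n0 + i) * howmany + k] =
                    in[(i * n1 + j) * howmany + k];
}
```

    For example, with 'n0 = 10' and 'block0 = 4', this sketch gives ranks
 0, 1 and 2 the row counts 4, 4 and 2, and any further rank gets none.
 With 'howmany = 2' each tuple can hold the real and imaginary parts of
 one complex number, which stay adjacent after the transpose.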