fftw3: Advanced distributed-transpose interface
6.7.2 Advanced distributed-transpose interface
----------------------------------------------
The above routines are for a transpose of a matrix of numbers (of type
'double'), using FFTW's default block sizes. More generally, one can
perform transposes of _tuples_ of numbers, with user-specified block
sizes for the input and output:

     fftw_plan fftw_mpi_plan_many_transpose
                     (ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t howmany,
                      ptrdiff_t block0, ptrdiff_t block1,
                      double *in, double *out, MPI_Comm comm, unsigned flags);

In this case, one is transposing an 'n0' by 'n1' matrix of
'howmany'-tuples (e.g. 'howmany = 2' for complex numbers). The input
is distributed along the 'n0' dimension with block size 'block0', and
the 'n1' by 'n0' output is distributed along the 'n1' dimension with
block size 'block1'.  If 'FFTW_MPI_DEFAULT_BLOCK' (0) is passed for a
block size, then FFTW uses its default block size for that dimension.
To get the local size of the data on each process, call
'fftw_mpi_local_size_many_transposed' with the same 'howmany' and
block sizes.
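
To make the role of the block sizes concrete, the following is a minimal
sketch in plain C (no MPI or FFTW calls) of how a one-dimensional block
distribution assigns indices to processes.  The names 'block_range' and
'sketch_block_range' are hypothetical helpers and are not part of FFTW.
The sketch assumes that process 'rank' owns the contiguous indices
starting at 'rank * block', and that a block size of 0 stands for a
default of 'n / nprocs' rounded up; both are assumptions made for
illustration only.  In a real program, always take the local sizes and
offsets from 'fftw_mpi_local_size_many_transposed' rather than computing
them by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type (not part of FFTW): the contiguous range of
   indices along the distributed dimension owned by one process. */
typedef struct {
    ptrdiff_t start;  /* first index owned by this process */
    ptrdiff_t count;  /* number of indices owned; may be 0 */
} block_range;

/* Sketch of a 1d block distribution of n indices over nprocs processes.
   ASSUMPTION: process `rank` owns indices [rank*block, (rank+1)*block),
   clipped to n.
   ASSUMPTION: block == 0 stands for a default of ceil(n / nprocs). */
static block_range sketch_block_range(ptrdiff_t n, ptrdiff_t block,
                                      int nprocs, int rank)
{
    block_range r;
    ptrdiff_t start;

    assert(n >= 0 && block >= 0);
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);

    if (block == 0)
        block = (n + nprocs - 1) / nprocs;  /* assumed default block */

    start = (ptrdiff_t)rank * block;
    if (start >= n) {
        /* Trailing processes may own no data at all. */
        r.start = n;
        r.count = 0;
    } else {
        r.start = start;
        r.count = (n - start < block) ? (n - start) : block;
    }
    return r;
}
```

Under these assumptions, for 'n0 = 10', 'block0 = 4' and three
processes, the input rows are split as 4, 4 and 2.  The output side
works the same way, with 'n1' and 'block1' in place of 'n0' and
'block0'.  Each index in the range corresponds to 'n1 * howmany'
(input) or 'n0 * howmany' (output) consecutive 'double' values.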