6.7.1 Basic distributed-transpose interface
-------------------------------------------
Suppose that we have an 'n0' by 'n1' array in row-major order,
block-distributed across the 'n0' dimension. To transpose this into an
'n1' by 'n0' array block-distributed across the 'n1' dimension, we
would create a plan by calling the following function:
     fftw_plan fftw_mpi_plan_transpose(ptrdiff_t n0, ptrdiff_t n1,
                                       double *in, double *out,
                                       MPI_Comm comm, unsigned flags);
The input and output arrays ('in' and 'out') can be the same. The
transpose is actually executed by calling 'fftw_execute' on the plan, as
usual.
The 'flags' are the usual FFTW planner flags, but may additionally
include 'FFTW_MPI_TRANSPOSED_OUT' and/or 'FFTW_MPI_TRANSPOSED_IN'. For
transpose plans, these flags indicate that the output and/or input,
respectively, is _locally_ transposed. That is, on each process the
input data is normally stored as a
'local_n0' by 'n1' array in row-major order, but for an
'FFTW_MPI_TRANSPOSED_IN' plan the input data is stored as 'n1' by
'local_n0' in row-major order. Similarly, 'FFTW_MPI_TRANSPOSED_OUT'
means that the output is 'n0' by 'local_n1' instead of 'local_n1' by
'n0'.
To determine the local size of the array on each process before and
after the transpose, as well as the amount of storage that must be
allocated, one should call 'fftw_mpi_local_size_2d_transposed', just as
for a 2d DFT as described in the previous section:
     ptrdiff_t fftw_mpi_local_size_2d_transposed
                     (ptrdiff_t n0, ptrdiff_t n1, MPI_Comm comm,
                      ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                      ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
Again, the return value is the local storage to allocate, which in
this case is the number of _real_ ('double') values rather than complex
numbers as in the previous examples.
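Putting these pieces together, a typical usage sketch might look as follows (assuming FFTW is built with MPI support; sizes are illustrative and error handling is omitted):

```c
#include <fftw3-mpi.h>

int main(int argc, char **argv) {
    const ptrdiff_t n0 = 128, n1 = 256; /* illustrative sizes */
    ptrdiff_t local_n0, local_0_start, local_n1, local_1_start;

    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    /* How much local storage (in doubles) does this process need? */
    ptrdiff_t alloc_local = fftw_mpi_local_size_2d_transposed(
        n0, n1, MPI_COMM_WORLD,
        &local_n0, &local_0_start, &local_n1, &local_1_start);

    double *data = fftw_alloc_real(alloc_local);

    /* In-place transpose: in == out is allowed. */
    fftw_plan plan = fftw_mpi_plan_transpose(
        n0, n1, data, data, MPI_COMM_WORLD, FFTW_ESTIMATE);

    /* ... fill the local_n0 x n1 block of rows starting
       at global row local_0_start ... */

    fftw_execute(plan); /* data now holds a local_n1 x n0 block */

    fftw_destroy_plan(plan);
    fftw_free(data);
    fftw_mpi_cleanup();
    MPI_Finalize();
    return 0;
}
```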