fftw3: Transposed distributions
6.4.3 Transposed distributions
------------------------------
Internally, FFTW's MPI transform algorithms work by first computing
transforms of the data local to each process, then by globally
_transposing_ the data in some fashion to redistribute the data among
the processes, transforming the new data local to each process, and
transposing back. For example, a two-dimensional 'n0' by 'n1' array,
distributed across the 'n0' dimension, is transformed by: (i)
transforming the 'n1' dimension, which is local to each process; (ii)
transposing to an 'n1' by 'n0' array, distributed across the 'n1'
dimension; (iii) transforming the 'n0' dimension, which is now local to
each process; (iv) transposing back.
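
The four steps above can be sketched serially, without MPI, to show why
skipping step (iv) leaves the result in 'n1' by 'n0' order. This is only
an illustrative model, not FFTW's implementation: the names 'dft_row',
'transpose' and 'dft_2d_sketch' are hypothetical, the "global transpose"
is an ordinary in-memory transpose, and the stand-in 1d transform only
handles lengths 1, 2 and 4 so that it stays exact.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Naive in-place 1d DFT of one contiguous row (stand-in for a local FFT).
 * Only lengths 1, 2 and 4 are supported, because their twiddle factors
 * exp(-2*pi*i*j*k/n) are all exact powers of -i. */
static void dft_row(double complex *row, size_t n)
{
    const double complex tw[4] = { 1.0, -I, -1.0, I };  /* (-i)^0 .. (-i)^3 */
    double complex tmp[4];
    assert(n == 1 || n == 2 || n == 4);
    for (size_t k = 0; k < n; ++k) {
        tmp[k] = 0.0;
        for (size_t j = 0; j < n; ++j)
            tmp[k] += row[j] * tw[(j * k * (4 / n)) % 4];
    }
    memcpy(row, tmp, n * sizeof *row);
}

/* Out-of-place transpose: src is rows x cols, dst becomes cols x rows. */
static void transpose(const double complex *src, double complex *dst,
                      size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            dst[j * rows + i] = src[i * cols + j];
}

/* 2d DFT of a row-major n0 x n1 array, following steps (i)-(iv).
 * If transposed_out is nonzero, step (iv) is skipped and data is left
 * holding the result in n1 x n0 order. */
static void dft_2d_sketch(double complex *data, size_t n0, size_t n1,
                          int transposed_out)
{
    double complex *work = malloc(n0 * n1 * sizeof *work);
    assert(work != NULL);
    for (size_t i = 0; i < n0; ++i)            /* (i) transform n1 */
        dft_row(data + i * n1, n1);
    transpose(data, work, n0, n1);             /* (ii) now n1 x n0 */
    for (size_t i = 0; i < n1; ++i)            /* (iii) transform n0 */
        dft_row(work + i * n0, n0);
    if (transposed_out)
        memcpy(data, work, n0 * n1 * sizeof *data);  /* keep n1 x n0 */
    else
        transpose(work, data, n1, n0);         /* (iv) back to n0 x n1 */
    free(work);
}
```

In the distributed case each transpose is a global communication step,
so the branch that skips step (iv) saves one such step per transform.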
However, in many applications it is acceptable to compute a
multidimensional DFT whose results are produced in transposed order
(e.g., 'n1' by 'n0' in two dimensions). This provides a significant
performance advantage, because it means that the final transposition
step can be omitted. FFTW supports this optimization, which you specify
by passing the flag 'FFTW_MPI_TRANSPOSED_OUT' to the planner routines.
To compute the inverse transform of transposed output, you pass
'FFTW_MPI_TRANSPOSED_IN' to tell the planner that the input is transposed. In
this section, we explain how to interpret the output format of such a
transform.
Suppose you are transforming multi-dimensional data with (at
least two) dimensions n[0] x n[1] x n[2] x ... x n[d-1] . As always,
it is distributed along the first dimension n[0] . Now, if we compute
its DFT with the 'FFTW_MPI_TRANSPOSED_OUT' flag, the resulting output
data are stored with the first _two_ dimensions transposed: n[1] x n[0]
x n[2] x ... x n[d-1] , distributed along the n[1] dimension.
Conversely, if we take the n[1] x n[0] x n[2] x ... x n[d-1] data and
transform it with the 'FFTW_MPI_TRANSPOSED_IN' flag, then the format
goes back to the original n[0] x n[1] x n[2] x ... x n[d-1] array.
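
To make the two layouts concrete, here is a minimal sketch of the
local offset of a logical element (i0, i1, i2) for d = 3. It assumes
row-major complex data with no padding, and that each process owns a
contiguous range of slabs along the distributed dimension. The function
names and the 'first'/'count' parameters are hypothetical, introduced
only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Offset within the local block for the non-transposed n0 x n1 x n2
 * layout, where this process owns i0 in [first0, first0 + count0). */
static ptrdiff_t offset_natural(ptrdiff_t i0, ptrdiff_t i1, ptrdiff_t i2,
                                ptrdiff_t n1, ptrdiff_t n2,
                                ptrdiff_t first0, ptrdiff_t count0)
{
    assert(i0 >= first0 && i0 < first0 + count0);
    assert(i1 >= 0 && i1 < n1 && i2 >= 0 && i2 < n2);
    return ((i0 - first0) * n1 + i1) * n2 + i2;
}

/* Offset within the local block for the transposed n1 x n0 x n2
 * layout, where this process owns i1 in [first1, first1 + count1).
 * Only the first two dimensions swap roles; n2 stays innermost. */
static ptrdiff_t offset_transposed(ptrdiff_t i0, ptrdiff_t i1, ptrdiff_t i2,
                                   ptrdiff_t n0, ptrdiff_t n2,
                                   ptrdiff_t first1, ptrdiff_t count1)
{
    assert(i1 >= first1 && i1 < first1 + count1);
    assert(i0 >= 0 && i0 < n0 && i2 >= 0 && i2 < n2);
    return ((i1 - first1) * n0 + i0) * n2 + i2;
}
```

For example, with a 4 x 6 x 2 array on a single process, element
(1, 0, 0) sits at offset 12 in the natural layout but at offset 2 in
the transposed one.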
There are two ways to find the portion of the transposed array that
resides on the current process. First, you can simply call the
appropriate 'local_size' function, passing n[1] x n[0] x n[2] x ... x
n[d-1] (the transposed dimensions). This would mean calling the
'local_size' function twice, once for the transposed and once for the
non-transposed dimensions. Alternatively, you can call one of the
'local_size_transposed' functions, which return both the non-transposed
and transposed data distributions from a single call. For example, for a
3d transform with transposed output (or input), you might call:
     ptrdiff_t fftw_mpi_local_size_3d_transposed(
         ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm,
         ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
         ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
Here, 'local_n0' and 'local_0_start' give the size and starting index
of the 'n0' dimension for the _non_-transposed data, as in the previous
sections. For _transposed_ data (e.g. the output for
'FFTW_MPI_TRANSPOSED_OUT'), 'local_n1' and 'local_1_start' give the size
and starting index of the 'n1' dimension, which is the first dimension
of the transposed data ('n1' by 'n0' by 'n2').
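
The relationship between the four output values can be illustrated with
a serial model. The sketch below is not the library routine: it assumes
a simple ceil(n / P) block partition along each distributed dimension,
and the names 'ceil_block' and 'model_local_size_3d_transposed' are
hypothetical. In a real program the values must come from the
'local_size' call itself, whose return value may also be larger than
the one modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: each rank gets a block of ceil(n / n_procs) indices until
 * the dimension is exhausted; later ranks may get fewer or none. For an
 * empty rank, local_start is set to n here (a choice of this sketch). */
static void ceil_block(ptrdiff_t n, int n_procs, int rank,
                       ptrdiff_t *local_n, ptrdiff_t *local_start)
{
    assert(n >= 0 && n_procs > 0 && rank >= 0 && rank < n_procs);
    ptrdiff_t block = (n + n_procs - 1) / n_procs;
    ptrdiff_t start = block * rank;
    if (start > n)
        start = n;
    *local_start = start;
    *local_n = (n - start < block) ? n - start : block;
}

/* Hypothetical serial stand-in that mirrors the output parameters of
 * the 3d transposed local-size call for one rank. Returns the number of
 * complex elements needed to hold the larger of the two local layouts. */
static ptrdiff_t model_local_size_3d_transposed(
    ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, int n_procs, int rank,
    ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
    ptrdiff_t *local_n1, ptrdiff_t *local_1_start)
{
    /* Non-transposed data: n0 x n1 x n2, split along n0. */
    ceil_block(n0, n_procs, rank, local_n0, local_0_start);
    /* Transposed data: n1 x n0 x n2, split along n1. */
    ceil_block(n1, n_procs, rank, local_n1, local_1_start);
    ptrdiff_t natural = *local_n0 * n1 * n2;
    ptrdiff_t transposed = *local_n1 * n0 * n2;
    return natural > transposed ? natural : transposed;
}
```

Note that the two local blocks generally differ in size, which is why a
buffer shared by input and output should be sized for the larger one.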
(Note that 'FFTW_MPI_TRANSPOSED_IN' is completely equivalent to
using 'FFTW_MPI_TRANSPOSED_OUT' and passing the first two
dimensions to the planner in reverse order, or vice versa. If you pass
_both_ the 'FFTW_MPI_TRANSPOSED_IN' and 'FFTW_MPI_TRANSPOSED_OUT' flags,
it is equivalent to swapping the first two dimensions passed to the
planner and passing _neither_ flag.)
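
The equivalences in this note can be checked with a small bookkeeping
sketch that tracks only the order of the first two dimensions of the
input and output. The flag constants and the 'layout_2d' type below are
hypothetical stand-ins, not FFTW's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two planner flags (not FFTW's values). */
enum { SKETCH_TRANSPOSED_IN = 1, SKETCH_TRANSPOSED_OUT = 2 };

/* Order of the first two dimensions as stored for input and output. */
typedef struct {
    ptrdiff_t in0, in1;    /* input is stored as in0 x in1 */
    ptrdiff_t out0, out1;  /* output is stored as out0 x out1 */
} layout_2d;

/* Given the dimensions passed to the planner and the flags, report how
 * the input and output arrays are laid out. */
static layout_2d effective_layout(ptrdiff_t n0, ptrdiff_t n1, unsigned flags)
{
    layout_2d l;
    int tin = (flags & SKETCH_TRANSPOSED_IN) != 0;
    int tout = (flags & SKETCH_TRANSPOSED_OUT) != 0;
    l.in0 = tin ? n1 : n0;
    l.in1 = tin ? n0 : n1;
    l.out0 = tout ? n1 : n0;
    l.out1 = tout ? n0 : n1;
    return l;
}

/* Two plans are interchangeable in this model if both layouts agree. */
static int same_layout(layout_2d a, layout_2d b)
{
    return a.in0 == b.in0 && a.in1 == b.in1 &&
           a.out0 == b.out0 && a.out1 == b.out1;
}
```

In this model, dimensions (4, 6) with the input flag give the same
layouts as (6, 4) with the output flag, and (4, 6) with both flags
matches (6, 4) with neither.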