fftw3: Transposed distributions

 
 6.4.3 Transposed distributions
 ------------------------------
 
 Internally, FFTW's MPI transform algorithms work by first computing
 transforms of the data local to each process, then by globally
 _transposing_ the data in some fashion to redistribute the data among
 the processes, transforming the new data local to each process, and
 transposing back.  For example, a two-dimensional 'n0' by 'n1' array,
 distributed across the 'n0' dimension, is transformed by: (i)
 transforming the 'n1' dimension, which is local to each process; (ii)
 transposing to an 'n1' by 'n0' array, distributed across the 'n1'
 dimension; (iii) transforming the 'n0' dimension, which is now local to
 each process; (iv) transposing back.
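 
    The four steps can be sketched on a single process, with ordinary
 in-memory transposes standing in for the global MPI transpositions.
 This is only an illustration under stated assumptions, not FFTW's
 implementation: the names 'dft2_row', 'transform_rows', 'transpose'
 and 'dft2d_four_steps' are hypothetical, and a length-2 butterfly
 stands in for a real one-dimensional transform.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Length-2 DFT (a single butterfly).  Hypothetical stand-in for a real
   one-dimensional transform; it only accepts n == 2. */
void dft2_row(double complex *row, size_t n)
{
    assert(n == 2);
    double complex a = row[0], b = row[1];
    row[0] = a + b;
    row[1] = a - b;
}

/* Apply a 1d transform to every contiguous row of a rows x cols array. */
void transform_rows(double complex *a, size_t rows, size_t cols,
                    void (*row_dft)(double complex *, size_t))
{
    for (size_t r = 0; r < rows; ++r)
        row_dft(a + r * cols, cols);
}

/* out (cols x rows) = transpose of in (rows x cols), both row-major. */
void transpose(const double complex *in, double complex *out,
               size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            out[c * rows + r] = in[r * cols + c];
}

/* 2d DFT of the n0 x n1 array 'a' via steps (i)-(iv).
   'tmp' is scratch space holding n0 * n1 elements. */
void dft2d_four_steps(double complex *a, double complex *tmp,
                      size_t n0, size_t n1,
                      void (*row_dft)(double complex *, size_t))
{
    transform_rows(a, n0, n1, row_dft);   /* (i)   n1 is contiguous   */
    transpose(a, tmp, n0, n1);            /* (ii)  tmp is n1 x n0     */
    transform_rows(tmp, n1, n0, row_dft); /* (iii) n0 is contiguous   */
    transpose(tmp, a, n1, n0);            /* (iv)  back to n0 x n1    */
}
```

    For the 2 by 2 input {1, 2, 3, 4}, the call leaves {10, -2, -4, 0}
 in 'a'.  The scratch array still holds the same values in 'n1' by 'n0'
 order, {10, -4, -2, 0}, which is the state reached after step (iii).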
 
    However, in many applications it is acceptable to compute a
 multidimensional DFT whose results are produced in transposed order
 (e.g., 'n1' by 'n0' in two dimensions).  This provides a significant
 performance advantage, because it means that the final transposition
 step can be omitted.  FFTW supports this optimization, which you specify
 by passing the flag 'FFTW_MPI_TRANSPOSED_OUT' to the planner routines.
 To compute the inverse transform of transposed output, you specify
 'FFTW_MPI_TRANSPOSED_IN' to tell the planner that the input is
 transposed.  In
 this section, we explain how to interpret the output format of such a
 transform.
 
    Suppose you are transforming multi-dimensional data with (at
 least two) dimensions n[0] x n[1] x n[2] x ...  x n[d-1] .  As always,
 it is distributed along the first dimension n[0] .  Now, if we compute
 its DFT with the 'FFTW_MPI_TRANSPOSED_OUT' flag, the resulting output
 data are stored with the first _two_ dimensions transposed: n[1] x n[0]
 x n[2] x ...  x n[d-1] , distributed along the n[1] dimension.
 Conversely, if we take the n[1] x n[0] x n[2] x ...  x n[d-1] data and
 transform it with the 'FFTW_MPI_TRANSPOSED_IN' flag, then the format
 goes back to the original n[0] x n[1] x n[2] x ...  x n[d-1] array.
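 
    For a three-dimensional array, the two layouts can be compared by
 computing where a given logical element (i0, i1, i2) lives in each
 local array.  The following is a minimal sketch, assuming the usual
 row-major order and offsets counted in array elements (not bytes).  The
 function names are hypothetical, and the parameters 'local_0_start' and
 'local_1_start' stand for the index of the first 'n0' (respectively
 'n1') slice held by the current process.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of logical element (i0, i1, i2) in the local slab of the
   non-transposed n0 x n1 x n2 array, distributed along n0. */
ptrdiff_t offset_normal(ptrdiff_t i0, ptrdiff_t i1, ptrdiff_t i2,
                        ptrdiff_t n1, ptrdiff_t n2,
                        ptrdiff_t local_0_start)
{
    assert(i0 >= local_0_start); /* element must be on this process */
    return ((i0 - local_0_start) * n1 + i1) * n2 + i2;
}

/* Offset of the same logical element in the local slab of the
   transposed n1 x n0 x n2 array, distributed along n1. */
ptrdiff_t offset_transposed(ptrdiff_t i0, ptrdiff_t i1, ptrdiff_t i2,
                            ptrdiff_t n0, ptrdiff_t n2,
                            ptrdiff_t local_1_start)
{
    assert(i1 >= local_1_start); /* element must be on this process */
    return ((i1 - local_1_start) * n0 + i0) * n2 + i2;
}
```

    Only the roles of the first two indices are exchanged; the 'n2'
 index (and any later one) remains the fastest-varying in both layouts.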
 
    There are two ways to find the portion of the transposed array that
 resides on the current process.  First, you can simply call the
 appropriate 'local_size' function, passing n[1] x n[0] x n[2] x ...  x
 n[d-1] (the transposed dimensions).  This would mean calling the
 'local_size' function twice, once for the transposed and once for the
 non-transposed dimensions.  Alternatively, you can call one of the
 'local_size_transposed' functions, which return both the non-transposed
 and transposed data distribution from a single call.  For example, for a
 3d transform with transposed output (or input), you might call:
 
      ptrdiff_t fftw_mpi_local_size_3d_transposed(
                      ptrdiff_t n0, ptrdiff_t n1, ptrdiff_t n2, MPI_Comm comm,
                      ptrdiff_t *local_n0, ptrdiff_t *local_0_start,
                      ptrdiff_t *local_n1, ptrdiff_t *local_1_start);
 
    Here, 'local_n0' and 'local_0_start' give the size and starting index
 of the 'n0' dimension for the _non_-transposed data, as in the previous
 sections.  For _transposed_ data (e.g., the output for
 'FFTW_MPI_TRANSPOSED_OUT'), 'local_n1' and 'local_1_start' give the size
 and starting index of the 'n1' dimension, which is the first dimension
 of the transposed data ('n1' by 'n0' by 'n2').
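 
    To make the meaning of the four output parameters concrete, here is
 a small stand-alone sketch that produces values of the same shape
 without calling FFTW.  It assumes a simple block decomposition with
 blocks of ceil(n/P) indices for P processes; that rule is an
 assumption made for illustration, and both 'sketch_block' and
 'sketch_local_size_transposed' are hypothetical helpers, not library
 functions.  In real code the values (and the allocation size, which is
 the return value) should always come from the 'local_size' call
 itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an FFTW function): split n indices over
   n_procs processes in blocks of ceil(n / n_procs) and report the
   block owned by 'rank'.  Trailing ranks may receive no indices. */
void sketch_block(ptrdiff_t n, int n_procs, int rank,
                  ptrdiff_t *local_n, ptrdiff_t *local_start)
{
    assert(n > 0 && n_procs > 0 && rank >= 0 && rank < n_procs);
    ptrdiff_t block = (n + n_procs - 1) / n_procs;
    ptrdiff_t start = block * rank;
    if (start > n)
        start = n;                  /* clamp for ranks with no data */
    ptrdiff_t remaining = n - start;
    *local_n = remaining < block ? remaining : block;
    *local_start = start;
}

/* Mirrors the four outputs described above: the n0 block describes the
   non-transposed data, the n1 block describes the transposed data. */
void sketch_local_size_transposed(ptrdiff_t n0, ptrdiff_t n1,
                                  int n_procs, int rank,
                                  ptrdiff_t *local_n0,
                                  ptrdiff_t *local_0_start,
                                  ptrdiff_t *local_n1,
                                  ptrdiff_t *local_1_start)
{
    sketch_block(n0, n_procs, rank, local_n0, local_0_start);
    sketch_block(n1, n_procs, rank, local_n1, local_1_start);
}
```

    With 'n0' = 10, 'n1' = 7 and four processes, this sketch gives
 rank 1 the 'n0' indices 3 to 5 of the non-transposed array and the 'n1'
 indices 2 to 3 of the transposed array.  The two blocks are computed
 independently of each other.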
 
    (Note that 'FFTW_MPI_TRANSPOSED_IN' is completely equivalent to
 specifying 'FFTW_MPI_TRANSPOSED_OUT' and passing the first two
 dimensions to the planner in reverse order, or vice versa.  If you pass
 _both_ the 'FFTW_MPI_TRANSPOSED_IN' and 'FFTW_MPI_TRANSPOSED_OUT' flags,
 it is equivalent to swapping the first two dimensions passed to the
 planner and passing _neither_ flag.)
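 
    These equivalences are pure bookkeeping on the first two dimensions,
 which the following sketch spells out.  It does not call FFTW: the
 'SKETCH_' constants are illustrative stand-ins rather than the real
 flag values, and 'sketch_layouts' is a hypothetical helper that reports
 the leading two dimensions of the input and output layouts for a given
 planner call.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the two flags; NOT the FFTW constants. */
enum { SKETCH_TRANSPOSED_IN = 1, SKETCH_TRANSPOSED_OUT = 2 };

/* Given the first two dimensions passed to the planner and the flags,
   report the first two dimensions of the input and output layouts. */
void sketch_layouts(ptrdiff_t n0, ptrdiff_t n1, unsigned flags,
                    ptrdiff_t in_dims[2], ptrdiff_t out_dims[2])
{
    assert(n0 > 0 && n1 > 0);
    int t_in = (flags & SKETCH_TRANSPOSED_IN) != 0;
    int t_out = (flags & SKETCH_TRANSPOSED_OUT) != 0;

    in_dims[0] = t_in ? n1 : n0;    /* transposed input is n1 x n0   */
    in_dims[1] = t_in ? n0 : n1;
    out_dims[0] = t_out ? n1 : n0;  /* transposed output is n1 x n0  */
    out_dims[1] = t_out ? n0 : n1;
}
```

    For example, (4, 6) with the transposed-input stand-in and (6, 4)
 with the transposed-output stand-in both describe a 6 by 4 input and a
 4 by 6 output, while (4, 6) with both stand-ins matches (6, 4) with
 neither: a 6 by 4 input and a 6 by 4 output.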