fftw3: Multi-dimensional MPI DFTs of Real Data

 
 6.5 Multi-dimensional MPI DFTs of Real Data
 ===========================================
 
 FFTW's MPI interface also supports multi-dimensional DFTs of real data,
 similar to the serial r2c and c2r interfaces.  (Parallel one-dimensional
 real-data DFTs are not currently supported; you must use a complex
 transform and set the imaginary parts of the inputs to zero.)
 
    The key points to understand for r2c and c2r MPI transforms (compared
 to the MPI complex DFTs or the serial r2c/c2r transforms), are:
 
    * Just as for serial transforms, r2c/c2r DFTs transform n[0] x n[1] x
      n[2] x ...  x n[d-1] real data to/from n[0] x n[1] x n[2] x ...  x
      (n[d-1]/2 + 1) complex data: the last dimension of the complex data
      is cut in half (rounded down), plus one.  As for the serial
      transforms, the sizes you pass to the 'plan_dft_r2c' and
      'plan_dft_c2r' are the n[0] x n[1] x n[2] x ...  x n[d-1]
      dimensions of the real data.
 
    * Although the real data is _conceptually_ n[0] x n[1] x n[2] x ...
      x n[d-1] , it is _physically_ stored as an n[0] x n[1] x n[2] x ...
      x [2 (n[d-1]/2 + 1)] array, where the last dimension has been
      _padded_ to make it the same size as the complex output.  This is
      much like the in-place serial r2c/c2r interface (See
      Multi-Dimensional DFTs of Real Data), except that in MPI the
      padding is required even for out-of-place data.  The extra padding
      numbers are ignored by FFTW (they are _not_ like zero-padding the
      transform to a larger size); they are only used to determine the
      data layout.
 
    * The data distribution in MPI for _both_ the real and complex data
      is determined by the shape of the _complex_ data.  That is, you
      call the appropriate 'local size' function for the n[0] x n[1] x
      n[2] x ...  x (n[d-1]/2 + 1) complex data, and then use the _same_
      distribution for the real data except that the last complex
      dimension is replaced by a (padded) real dimension of twice the
      length.
 
    For example suppose we are performing an out-of-place r2c transform
 of L x M x N real data [padded to L x M x 2(N/2+1) ], resulting in L x M
 x N/2+1 complex data.  Similar to the example in See2d MPI example,
 we might do something like:
 
      #include <fftw3-mpi.h>
 
      int main(int argc, char **argv)
      {
          const ptrdiff_t L = ..., M = ..., N = ...;
          fftw_plan plan;
          double *rin;
          fftw_complex *cout;
          ptrdiff_t alloc_local, local_n0, local_0_start, i, j, k;
 
          MPI_Init(&argc, &argv);
          fftw_mpi_init();
 
          /* get local data size and allocate */
          alloc_local = fftw_mpi_local_size_3d(L, M, N/2+1, MPI_COMM_WORLD,
                                               &local_n0, &local_0_start);
          rin = fftw_alloc_real(2 * alloc_local);
          cout = fftw_alloc_complex(alloc_local);
 
          /* create plan for out-of-place r2c DFT */
          plan = fftw_mpi_plan_dft_r2c_3d(L, M, N, rin, cout, MPI_COMM_WORLD,
                                          FFTW_MEASURE);
 
          /* initialize rin to some function my_func(x,y,z) */
          for (i = 0; i < local_n0; ++i)
             for (j = 0; j < M; ++j)
               for (k = 0; k < N; ++k)
             rin[(i*M + j) * (2*(N/2+1)) + k] = my_func(local_0_start+i, j, k);
 
          /* compute transforms as many times as desired */
          fftw_execute(plan);
 
          fftw_destroy_plan(plan);
 
          MPI_Finalize();
      }
 
    Note that we allocated 'rin' using 'fftw_alloc_real' with an argument
 of '2 * alloc_local': since 'alloc_local' is the number of _complex_
 values to allocate, the number of _real_ values is twice as many.  The
 'rin' array is then local_n0 x M x 2(N/2+1) in row-major order, so its
 '(i,j,k)' element is at the index '(i*M + j) * (2*(N/2+1)) + k' (See
 Multi-dimensional Array Format).
 
    As for the complex transforms, improved performance can be obtained
 by specifying that the output is the transpose of the input or vice
 versa (SeeTransposed distributions).  In our L x M x N r2c example,
 including 'FFTW_TRANSPOSED_OUT' in the flags means that the input would
 be a padded L x M x 2(N/2+1) real array distributed over the 'L'
 dimension, while the output would be a M x L x N/2+1 complex array
 distributed over the 'M' dimension.  To perform the inverse c2r
 transform with the same data distributions, you would use the
 'FFTW_TRANSPOSED_IN' flag.