fftw3: One-dimensional distributions

 
 6.4.4 One-dimensional distributions
 -----------------------------------
 
 For one-dimensional distributed DFTs using FFTW, matters are slightly
 more complicated because the data distribution is more closely tied to
 how the algorithm works.  In particular, you can no longer pass an
 arbitrary block size and must accept FFTW's default; moreover, the
 block sizes may differ for input and output.  Finally, the data
 distribution depends on the flags and transform direction, in order
 for forward and backward transforms to work correctly.
 
      ptrdiff_t fftw_mpi_local_size_1d(ptrdiff_t n0, MPI_Comm comm,
                      int sign, unsigned flags,
                      ptrdiff_t *local_ni, ptrdiff_t *local_i_start,
                      ptrdiff_t *local_no, ptrdiff_t *local_o_start);
 
    This function computes the data distribution for a 1d transform of
 size 'n0' with the given transform 'sign' and 'flags'.  Both input and
 output data use block distributions.  The input on the current process
 will consist of 'local_ni' numbers starting at index 'local_i_start';
 e.g.  if only a single process is used, then 'local_ni' will be 'n0' and
 'local_i_start' will be '0'.  Similarly for the output, with 'local_no'
 numbers starting at index 'local_o_start'.  The return value of
 'fftw_mpi_local_size_1d' will be the total number of elements to
 allocate on the current process (which might be slightly larger than the
 local size due to intermediate steps in the algorithm).
 
    As mentioned above (see Load balancing), the data will be divided
 equally among the processes if 'n0' is divisible by the _square_ of the
 number of processes.  In this case, 'local_ni' will equal 'local_no'.
 Otherwise, they may be different.
 
    For some applications, such as convolutions, the order of the output
 data is irrelevant.  In this case, performance can be improved by
 specifying that the output data be stored in an FFTW-defined "scrambled"
 format.  (In particular, this is the analogue of transposed output in
 the multidimensional case: scrambled output saves a communications
 step.)  If you pass 'FFTW_MPI_SCRAMBLED_OUT' in the flags, then the
 output is stored in this (undocumented) scrambled order.  Conversely, to
 perform the inverse transform of data in scrambled order, pass the
 'FFTW_MPI_SCRAMBLED_IN' flag.
 
    In MPI FFTW, only composite sizes 'n0' can be parallelized; we have
 not yet implemented a parallel algorithm for large prime sizes.