fftw3: Load balancing

 
 6.4.2 Load balancing
 --------------------
 
 Ideally, when you parallelize a transform over some P processes, each
 process should end up with work that takes equal time.  Otherwise, all
 of the processes end up waiting on whichever process is slowest.  This
 goal is known as "load balancing."  In this section, we describe the
 circumstances under which FFTW is able to load-balance well, and in
 particular how you should choose your transform size in order to load
 balance.
 
    Load balancing is especially difficult when you are parallelizing
 over heterogeneous machines; for example, if one of your processors is
 an old 486 and another is a Pentium IV, you should give the Pentium
 more work to do than the 486, since the latter is much slower.
 FFTW does not deal with this problem, however--it assumes that your
 processes run on hardware of comparable speed, and that the goal is
 therefore to divide the problem as equally as possible.
 
    For a multi-dimensional complex DFT, FFTW can divide the problem
 equally among the processes if: (i) the _first_ dimension 'n0' is
 divisible by P; and (ii) the _product_ of the subsequent dimensions is
 divisible by P. (For the advanced interface, where you can specify
 multiple simultaneous transforms via some "vector" length 'howmany', a
 factor of 'howmany' is included in the product of the subsequent
 dimensions.)
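 
    As an illustration, the two conditions above can be checked before
 choosing a transform size.  The following is only a minimal sketch:
 'can_balance_multidim' is a hypothetical helper written for this
 example, not part of the FFTW API, and it assumes that all dimensions,
 'howmany', and the process count are positive.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (NOT an FFTW function): returns 1 if a rank-'rnk'
   complex DFT of dimensions n[0] x n[1] x ... x n[rnk-1] meets both
   divisibility conditions for 'nprocs' processes, and 0 otherwise.
   Pass howmany = 1 when no "vector" length is used. */
int can_balance_multidim(int rnk, const ptrdiff_t *n,
                         ptrdiff_t howmany, ptrdiff_t nprocs)
{
    ptrdiff_t rest;
    int i;

    assert(rnk >= 2 && howmany > 0 && nprocs > 0);

    /* Condition (i): the first dimension n0 is divisible by P. */
    if (n[0] % nprocs != 0)
        return 0;

    /* Condition (ii): the product of the subsequent dimensions, times
       howmany, is divisible by P.  The product is tracked modulo P so
       that large dimensions do not overflow. */
    rest = howmany % nprocs;
    for (i = 1; i < rnk; ++i)
        rest = (rest * (n[i] % nprocs)) % nprocs;

    return rest == 0;
}
```

    For example, under these assumptions a 128 x 96 x 50 transform over
 8 processes passes both checks, whereas a 100 x 64 transform fails
 because 100 is not divisible by 8.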
 
    For a one-dimensional complex DFT, FFTW can divide the problem
 equally among the processes if the length 'N' of the data is divisible
 by P _squared_.
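 
    The one-dimensional condition can be tested in the same spirit.
 Again, this is only a sketch under stated assumptions:
 'can_balance_1d' is a hypothetical helper, not an FFTW routine, and it
 assumes a positive length and process count.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (NOT an FFTW function): returns 1 if a
   one-dimensional complex DFT of length 'n' is divisible by P squared
   for P = 'nprocs' processes, and 0 otherwise. */
int can_balance_1d(ptrdiff_t n, ptrdiff_t nprocs)
{
    assert(n > 0 && nprocs > 0);

    /* n is divisible by P*P exactly when n is divisible by P and the
       quotient n/P is again divisible by P; this avoids forming P*P. */
    return n % nprocs == 0 && (n / nprocs) % nprocs == 0;
}
```

    For example, a length of 1024 satisfies the condition for 4
 processes (1024 is divisible by 16) but not for 64 processes, since
 1024 is not divisible by 4096.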