6.4.2 Load balancing
--------------------
Ideally, when you parallelize a transform over some P processes, each
process should end up with work that takes equal time. Otherwise, all
of the processes end up waiting on whichever process is slowest. This
goal is known as "load balancing." In this section, we describe the
circumstances under which FFTW is able to load-balance well, and in
particular how you should choose your transform size in order to
load-balance.

Load balancing is especially difficult when you are parallelizing
over heterogeneous machines; for example, if one of your processors is
an old 486 and another is a Pentium IV, you should give the Pentium
more work to do than the 486, since the latter is much slower. FFTW
does not deal with this problem, however: it assumes that your
processes run on hardware of comparable speed, and that the goal is
therefore to divide the problem as equally as possible.

For a multi-dimensional complex DFT, FFTW can divide the problem
equally among the processes if: (i) the _first_ dimension 'n0' is
divisible by P; and (ii) the _product_ of the subsequent dimensions is
divisible by P. (For the advanced interface, where you can specify
multiple simultaneous transforms via some "vector" length 'howmany', a
factor of 'howmany' is included in the product of the subsequent
dimensions.)
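
The two divisibility conditions above can be checked before a transform
size is committed to. The following is a minimal sketch only: the
function name 'can_balance_multidim' and its argument layout are
hypothetical and are not part of the FFTW API. It assumes the
dimensions are given in order with 'n[0]' as the first dimension, and
that 'howmany' is 1 when the advanced interface is not used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an FFTW function): returns 1 if a
   multi-dimensional complex DFT with dimensions n[0..rank-1] and
   vector length howmany satisfies both divisibility conditions
   for p processes, and 0 otherwise. */
int can_balance_multidim(int rank, const ptrdiff_t *n,
                         ptrdiff_t howmany, ptrdiff_t p)
{
    assert(rank >= 2 && howmany > 0 && p > 0);

    /* (i) the first dimension must be divisible by P */
    if (n[0] % p != 0)
        return 0;

    /* (ii) the product of the subsequent dimensions, including the
       factor of howmany, must be divisible by P.  Only the product
       modulo p is tracked, so large dimensions do not overflow. */
    ptrdiff_t rem = howmany % p;
    for (int i = 1; i < rank; ++i)
        rem = (rem * (n[i] % p)) % p;

    return rem == 0;
}
```

For example, with P = 8 a 128 x 96 x 50 transform passes both checks,
whereas a 100 x 96 x 50 transform fails condition (i) because 100 is
not divisible by 8.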

For a one-dimensional complex DFT, the length 'N' of the data should
be divisible by P _squared_ to be able to divide the problem equally
among the processes.
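
The one-dimensional condition can be sketched the same way. The names
'can_balance_1d' and 'round_up_1d' below are hypothetical helpers, not
FFTW functions. The second one is only useful when your application is
free to choose 'N' (a different length is a different transform), and
it makes no attempt to pick a length that FFTW transforms quickly; it
only satisfies the divisibility condition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if a one-dimensional complex DFT
   of length n is divisible by p squared, and 0 otherwise. */
int can_balance_1d(ptrdiff_t n, ptrdiff_t p)
{
    assert(n > 0 && p > 0);
    return n % (p * p) == 0;
}

/* Hypothetical helper: smallest multiple of p squared that is
   greater than or equal to n. */
ptrdiff_t round_up_1d(ptrdiff_t n, ptrdiff_t p)
{
    assert(n > 0 && p > 0);
    ptrdiff_t q = p * p;
    return ((n + q - 1) / q) * q;
}
```

With P = 8, a length of 1024 satisfies the condition (1024 = 16 * 64),
while a length of 1000 does not; the nearest larger length that does
is 1024.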