3.1 SIMD alignment and fftw_malloc
==================================
SIMD, which stands for "Single Instruction, Multiple Data," refers to
special instructions, supported by some processors, that perform a
single operation on several numbers (usually 2 or 4) simultaneously.
SIMD floating-point instructions are available on several popular CPUs:
SSE/SSE2/AVX/AVX2/AVX512/KCVI on some x86/x86-64 processors, AltiVec and
VSX on some POWER/PowerPC processors, and NEON on some ARM models. FFTW
can be compiled to support the SIMD instructions on any of these systems.

A program linking to an FFTW library compiled with SIMD support can
obtain a nonnegligible speedup for most complex and r2c/c2r transforms.
To obtain this speedup, however, the arrays of complex (or real) data
passed to FFTW must be specially aligned in memory (typically 16-byte
aligned), and this alignment is often more stringent than that provided
by the usual 'malloc' (etc.) allocation routines.
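
As a rough illustration of what "16-byte aligned" means, the following
sketch tests whether the address held in a pointer is a multiple of a
given alignment. The helper name 'is_aligned' is hypothetical and is not
part of the FFTW API, and 16 bytes is only the typical figure quoted
above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an FFTW routine): returns true if the
   address stored in p is a multiple of alignment.  The alignment
   must be a nonzero power of two. */
static bool is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (uintptr_t)(alignment - 1)) == 0;
}
```

For example, if 'a' is a 16-byte-aligned array of 'double' (assuming the
usual 8-byte 'double'), then 'is_aligned(a, 16)' holds but
'is_aligned(a + 1, 16)' does not, because 'a + 1' lies 8 bytes past an
aligned address.
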
To guarantee proper alignment for SIMD, in case your program is ever
linked against a SIMD-using FFTW, we therefore recommend allocating your
transform data with 'fftw_malloc' and de-allocating it with 'fftw_free'.
These have exactly the same interface and behavior as 'malloc'/'free',
except that for a SIMD FFTW they ensure that the returned pointer has
the necessary alignment (by calling 'memalign' or its equivalent on your
OS). Memory obtained from 'fftw_malloc' must be released with
'fftw_free', not with the ordinary 'free'.
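
To make the idea of an aligned 'malloc' concrete, here is a minimal
sketch of an allocator pair built on the C11 routine 'aligned_alloc'. It
is not FFTW's implementation: the names 'sketch_aligned_malloc' and
'sketch_aligned_free' and the 16-byte constant are assumptions chosen
for illustration only. A real program should simply call 'fftw_malloc'
and 'fftw_free'.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment for this sketch; the value FFTW actually needs
   depends on the SIMD instruction set it was compiled for. */
#define SKETCH_ALIGNMENT 16

/* Hypothetical stand-in for an aligned malloc (not FFTW code). */
static void *sketch_aligned_malloc(size_t n)
{
    /* C11 aligned_alloc expects the size to be a multiple of the
       alignment, so round the request up. */
    size_t rounded =
        (n + SKETCH_ALIGNMENT - 1) / SKETCH_ALIGNMENT * SKETCH_ALIGNMENT;
    if (rounded < n)
        return NULL;                 /* size computation overflowed */
    if (rounded == 0)
        rounded = SKETCH_ALIGNMENT;  /* avoid a zero-size request */

    void *p = aligned_alloc(SKETCH_ALIGNMENT, rounded);
    assert(p == NULL || (uintptr_t)p % SKETCH_ALIGNMENT == 0);
    return p;
}

/* Memory from aligned_alloc is released with the ordinary free. */
static void sketch_aligned_free(void *p)
{
    free(p);
}
```

The interface mirrors 'malloc'/'free': the caller passes a size in bytes,
checks the result for NULL, and hands the pointer back to the matching
free routine when done.
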
You are not _required_ to use 'fftw_malloc'. You can allocate your
data in any way that you like, from 'malloc' to 'new' (in C++) to a
fixed-size array declaration. If the array happens not to be properly
aligned, FFTW will not use the SIMD extensions.

Since 'fftw_malloc' only ever needs to be used for real and complex
arrays, we provide two convenient wrapper routines 'fftw_alloc_real(N)'
and 'fftw_alloc_complex(N)' that are equivalent to
'(double*)fftw_malloc(sizeof(double) * N)' and
'(fftw_complex*)fftw_malloc(sizeof(fftw_complex) * N)', respectively (or
their equivalents in other precisions).