One of the advantages of using fluidfft is the ability to use the fastest FFT library for a particular problem on a particular (super)computer.
We provide command-line utilities to easily run and analyze benchmarks. For example, you can run the commands:
fluidfft-bench -h

# 2d
fluidfft-bench 1024 768
fluidfft-bench 1024 -d 2
mpirun -np 2 fluidfft-bench 1024 -d 2

# 3d
fluidfft-bench 32 48 64
fluidfft-bench 128 -d 3
mpirun -np 2 fluidfft-bench 128 -d 3
Once you have run many benchmarks (to get statistics) for different numbers of processes (if you want to use MPI), you can analyze the results, for example with:
fluidfft-bench-analysis 1024 -d 2
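To accumulate statistics, the same benchmark can simply be launched several times. A minimal sketch of such a repetition loop (the command and options are those shown above; the number of runs is an arbitrary choice):

import subprocess

# Arbitrary number of repetitions; more runs give better statistics.
nb_runs = 10

for _ in range(nb_runs):
    # Same 2d benchmark as above; adapt the resolution and the -d option,
    # or prepend ["mpirun", "-np", "2"] for an MPI run.
    subprocess.run(["fluidfft-bench", "1024", "-d", "2"], check=True)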
Benchmarks on Occigen
Occigen is a GENCI-CINES HPC cluster.
For each FFT class available for the resolution, and for the two tasks fft and ifft, three functions are benchmarked and compared (see the legends and the sketch after this list):

- “fft_cpp” (continuous lines): benchmark of the C++ function called from the C++ code. No memory allocation.
- “fft_as_arg” (dashed lines): benchmark of the Python method fft_as_arg called from Python. As in the C++ code, the second argument of this method is an array to contain the result of the transform, so no memory allocation is needed.
- “fft_return” (dotted lines): benchmark of the Python method fft called from Python. No array is provided to the function to contain the result, so a numpy array is created and then returned by the function.
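For concreteness, the two Python call styles look roughly as follows. This is only a sketch: the import path and the shape helpers get_shapeX_loc/get_shapeK_loc are assumptions about fluidfft's 2d classes, while fft and fft_as_arg are the methods named above.

import numpy as np

# Assumed import path for a sequential 2d FFT class.
from fluidfft.fft2d.with_fftw2d import FFT2D

o = FFT2D(1024, 768)
a = np.random.rand(*o.get_shapeX_loc())

# "fft_return" style: the result array is allocated inside the call.
a_fft = o.fft(a)

# "fft_as_arg" style: the caller provides the output array,
# so the call involves no memory allocation.
a_fft2 = np.empty(o.get_shapeK_loc(), dtype=np.complex128)
o.fft_as_arg(a, a_fft2)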
The fastest methods are fftw1d (which is limited to 192 cores) and p3dfft.
The benchmarks are not accurate enough to measure the cost of calling the functions from Python (the difference between the continuous and dashed lines, i.e. between pure C++ and the “as_arg” Python method), or even the cost of creating the numpy array (the difference between the dashed and dotted lines, i.e. between the “as_arg” and “return” Python methods).
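These overheads can also be probed directly with timeit; a self-contained sketch, under the same API assumptions as in the previous example:

import numpy as np
from timeit import timeit

# Assumed import path, as in the previous sketch.
from fluidfft.fft2d.with_fftw2d import FFT2D

o = FFT2D(1024, 768)
a = np.random.rand(*o.get_shapeX_loc())
a_fft = np.empty(o.get_shapeK_loc(), dtype=np.complex128)

# The difference between the two timings estimates the cost of
# creating and returning a new numpy array at each call.
t_as_arg = timeit(lambda: o.fft_as_arg(a, a_fft), number=100)
t_return = timeit(lambda: o.fft(a), number=100)
print(f"fft_as_arg: {t_as_arg:.3f} s, fft: {t_return:.3f} s")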
For this resolution, fftw1d is also the fastest method when using only a few cores, but it cannot be used with more than 576 cores. The fastest library when using more cores is also p3dfft.
Benchmarks on Beskow
Benchmarks on a LEGI cluster
We ran some benchmarks on Cluster8 (2015, 12 Xeon DELL C6320 nodes, 20 cores per node).
We see that the scaling is not far from linear for intra-node computations. In contrast, the speedup is very poor for inter-node computations.
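As a reminder of what these curves measure: the speedup for n processes is the elapsed time with one process divided by the elapsed time with n processes, and linear scaling corresponds to a parallel efficiency of 1. A small illustration (the timings below are hypothetical, not measurements):

# Hypothetical elapsed times (s) versus number of processes.
times = {1: 10.0, 10: 1.2, 20: 0.7, 40: 2.5}

t1 = times[1]
for nb_proc in sorted(times):
    speedup = t1 / times[nb_proc]
    # Efficiency of 1.0 corresponds to perfect linear scaling.
    efficiency = speedup / nb_proc
    print(f"{nb_proc:3d} procs: speedup {speedup:5.2f}, efficiency {efficiency:.2f}")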