Facebook has written a Fast Fourier Transform implementation, fbfft, that is about 1.5x faster than NVIDIA's cuFFT at sizes 8-64. The paper “Fast Convolutional Nets with fbfft: A GPU Performance Evaluation” attributes the performance gains to three changes: an FFT layout that does not require an explicitly zero-padded input (potentially eliminating data copies), autotuning, and clipping, which conditionally loads a value and so allows more efficient control flow than explicit loop prologues and epilogues.
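To make the clipping idea concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not fbfft's CUDA code: the helper names `clipped_load` and `dft_implicit_padding` are hypothetical, and a naive O(n^2) DFT stands in for a real FFT. The point is only that a conditional load which returns zero for out-of-range indices lets a transform of size n read a shorter input directly, with no separate zero-padded copy.

```python
import numpy as np

def clipped_load(signal, i):
    # Hypothetical helper: return signal[i] when i is in bounds, else 0.
    # Out-of-range reads behave like zero padding without allocating a padded buffer.
    return signal[i] if 0 <= i < len(signal) else 0.0

def dft_implicit_padding(signal, n):
    # Naive O(n^2) DFT of size n (illustration only, not an FFT).
    # Every input read goes through clipped_load, so len(signal) may be smaller than n.
    out = np.zeros(n, dtype=np.complex128)
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += clipped_load(signal, t) * np.exp(-2j * np.pi * k * t / n)
        out[k] = acc
    return out

# Usage: a length-5 input transformed at size 8, compared with explicit zero padding.
signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
implicit = dft_implicit_padding(signal, 8)
explicit = np.fft.fft(np.concatenate([signal, np.zeros(3)]))
print(np.allclose(implicit, explicit))
```

On a GPU the same pattern would typically be a predicated load inside the kernel, so all threads can run the same loop body instead of separate prologue and epilogue code for the boundary elements.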
The Facebook AI Research authors, Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun, “observed an overall mean speedup of 1.51× with standard deviation 0.21 and geometric mean 1.49×. The minimum speedup was 1.21×, despite sometimes performing more computations with fbfft which can only interpolate to a power of 2. These experiments exercise the zero-copy padding and lower memory footprints of fbfft compared to cuFFT.” The authors say they are working on additional optimizations such as tiling and bit twiddling elision.
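The power-of-2 restriction mentioned in the quote can be sketched as follows. This is a NumPy illustration under assumptions, not fbfft's implementation: the function names `next_power_of_two` and `fft_convolve_pow2` are hypothetical, and the example is one-dimensional for brevity. It shows why a power-of-2 transform can do extra work: the convolution length is rounded up to the next power of two before transforming.

```python
import numpy as np

def next_power_of_two(n):
    # Smallest power of two that is >= n, for n >= 1.
    return 1 << (n - 1).bit_length()

def fft_convolve_pow2(x, w):
    # Full linear convolution of two 1-D real arrays via FFTs of power-of-two length.
    full_len = len(x) + len(w) - 1          # length of the linear convolution
    fft_len = next_power_of_two(full_len)   # rounded up, which may add extra work
    X = np.fft.rfft(x, n=fft_len)           # the n argument zero-pads the input
    W = np.fft.rfft(w, n=fft_len)
    y = np.fft.irfft(X * W, n=fft_len)      # pointwise product, then inverse transform
    return y[:full_len]                     # discard the padding tail

# Usage: a length-13 signal and a length-3 kernel need 15 outputs,
# so the transforms run at size 16.
x = np.arange(13.0)
w = np.array([1.0, 2.0, 3.0])
print(next_power_of_two(len(x) + len(w) - 1))
print(np.allclose(fft_convolve_pow2(x, w), np.convolve(x, w)))
```

A size just above a power of two is the unfavorable case: 17 outputs would require transforms of size 32, nearly double the minimum length.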
For more information, see the arXiv paper “Fast Convolutional Nets with fbfft: A GPU Performance Evaluation” or Facebook's GitHub repository for fbcunn.