Fast 1D convolution with AVX
Convolutions are important in a variety of fields. For example, in deep learning, convolutional layers represent a critical building block for most signal processing: image, sound or both. My company Lokad is also extensively using convolutions as part of its own algebra of distribution.
One technical problem associated to convolutions is that they are slow. The theory tells you that FFT can be used to obtain good asymptotic performance, but FFT isn’t typically an option when your signal has only a few dozen data points; but that you still need to process tens of millions of such signals.
It is possible to speed-up convolutions with a GPU, but my latest experients also tells me that a massive speed-up can already be achieved with CPUs as well, by leveraging their now widespread vector instructions. This idea isn’t new, and I was inspired by an original post of Henry Gomersall on the very same topic.
I have just released my own fast 1D convolution in C++ which differs substantially from the original one posted by Henry Gomersall. More specifically:
- This implementation works with AVX (rather than SSE), for further performance boost.
- The approach can canonically be upgraded to AVX2 (or even larger vector instruction).
- It delivers full convolutions rather than exact ones (NumPy terminology).
Compared to the naive C#/.NET implementation of convolution - which was my starting point - this implementation delivers a rough 20x speed-up; but it comes at the cost of doing PInvoke from C#/.NET.
Sidenote for .NET enthusiats: AVX intrinsics are coming to .NET. It will soon be possible to write such ultra-optimized code directly from C#/.NET.