BeagleBoard/GSoC/2010 Projects/FFTW

Project: NEON Support for FFTW
Student: Christopher Friedt

Mentors: Mans Rullgard, Philip Balister

Repositories:

http://gitorious.org/gsoc2010-fftw-neon

http://gitorious.org/gsoc2010-fftw-neon-misc

Blog: http://gsoc2010-fftw-neon.blogspot.com/

Latest blog entries: http://gsoc2010-fftw-neon.blogspot.com/feeds/posts/default

Overview
FFTW is perhaps one of the most widely used fast Fourier transform libraries available today. It builds upon decades of research and innovation in spectral analysis to provide a reliable, adaptive and extensible DFT / DCT / DST / RDFT solution. FFTW provides the backend for both the Matlab and GNU Octave "fft" commands, and has been integrated into other applications such as GNURadio.

Historically, FFTW was only optimized for personal computers based on the x86 or PowerPC architectures. Recently, support for the Cell Broadband Engine was also added. However, there was a distinct lack of ARM support. Since ARM devices have begun approaching the GHz range, they have become more suitable for personal computing devices, not just embedded controllers. The primary reason for the delay was that it was traditionally rare to find ARM processors with a dedicated floating point (FP) unit (essentially a prerequisite for FFTW).

However, with the introduction of the ARMv7 architecture and the Advanced SIMD Extensions (NEON), performing high-throughput floating-point processing became a serious reality. BeagleBoard made it possible and affordable for the masses to start tinkering with Cortex-A8 and NEON with the OMAP3 series of SoCs from Texas Instruments (who continues to be a world-class leader in producing ARM silicon... wink, wink).

Motivation
I've been working with ARM devices for several years, and have never lost one iota of interest. However, most of the devices I'd used for work (mainly ARMv4t and ARMv5te) were distinctly less powerful than the OMAP3. So once I heard about the project, I picked up a BeagleBoard almost as soon as they became available. As next generation chips become available (i.e. the OMAP4 series), ARM-based personal computing devices are coming closer and closer to reality. For example, although Intel-based netbooks are great, I always found that they consumed too much power, or the fan was too loud, or they became too hot, or that they were just too bulky for my purposes. In my opinion, a fanless, low-power ARM chip would most certainly dominate the netbook / tablet market. Furthermore, ever since I started my bachelor's degree, started working, and then subsequently started my master's, I always thought it would be super-cool to have the best graphing calculator money could buy for doing things like spectral analysis. Seriously - how cool would it be to just pull out your mobile and do a quick simulation and spectral analysis of a radio channel, or to provide a real-time scope reflecting the sounds that are all around you!? I thought it would be pretty cool, in any case (yes, I am a nerd).

Like any self-respecting open-source junkie embedded systems engineer, I immediately thought of GNU Octave and FFTW.

Hence, the goal of this project was to speed up FFTW performance on NEON-enabled ARM devices. That involved three primary feature additions and a demonstration.

List of Goals

 * 1) extending the FFTW SIMD interface to support the NEON instruction set
 * 2) adding a performance counter, so that the FFTW planner could accurately determine which algorithm was faster (rather than using approximation methods)
 * 3) adding code to produce faster transforms
 * 4) providing a demonstrable speedup (i.e. graph, screencast with GNU Octave)

Status
GSoC 2010 is winding down.


 * 1) FFTW SIMD Interface (complete)
 * 2) Performance Counter (complete)
 * 3) Speed, Speed and more Speed (approximately 80% complete)
 * 4) Demonstration (75% complete)

FFTW SIMD Interface
The NEON SIMD interface that I implemented can be configured with the '--enable-neon' option for FFTW. By default, it uses hand-optimized inline-assembler routines, but you can change that with '--enable-neon-intrinsics' if you would prefer (discouraged). The routines are used by FFTW's codelets in dft/simd/codelets and rdft/simd/codelets.

Performance Counter
The cycle counter (sometimes called performance counter) was originally given to me in a demo app by my mentor, Mans, and I basically just had to coerce it into the FFTW codebase and ensure that it was getting turned off and on correctly during regular FFTW usage (which required a bit of debugging). You can enable the cycle counter with '--enable-armv7a-cycle-counter'. The cycle counter is crucial to having an optimally performing FFTW library. If it is not enabled, then FFTW purely uses estimation methods for optimization.
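The difference between measuring and estimating can be sketched roughly as follows. This is only an illustration in Python, not FFTW's planner: the names `measure` and `pick_fastest` are hypothetical, and `time.perf_counter_ns` stands in for the ARMv7-A hardware cycle counter.

```python
import time


def measure(fn, arg, clock=time.perf_counter_ns, repeats=5):
    """Run fn(arg) several times and return the smallest elapsed clock reading."""
    best = None
    for _ in range(repeats):
        start = clock()
        fn(arg)
        elapsed = clock() - start
        if best is None or elapsed < best:
            best = elapsed
    return best


def pick_fastest(candidates, arg, clock=time.perf_counter_ns):
    """Time every candidate on the same input and return (winner, timings).

    candidates maps a label to a callable. Without a usable clock, a planner
    would have to fall back on a cost estimate instead of these measurements.
    """
    timings = {name: measure(fn, arg, clock) for name, fn in candidates.items()}
    winner = min(timings, key=timings.get)
    return winner, timings


if __name__ == "__main__":
    data = list(range(10000))
    candidates = {
        "builtin_sum": sum,
        "manual_loop": lambda xs: [x for x in xs] and None,
    }
    print(pick_fastest(candidates, data))
```

Taking the minimum over a few repeats is one common way to reduce noise from interrupts and cache warm-up; it is a design choice of this sketch rather than a statement about what FFTW does internally.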

Speed, Speed and more Speed
Speed. That's what is most important after all, right? I spent quite a bit of time at the beginning of the project familiarizing myself with the many intricacies of the NEON instruction set, getting the correct habits down for alignment, scheduling, and so on to produce the best throughput for parallel arithmetic. After implementing the SIMD interface in FFTW, the initial results were not particularly impressive and only made for a 1.5x to 2x increase in MFLOPS (as estimated by benchfft-3.1). I was aiming for a 10x increase in performance. To better gauge the potential, I put together an interface for FFTW to use the well-known and very fast power-of-two (POT) transforms of FFMPEG (see libavcodec/avfft.h from the ffmpeg git repository). This was done in an architecture-agnostic manner so that it could be used on any supported platform (not just ARM / NEON).

The FFMPEG routines blew way past what my SIMD interface had done, which was foreseeable, since all of the ffmpeg routines for NEON were written in hand-crafted assembler. Rather late in my project, I realized that the bottleneck was really poorly implemented memory copies. To be a bit clearer, you might need to read up about the Cooley-Tukey algorithm (basics of the mixed-radix DFT). Essentially, when one computes the DFT of, say, a length-N signal using smaller transforms of length N1 and N2 such that N1*N2=N, the inputs for the length-N1 and length-N2 transforms are not contiguous in memory. The solution to that problem is to either a) copy the input to a contiguous section of memory, or b) use somewhat more complicated indexing routines. However, since option 'b' could have different indexing systems for virtually any composite transform, since it could also incur cache penalties, and lastly since it makes life very difficult for writing efficient SIMD code, option 'a' is really the only practical alternative. So after implementing the SIMD interface, the resulting problem boiled down to using NEON instructions to optimize memory transfers to and from contiguous regions (what FFTW calls a rank-0 transform). I was working on that part up until the last day, and most of the groundwork is laid out, but I just didn't have time to get to the end (my MSc thesis demanded the major part of my time).
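To make the non-contiguous inputs concrete, here is a minimal Python sketch of one Cooley-Tukey step with N = N1*N2, using option 'a' (copy to a contiguous buffer first). The function names `gather`, `naive_dft` and `cooley_tukey_step` are made up for this illustration and are not FFTW internals; the sub-transforms use a plain O(n^2) DFT so that only the indexing is on display.

```python
import cmath


def naive_dft(x):
    """Direct O(n^2) DFT, used here for the small sub-transforms."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def gather(src, start, stride, count):
    """Strided copy into a new contiguous list (the memory-copy step)."""
    return [src[start + i * stride] for i in range(count)]


def cooley_tukey_step(x, n1, n2):
    """Length n1*n2 DFT built from length-n1 and length-n2 sub-transforms."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")

    # Step 1: n2 transforms of length n1. Each one reads x[j2], x[j2 + n2],
    # x[j2 + 2*n2], ... so its input has stride n2 and must be gathered.
    inner = [naive_dft(gather(x, j2, n2, n1)) for j2 in range(n2)]

    # Step 2: multiply by the twiddle factors exp(-2*pi*i * j2 * k1 / n).
    for j2 in range(n2):
        for k1 in range(n1):
            inner[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / n)

    # Step 3: n1 transforms of length n2, again on gathered (strided) data,
    # with the results scattered to output index k1 + n1 * k2.
    out = [0j] * n
    for k1 in range(n1):
        column = [inner[j2][k1] for j2 in range(n2)]
        spectrum = naive_dft(column)
        for k2 in range(n2):
            out[k1 + n1 * k2] = spectrum[k2]
    return out


if __name__ == "__main__":
    signal = [complex(j, (j * j) % 5) for j in range(12)]
    fast = cooley_tukey_step(signal, 3, 4)
    slow = naive_dft(signal)
    print(max(abs(a - b) for a, b in zip(fast, slow)))  # tiny rounding error
```

Every call to `gather` and every scatter into `out` is a strided memory copy, which is why the speed of those copies ends up dominating once the arithmetic itself has been vectorized.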

So, as it stands, POT transforms are ~10x faster because they simply fire off a call to FFMPEG's routines, but non-power-of-two transforms (NPOT, which FFTW is specifically renowned for) only exhibit the ~2x increase gained from my NEON SIMD interface. The good news is that the NPOT transform algorithms generally use a recursive or composite strategy to calculate the transform using composites of small prime numbers (such as 2, 3, 5, 7, 11, 13, etc). Actually, the NPOT algorithms that FFTW utilizes will theoretically work at the same approximate asymptotic complexity as the POT algorithms. So, once again, the problem simply boiled down to optimizing rank-0 transforms (strided memory copies).

Demonstration
The demonstration basically consists of graphs showing performance (see my project blog) and ultimately a screencast or video to show the real-world applicability of NEON optimizations to something like GNU Octave.

As for the video demonstration, within the next few days I'm planning on putting together a small screencast to show a real-world application (yaay!) using GNU Octave. The time it takes for the 'fft' command to execute will then be noticeably less, which is great if you're planning to use GNU Octave on a mobile device. Naturally, Octave has about a billion other things that still need to be optimized for ARM / NEON, but that is a completely different project.

Build Instructions
coming soon

Why would someone want to go through FFTW just to get to the good stuff in FFMPEG?
That's a really good question, and I have an equally good answer. The main drawback of using FFMPEG is that it only works with POT transforms (and works really well at that). However, when working with very large datasets or multidimensional data (neither of which FFMPEG provides a handle for), you will quickly find that you run out of memory using the traditional POT algorithms, because the only way to compute an NPOT transform with FFMPEG is to zero-pad it to a POT length. So, for example, if I wanted to compute the transform of a length-32769 ((2^15)+1) signal without getting the side effects of windowing, FFMPEG would require that I pad the signal to a length of 65536! This can have a major impact when one has limited memory resources (and cache size), like on the BeagleBoard or virtually any other ARM-based device.
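The padding cost is easy to check with a few lines of Python. This is just arithmetic; the helper names are hypothetical, and the 8 bytes per sample assumes single-precision complex data (two 32-bit floats), which is an assumption of this sketch.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


def buffer_bytes(length, bytes_per_sample=8):
    """Memory for one buffer; 8 bytes assumes single-precision complex."""
    return length * bytes_per_sample


if __name__ == "__main__":
    n = 2**15 + 1                      # 32769
    padded = next_power_of_two(n)      # 65536
    print(n, buffer_bytes(n))          # 32769 262152
    print(padded, buffer_bytes(padded))  # 65536 524288
```

One extra sample past a power of two therefore roughly doubles the buffer, and the same holds along every dimension of a multidimensional transform.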

On the other hand, FFTW has the ability to split the work of a length-32769 transform into composite transforms of 3, 3, 11, and 331. Recursively, FFTW can also reduce the prime-length-331 transform into smaller sized transforms using Rader's algorithm. FFTW continues this recursion, measuring the execution time of several different strategies (not unlike Dijkstra's algorithm for computing the shortest path through a directed graph), until it reduces the problem to several small, prime-length sub-problems. In this case, it could fire off a POT problem to FFMPEG or use the internal codelets to compute a length-3 composite transform, or virtually any other algorithm that it sees fit. FFTW then becomes a dispatcher of sorts, and a compiler of fast DFT algorithms, even for prime and NPOT lengths. Furthermore, it then stores the strategy so that if you have to perform the same computation in the future, you already have a plan to do it quickly.
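Both the factorization and the Rader re-indexing can be tried out in a short Python sketch. None of this is FFTW code and the function names are hypothetical. In particular, Rader's algorithm normally evaluates the length-(N-1) cyclic convolution with FFTs (which is where the smaller transforms come from, since 330 = 2*3*5*11); the convolution below is computed directly to keep the example short.

```python
import cmath


def factorize(n):
    """Prime factors of n in ascending order, by trial division."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def naive_dft(x):
    """Direct O(n^2) DFT, used as the reference result."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def primitive_root(p):
    """Smallest generator of the multiplicative group modulo an odd prime p."""
    prime_factors = set(factorize(p - 1))
    for g in range(2, p):
        if all(pow(g, (p - 1) // f, p) != 1 for f in prime_factors):
            return g
    raise ValueError("no primitive root found; p must be an odd prime")


def rader_dft(x):
    """DFT of odd prime length, rewritten as a cyclic convolution of length p-1."""
    p = len(x)
    g = primitive_root(p)
    g_inv = pow(g, p - 2, p)  # modular inverse of g, by Fermat's little theorem
    m = p - 1
    # Permute the non-zero input indices by powers of g, and the twiddle
    # factors by powers of g^-1.
    a = [x[pow(g, q, p)] for q in range(m)]
    b = [cmath.exp(-2j * cmath.pi * pow(g_inv, r, p) / p) for r in range(m)]
    out = [0j] * p
    out[0] = sum(x)
    for s in range(m):
        conv = sum(a[q] * b[(s - q) % m] for q in range(m))
        out[pow(g_inv, s, p)] = x[0] + conv
    return out


if __name__ == "__main__":
    print(factorize(32769))  # [3, 3, 11, 331]
    print(factorize(330))    # [2, 3, 5, 11]
    x = [complex(j % 5, j % 3) for j in range(7)]
    print(max(abs(u - v) for u, v in zip(rader_dft(x), naive_dft(x))))
```

The point of the permutation is that a prime length, which cannot be split by Cooley-Tukey, turns into a convolution of composite length N-1 that can.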

Really Dirty Technical Details
coming soon