- •Introduction
- •Standard Radix-2 FFT Algorithm
- •Figure 1. Standard Structure of the 16-Point FFT
- •Listing 1. fft32_unoptimized.asm
- •Table 1. Core Clock Cycles for N-point Complex FFT
- •Figure 2. Reorganized Structure of the 16-Point FFT
- •Pipelining of the Algorithm
- •Figure 3. Reorganized Structure’s Dependencies
- •Table 6. Pipelined Butterflies
- •The Code
- •Listing 2. fft32.asm - fragment
- •Usage Rules
- •Appendix
- •Complete Source Code of the Optimized FFT
- •Listing 3. fft32.asm
- •References
- •Document History
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
a |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
ADSP-TS101 |
ADSP-TS101 |
|
ADSP-TS201 |
|
|
ADSP-TS201 |
|
|
ADSP-TS201 |
|
|
ADSP-TS201 |
|
|
N |
|
Old structure |
New structure |
|
Old structure- |
|
|
New structure- |
|
|
Old structure- |
|
|
New structure- |
|
|
|
|
|
|
Input not in |
|
|
Input not in |
|
|
Input in cache |
|
|
Input in cache |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
||||
|
|
|
|
|
|
cache |
|
|
cache |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
256 |
|
2172 |
1958 |
2641 |
|
2402 |
|
2218 |
|
1963 |
|
|||||
|
|
|
|
|
|
|
|
|
|
|
|
|||||
|
512 |
|
4582 |
4276 |
|
5533 |
|
|
5192 |
|
|
4649 |
|
|
4283 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1024 |
|
9872 |
9410 |
|
12170 |
|
|
11662 |
|
|
9992 |
|
|
9419 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
2048 |
|
21338 |
20688 |
|
26610 |
|
|
25316 |
|
|
22173 |
|
|
20699 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
4096 |
|
46244 |
45278 |
|
197272 |
|
|
69924 |
|
|
NA |
|
|
NA |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
8192 |
|
99886 |
98540 |
|
444628 |
|
|
147628 |
|
|
NA |
|
|
NA |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
16384 |
|
215224 |
213243 |
987730 |
|
313292 |
|
|
NA |
|
NA |
|||||
|
|
|
|
|
|
|
|
|
|
|
|
|||||
32768 |
|
NA |
NA |
2133220 |
|
662614 |
|
|
NA |
|
NA |
|||||
|
|
|
|
|
|
|
|
|
|
|
|
|||||
|
65536 |
|
NA |
NA |
|
4720010 |
|
|
1397544 |
|
|
NA |
|
|
NA |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Table 7. Core Clock Cycles for N-point Complex FFT, New versus Old Structure
Usage Rules
The C-callable complex FFT routine is called as
FFT32(&(input), &(ping_pong_buffer1), &(ping_pong_buffer2), &(output), N, F);
where
input -> FFT input buffer, output -> FFT output buffer,
ping_pong_bufferx are the ping pong buffers, N=Number of complex points,
F=0 if FFT is real and 1 if FFT is complex.
As mentioned earlier, due to data re-ordering, stages cannot be done in-place and have to pingpong. Thus, ping_pong_buffer1 and ping_pong_buffer2 have to be two distinct buffers. However, depending on the routine’s user requirements, some memory optimization is possible. Ping_pong_buffer1 can be made the
same as input if input does not need to be preserved. Also, if Log2(N) is even, output can be made the same as ping_pong_buffer2 and if Log2(N) is odd, output can be made the same as ping_pong_buffer1. Below are two examples of the routine usage with minimal use of memory:
FFT32(&(input), &( input),
&( output), &(output), 1024, 1);
FFT32(&(input), &(input),
&( ping_pong_buffer2), &( input), 2048, 1);
To eliminate memory block access conflicts, input must reside in a different memory block than ping_pong_buffer2 and twiddle factors must reside in a different memory block than the pingpong buffers. Of course, all code must reside in a block that is different from all the data buffers, as well. Ping-pong buffers can share a memory block, however – there is no instruction that accesses both ping-pong buffers in the same cycle.
Writing Efficient Floating-Point FFTs for ADSP-TS201 TigerSHARC® Processors (EE-218) |
Page 10 of 16 |
