Using FMA instructions for an FFT algorithm

I have a bit of C++ code that has grown into a somewhat useful FFT library over time, and it runs decently fast using SSE and AVX instructions. Granted, it is all based on a radix-2 algorithm, but it still holds up. My latest itch to scratch is making the butterfly calculations work with FMA instructions. The basic radix-2 butterfly consists of 4 multiplies and 6 additions or subtractions. A simple approach would replace 2 of the additions and subtractions and 2 of the multiplies with 2 FMA instructions, resulting in a mathematically identical butterfly, but there are apparently better ways of doing this:

https://books.google.com/books?id=2HG0DwAAQBAJ&pg=PA56&lpg=PA56&dq=radix+2+fft+fma&source=bl&ots=R5XDWyYBVv&sig=ACfU3U0S2n1hcgiP63LTKMxI5Oc85eEZaQ&hl=en&sa=X&ved=2ahUKEwiz_I3PsrToAhVoHzQIHYmVDGIQ6AEwDXoECAoQAQ#v=onepage&q=radix%202%20fft%20fma&f=false

ci1 = ci1 / cr1
u0 = zinr(0)
v0 = zini(0)
r = zinr(1)
s = zini(1)
u1 = r - s * ci1
v1 = r * ci1 + s
zoutr(0) = u0 + u1 * cr1
zouti(0) = v0 + v1 * cr1
zoutr(1) = u0 - u1 * cr1
zouti(1) = v0 - v1 * cr1

The author replaces all 10 adds, subs, and mults with 6 FMAs, provided that the imaginary part of the twiddle factor is divided by the real part. Part of the text reads "Note that cr1 != 0", which is essentially my problem in a nutshell. The math works just as advertised for all twiddle factors except when the real part of the twiddle is zero, in which case we end up dividing by zero.

Since efficiency is absolutely critical here, branching to a different butterfly when cr1 == 0 isn't a good option, especially when we're using SIMD to process multiple twiddles and butterflies at once, where perhaps only one element of cr1 is zero. My gut tells me that when cr1 == 0, cr1 and ci1 should be replaced by some other values entirely such that the FMA code still produces the correct answer, but I cannot seem to figure out what they would be. If I could, it would be relatively straightforward to modify the precomputed twiddle factors for FMA butterflies, and we could of course also avoid the division at the start of the butterfly.
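To make the failure concrete, here is a minimal scalar sketch of the book's butterfly using std::fma. The names BookButterfly and butterfly_book are hypothetical and appear in neither the book nor my library, and the division is done inline only to mirror the listing above; real code would precompute it alongside the twiddles.

```cpp
#include <cmath>
#include <complex>

// Hypothetical result type for illustration: the two butterfly outputs.
struct BookButterfly {
    std::complex<double> out0;
    std::complex<double> out1;
};

// Scalar sketch of the 6-FMA butterfly from the listing above.
// The twiddle factor is w = cr1 + i*ci1, and the outputs are in0 +/- w*in1.
BookButterfly butterfly_book(std::complex<double> in0, std::complex<double> in1,
                             double cr1, double ci1) {
    // Mirrors "ci1 = ci1 / cr1". When cr1 == 0 this is not finite under
    // IEEE 754 arithmetic, which is exactly the problem described above.
    const double t = ci1 / cr1;

    const double u0 = in0.real();
    const double v0 = in0.imag();
    const double r = in1.real();
    const double s = in1.imag();

    const double u1 = std::fma(-s, t, r);  // r - s*t
    const double v1 = std::fma(r, t, s);   // r*t + s

    BookButterfly result;
    result.out0 = std::complex<double>(std::fma(u1, cr1, u0),    // u0 + u1*cr1
                                       std::fma(v1, cr1, v0));   // v0 + v1*cr1
    result.out1 = std::complex<double>(std::fma(-u1, cr1, u0),   // u0 - u1*cr1
                                       std::fma(-v1, cr1, v0));  // v0 - v1*cr1
    return result;
}
```

With a twiddle such as (0.6, 0.8) this agrees with in0 ± w*in1 up to rounding. With the twiddle (0, 1), t becomes infinite and the later multiplication by cr1 == 0 turns the outputs into NaN.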

The book seems to suggest that cr1 != 0 is always true. Unfortunately, that is not the case: cr1 is zero when the rotation angle is PI/2.

I don't think you can solve this by adjusting the twiddle factors. The only option I see there is to use some very small number instead of zero. It could work, but it's ugly, and it may cause inaccuracies in certain cases.

Possible solutions:

  • Split the loop into two, and handle the center case (where the division by zero happens) specially.
  • Instead of dividing by cr1, divide by ci1, and modify the formula accordingly. This variant still has a division by zero, but it happens at the first iteration of the loop. So instead of the center, you have to handle the first iteration specially (and only one loop is needed).
  • Use a different FMA formulation:

Notice that, writing u1 for the real part of the twiddled second input (u1 * cr1 in the book's notation):

zoutr(1) = u0 - u1 
         = u0 - u1 - (u0 + u1) + (u0 + u1) 
         = u0 - u1 - zoutr(0) + u0 + u1 
         = 2*u0 - zoutr(0)

So this operation can be done with 1 FMA.

And if you substitute u1 into the expression for zoutr(0):

zoutr(0) = u0 + u1
         = u0 + r*cr1 - s*ci1

This can be done with 2 FMAs.

Calculating zouti can be done in the same manner as zoutr. This way you need 6 FMA operations, which is the same number of operations as in the book.

(Note that this doesn't mean this variant will automatically run faster, as it has a different data dependency chain.)
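The formulation above can be sketched in scalar C++ roughly as follows. This is only an illustration under my own assumptions: the names FmaButterfly and butterfly_fma are hypothetical, the twiddle is passed as an unmodified (cr1, ci1) pair, and a real implementation would use the SIMD FMA intrinsics rather than std::fma.

```cpp
#include <cmath>
#include <complex>

// Hypothetical result type for illustration: the two butterfly outputs.
struct FmaButterfly {
    std::complex<double> out0;
    std::complex<double> out1;
};

// Division-free radix-2 butterfly using 6 FMAs.
// The twiddle factor is w = cr1 + i*ci1, and the outputs are in0 +/- w*in1.
FmaButterfly butterfly_fma(std::complex<double> in0, std::complex<double> in1,
                           double cr1, double ci1) {
    const double u0 = in0.real();
    const double v0 = in0.imag();
    const double r = in1.real();
    const double s = in1.imag();

    // zoutr(0) = u0 + r*cr1 - s*ci1, as two chained FMAs.
    const double outr0 = std::fma(-s, ci1, std::fma(r, cr1, u0));
    // zouti(0) = v0 + r*ci1 + s*cr1, as two chained FMAs.
    const double outi0 = std::fma(s, cr1, std::fma(r, ci1, v0));

    // zoutr(1) = 2*u0 - zoutr(0) and zouti(1) = 2*v0 - zouti(0), one FMA each.
    const double outr1 = std::fma(2.0, u0, -outr0);
    const double outi1 = std::fma(2.0, v0, -outi0);

    FmaButterfly result;
    result.out0 = std::complex<double>(outr0, outi0);
    result.out1 = std::complex<double>(outr1, outi1);
    return result;
}
```

Because nothing is divided, the twiddle (0, 1) needs no special handling here. The second output depends on the first, though, which is the longer dependency chain mentioned in the note above.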
