
Is it really efficient to use Karatsuba algorithm in 64-bit x 64-bit multiplication?

I work with AVX2 and need to calculate a 64-bit x 64-bit -> 128-bit widening multiplication and get the 64-bit high part in the fastest manner. Since AVX2 has no such instruction, is it reasonable to use the Karatsuba algorithm for efficiency and speed?

No. On modern architectures the crossover at which Karatsuba beats schoolbook multiplication is usually somewhere between 8 and 24 machine words (e.g. between 512 and 1536 bits on x86_64). For fixed sizes the threshold is at the smaller end of that range, and the newer ADCX/ADOX instructions likely lower it somewhat for scalar code, but 64x64 is still far too small to benefit from Karatsuba.

It's highly unlikely that AVX2 will beat the mulx instruction, which does 64b x 64b -> 128b in a single instruction. The one exception I'm aware of is large multiplications using floating-point FFT.

However, if you don't need exactly 64b x 64b -> 128b, you could consider 53b x 53b -> 106b using double-double arithmetic.

Multiplying four pairs of 53-bit numbers a and b to get four 106-bit products takes only two instructions:

__m256d p = _mm256_mul_pd(a, b);       // rounded products
__m256d e = _mm256_fmsub_pd(a, b, p);  // exact rounding errors: a*b - p

This gives four 106-bit products in two instructions, compared to one 128-bit product in one instruction using mulx.

It's hard to tell without trying, but it might be faster to just use the AMD64 MUL instruction, which supports 64x64 = 128 with the same throughput as most AVX2 instructions (but is not vectorized). The drawback is that you need to move the operands to regular registers if they were in YMM registers, which gives something like LOAD + MUL + STORE for a single 64x64 = 128.

If you can vectorize Karatsuba in AVX2, try both AVX2 and MUL and see which is faster. If you can't vectorize, a single MUL will probably be faster. If you can avoid the load and store to regular registers, a single MUL will definitely be faster.

Both MUL and AVX2 instructions can take a memory operand with the same throughput, which may save one load for MUL.


 