64bit/32bit division faster algorithm for ARM / NEON?

I am working on code that performs a 64-bit by 32-bit fixed-point division in two places, with the result taken in 32 bits. Together, these two places account for more than 20% of my total run time, so I think removing the 64-bit division would let me optimize the code well. NEON has some 64-bit instructions. Can anyone suggest a routine that resolves this bottleneck with a faster implementation?

Alternatively, expressing the 64-bit/32-bit division in terms of 32-bit/32-bit divisions in C would also be fine.

If anyone has an idea, could you please help me out?

I did a lot of fixed-point arithmetic in the past and spent a lot of time looking for fast 64/32-bit divisions myself. If you google for 'ARM division' you will find tons of great links and discussion about this issue.

The best solution for the ARM architecture, where even a 32-bit division may not be available in hardware, is here:

http://www.peter-teichmann.de/adiv2e.html

This assembly code is very old, and your assembler may not understand its syntax. It is, however, worth porting the code to your toolchain. It is the fastest division code for your special case that I've seen so far, and trust me: I've benchmarked them all :-)

Last time I did that (about 5 years ago, for Cortex-A8), this code was about 10 times faster than what the compiler generated.

This code doesn't use NEON. A NEON port would be interesting, though I'm not sure it would improve the performance much.

Edit:

I found the code ported to GAS (GNU toolchain). This code is working and tested:

Divide.S

.section ".text"

.global udiv64

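@ In:  r0 = dividend low word, r1 = dividend high word, r2 = divisor
@ Out: r0 = 32-bit quotient (requires r1 < r2 on entry)
udiv64:
    adds      r0,r0,r0          @ shift low word left, top bit goes to carry
    adc       r1,r1,r1          @ shift high word left, pulling in that bit

    .rept 31
        cmp     r1,r2           @ carry set if partial remainder >= divisor
        subcs   r1,r1,r2        @ if so, subtract the divisor
        adcs    r0,r0,r0        @ shift quotient bit in, next dividend bit out
        adc     r1,r1,r1        @ shift that dividend bit into the remainder
    .endr

    cmp     r1,r2
    subcs   r1,r1,r2
    adcs    r0,r0,r0            @ last quotient bit, result is in r0

    bx      lr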
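
For readers who do not read ARM assembly, the routine above is a plain shift-and-subtract (restoring) division, unrolled 32 times. A minimal C sketch of the same idea follows. The name udiv64_ref is hypothetical and not part of the original code, and the model keeps the partial remainder in 64 bits for simplicity, whereas the assembly keeps it in a 32-bit register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical portable model of udiv64 (illustration only):
   divides the 64-bit value hi:lo by d and returns the 32-bit quotient.
   Assumes hi < d, so that the quotient fits in 32 bits. */
uint32_t udiv64_ref(uint32_t lo, uint32_t hi, uint32_t d)
{
    assert(d != 0 && hi < d);

    uint64_t rem = hi;  /* partial remainder, always < d between steps */
    uint32_t quo = 0;

    for (int i = 31; i >= 0; --i) {
        /* shift the next dividend bit into the remainder */
        rem = (rem << 1) | ((lo >> i) & 1u);
        quo <<= 1;
        if (rem >= d) {   /* same test as cmp / subcs in the assembly */
            rem -= d;
            quo |= 1u;    /* this quotient bit is 1 */
        }
    }
    return quo;
}
```

For example, udiv64_ref(0, 1, 2) divides 2^32 by 2 and returns 0x80000000. A model like this is mainly useful for checking a ported assembly routine against known results on a host machine.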

C code:

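#include <stdint.h>

#ifdef __cplusplus
extern "C"
#endif
uint32_t udiv64 (uint32_t lo, uint32_t hi, uint32_t divisor);

int32_t fixdiv24 (int32_t a, int32_t b)
/* calculate (a<<24)/b with 64 bit intermediate result;
   b must be non-zero and the quotient must fit in an int32_t */
{
  int32_t q;
  int sign = (a^b) < 0; /* different signs */
  uint32_t ua,ub,l,h;
  ua = a<0 ? 0u-(uint32_t)a : (uint32_t)a; /* magnitude, no signed overflow */
  ub = b<0 ? 0u-(uint32_t)b : (uint32_t)b;
  l = (ua << 24); /* low 32 bits of ua<<24 */
  h = (ua >> 8);  /* high 32 bits of ua<<24 */
  q = (int32_t) udiv64 (l,h,ub);
  if (sign) q = -q;
  return q;
}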
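
For comparison, here is a sketch of the straightforward version that leaves the 64-bit division to the compiler. On 32-bit ARM targets without a suitable hardware divide, GCC typically turns such a division into a call to a runtime helper (for example __aeabi_ldivmod), which is the kind of code the hand-written routine is meant to beat. The name fixdiv24_ref is hypothetical and only serves as a baseline for testing and benchmarking.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical baseline (illustration only): same result as fixdiv24,
   computed with the compiler's own 64-bit division.
   Assumes b != 0 and that the quotient fits in an int32_t. */
int32_t fixdiv24_ref(int32_t a, int32_t b)
{
    assert(b != 0);

    /* widen first, then scale by 2^24; multiplying avoids
       left-shifting a negative value */
    int64_t num = (int64_t)a * (1 << 24);

    /* C division truncates toward zero, matching the
       sign-and-magnitude approach used by fixdiv24 */
    return (int32_t)(num / b);
}
```

With 8.24 fixed-point inputs, fixdiv24_ref(3 << 24, 2 << 24) returns 3 << 23, which is 1.5 in the same format. Comparing this function against fixdiv24 over a range of inputs is one way to validate the assembly port before measuring the speed-up.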
