Calculating fast modulo arithmetic on a CUDA GPU with 4000 as the divisor. eq: (a-b)%4000

Question

I'm trying to optimize modulo arithmetic in CUDA on the Pascal architecture (NVIDIA 1060), since the conventional (%) operator significantly slows down the code. I have seen some examples of optimization, but they apply only if the divisor is a power of 2 or of the form (2^k)-1. In my code, the divisor is 4000.

Kindly suggest an optimized approach to calculate the remainder in the equation below:

  remainder = (a-b)%4000

Answer 1

I am assuming you can demonstrate how slow it is, say compared to modulo 4096, both with compiler optimisation and using bitmasks? If it is only 2 or 3 times slower, you really can't beat it.

For fun, because I doubt you will beat the above metric:

Division is generally not that slow on modern processors, but one thing to be aware of is that, when it was slow, the cost depended on the size of the number being divided. Another is that unsigned divide was faster than signed divide.
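
Since the question's operand is a difference (a-b) that may be negative, here is a minimal sketch of routing it through an unsigned remainder. The function name rem4000_signed is made up for illustration, and the sketch assumes a - b does not overflow a 32-bit int. It is plain C++; inside a CUDA kernel the function would additionally be marked `__device__`. Whether it is actually faster than the signed operator needs measuring.

```cpp
#include <cstdint>

// Hypothetical helper: remainder of the signed difference a - b modulo 4000,
// computed with an unsigned '%'. The result follows the C/C++ convention for
// signed '%': it is zero or has the sign of the difference.
// Assumption: a - b does not overflow int32_t.
inline int32_t rem4000_signed(int32_t a, int32_t b) {
    int32_t d = a - b;
    // Magnitude of d as an unsigned value; 0u - x is well defined for unsigned.
    uint32_t mag = d < 0 ? 0u - static_cast<uint32_t>(d)
                         : static_cast<uint32_t>(d);
    int32_t r = static_cast<int32_t>(mag % 4000u);  // unsigned remainder, 0..3999
    return d < 0 ? -r : r;
}
```

If a result in the range 0..3999 is wanted instead, adding 4000 to a negative non-zero result would give it.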

One way to reduce the size of the number is to consider how a modulo is built up.

If you perform div and mod by 4096, you can then ask what 4096 mod 4000 is: it is 96. So the mod 4000 of your original number is (96 * div4096 + mod4096) mod 4000, where these are smaller numbers than you started with and might, just maybe, be faster to divide because they use fewer bits. Note that at this stage you can also use the relationship 4000 = 32 * 125: the bottom 5 bits of the number are the bottom 5 bits of the modulo, and you only need to divide the remaining upper bits by 125.
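
Both identities can be sketched in a few lines. The names fold4096 and mod4000_split are invented for this illustration and the inputs are assumed to be unsigned 32-bit values. This is plain C++ that should also work as CUDA device code once the functions are marked `__device__`.

```cpp
#include <cstdint>

// x = 4096*q + r and 4096 mod 4000 = 96, so x is congruent to 96*q + r
// (mod 4000). Returns that smaller congruent value (not yet the remainder);
// for a 32-bit input it is below 2^27.
inline uint32_t fold4096(uint32_t x) {
    uint32_t q = x >> 12;     // x div 4096
    uint32_t r = x & 4095u;   // x mod 4096
    return 96u * q + r;
}

// 4000 = 32 * 125: the low 5 bits of x pass straight through to the
// remainder, and only the upper bits (x >> 5) need reducing, modulo 125.
inline uint32_t mod4000_split(uint32_t x) {
    uint32_t low5 = x & 31u;         // bottom 5 bits of the remainder
    uint32_t hi = (x >> 5) % 125u;   // the only real division, by 125
    return (hi << 5) | low5;         // 32*hi + low5, always below 4000
}
```

The two can be combined, for example mod4000_split(fold4096(x)), so that the division by 125 sees a 22-bit operand instead of a 27-bit one.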

Now, on an 8-bit processor, dividing by a number less than 128 can be significantly faster than dividing by a bigger number! I doubt you have one of those, though.

Another option is to use a high-precision inverse multiply. Processors that have a poor divide may have an acceptable multiply. The trick is that you use the biggest integers you can to multiply by the constant 2^n/4000, where n is half the width of the large integer type, or can be higher if the largest number you need to divide is less than 2^n. The top part of that product (>>n) is the approximate result of the division and, if the resolution is high enough, should be "close enough". Multiply that value by 4000 again and subtract it from your original, and you have your modulo, +/- a few times 4000, for the cost of 2 big multiplies vs 1 smaller divide. On Intel there is a multiply that takes two 16-bit values (such as ax*dx) and outputs the 32-bit result in dx:ax, and it is replicated at wider widths (32-bit inputs giving a 64-bit result in edx:eax, and 64-bit inputs giving a 128-bit result in rdx:rax on x86-64), but of course the Intel 386 and later have a fast-enough divide anyway.
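
A minimal sketch of the inverse multiply for 32-bit unsigned inputs, taking n = 32. The names kInv4000 and mod4000_recip are made up here. Because the constant is rounded down, the quotient estimate is either exact or one too small, so a single conditional subtraction finishes the job. In CUDA device code the upper half of a 32x32-bit product is also available directly through the `__umulhi` intrinsic. Compilers commonly apply a similar transformation themselves when the divisor is a compile-time constant, so it is worth measuring before adopting this by hand.

```cpp
#include <cstdint>

// Hypothetical constant: floor(2^32 / 4000) = 1073741.
constexpr uint64_t kInv4000 = (1ull << 32) / 4000u;

inline uint32_t mod4000_recip(uint32_t x) {
    // Upper 32 bits of the 64-bit product approximate x / 4000.
    // Since kInv4000 is rounded down, q is the true quotient or one less.
    uint32_t q = static_cast<uint32_t>((x * kInv4000) >> 32);
    uint32_t r = x - q * 4000u;      // lies in [0, 8000)
    if (r >= 4000u) r -= 4000u;      // correct the possible off-by-one
    return r;
}
```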

And yet another generic approach applies when the divisor you want is close to a power of 2; in your case 4000 is about 97.7% of 4096:

loop:
  do the div4096 by bit shift
  multiply 4000 by div4096
  subtract
until result < 3*4096 
use if statement to get final mod value

This performs repeated multiplies, but each time div4096 is a low estimate of div4000, by about 2.3% (96/4096, roughly 1 part in 43, or a little over 5 bits), and the shortfall gets cleaned up by the next iteration, so it will go round this loop perhaps 10 times for a maxed-out 64-bit value. If mul is more than 10 times faster than div, then you win. If the divisor is more than a couple of percent off a power of 2, then the iteration count gets too high.
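
The loop above could be written out roughly as follows. This is a sketch, with mod4000_iter a made-up name and a 64-bit unsigned input assumed; it is plain C++ and would be marked `__device__` for use in a CUDA kernel.

```cpp
#include <cstdint>

inline uint64_t mod4000_iter(uint64_t x) {
    // Each pass subtracts 4000 * (x div 4096). Because x >> 12 never exceeds
    // x / 4000, the subtraction cannot underflow, and what is left over is
    // 96 * (x >> 12) + (x & 4095), i.e. roughly x / 43.
    while (x >= 3u * 4096u) {
        uint64_t q = x >> 12;   // div4096 by bit shift
        x -= 4000u * q;         // multiply 4000 by div4096, then subtract
    }
    // Now x < 12288, so two conditional subtractions give the final value.
    if (x >= 8000u) x -= 8000u;
    if (x >= 4000u) x -= 4000u;
    return x;
}
```

For a 32-bit input the loop body runs at most about four times, which is the case most relevant to typical CUDA integer code.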

Solution 1
1 2019-07-25 10:24:08
