Modulo operator slower than manual implementation?
I have found that manually calculating the % operator on __int128 is significantly faster than the built-in compiler operator. I will show you how to calculate modulo 9, but the method can be used to calculate modulo any other number.

First, consider the built-in compiler operator:
uint64_t mod9_v1(unsigned __int128 n)
{
    return n % 9;
}
Now consider my manual implementation:
uint64_t mod9_v2(unsigned __int128 n)
{
    uint64_t r = 0;
    r += (uint32_t)(n);                      /* 2^0  % 9 == 1 */
    r += (uint32_t)(n >> 32) * (uint64_t)4;  /* 2^32 % 9 == 4 */
    r += (uint32_t)(n >> 64) * (uint64_t)7;  /* 2^64 % 9 == 7 */
    r += (uint32_t)(n >> 96);                /* 2^96 % 9 == 1 */
    return r % 9;
}
Measuring over 100,000,000 random numbers gives the following results:
mod9_v1 | 3.986052 secs
mod9_v2 | 1.814339 secs
GCC 9.3.0 with -march=native -O3
was used on an AMD Ryzen Threadripper 2990WX. Here is a link to godbolt.
I would like to ask if it behaves the same way on your side? (Before reporting a bug to GCC Bugzilla.)
UPDATE: On request, here is the generated assembly:
mod9_v1:
        sub     rsp, 8
        mov     edx, 9
        xor     ecx, ecx
        call    __umodti3
        add     rsp, 8
        ret
mod9_v2:
        mov     rax, rdi
        shrd    rax, rsi, 32
        mov     rdx, rsi
        mov     r8d, eax
        shr     rdx, 32
        mov     eax, edi
        add     rax, rdx
        lea     rax, [rax+r8*4]
        mov     esi, esi
        lea     rcx, [rax+rsi*8]
        sub     rcx, rsi
        mov     rax, rcx
        movabs  rdx, -2049638230412172401
        mul     rdx
        mov     rax, rdx
        shr     rax, 3
        and     rdx, -8
        add     rdx, rax
        mov     rax, rcx
        sub     rax, rdx
        ret
The reason for this difference is clear from the assembly listings: the % operator applied to 128-bit integers is implemented via a library call (__umodti3) to a generic function that cannot take advantage of compile-time knowledge of the divisor value, which is what makes it possible to turn division and modulo operations into much faster multiplications.
The timing difference is even more significant on my old MacBook Pro using clang, where mod9_v2() is about 15 times faster than mod9_v1().
Note however these remarks:
- You should measure the CPU time after the end of the for loop, not after the first printf as currently coded.
- rand_u128() only produces 124 bits, assuming RAND_MAX is 0x7fffffff.
- Using your slicing approach, I extended your code to reduce the number of steps, using slices of 42, 42 and 44 bits, which further improves the timings (because 2^42 % 9 == 1):
#pragma GCC diagnostic ignored "-Wpedantic"
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <time.h>
static uint64_t mod9_v1(unsigned __int128 n) {
    return n % 9;
}

static uint64_t mod9_v2(unsigned __int128 n) {
    uint64_t r = 0;
    r += (uint32_t)(n);
    r += (uint32_t)(n >> 32) * (uint64_t)(((uint64_t)1ULL << 32) % 9);
    r += (uint32_t)(n >> 64) * (uint64_t)(((unsigned __int128)1 << 64) % 9);
    r += (uint32_t)(n >> 96);
    return r % 9;
}

static uint64_t mod9_v3(unsigned __int128 n) {
    return (((uint64_t)(n >> 0) & 0x3ffffffffff) +
            ((uint64_t)(n >> 42) & 0x3ffffffffff) +
            ((uint64_t)(n >> 84))) % 9;
}

unsigned __int128 rand_u128() {
    return ((unsigned __int128)rand() << 97 ^
            (unsigned __int128)rand() << 66 ^
            (unsigned __int128)rand() << 35 ^
            (unsigned __int128)rand() << 4 ^
            (unsigned __int128)rand());
}

#define N 100000000

int main() {
    srand(42);
    unsigned __int128 *arr = malloc(sizeof(unsigned __int128) * N);
    if (arr == NULL) {
        return 1;
    }
    for (size_t n = 0; n < N; ++n) {
        arr[n] = rand_u128();
    }
#if 1
    /* check that modulo 9 is calculated correctly */
    for (size_t n = 0; n < N; ++n) {
        uint64_t m = mod9_v1(arr[n]);
        assert(m == mod9_v2(arr[n]));
        assert(m == mod9_v3(arr[n]));
    }
#endif
    clock_t clk1 = -clock();
    uint64_t sum1 = 0;
    for (size_t n = 0; n < N; ++n) {
        sum1 += mod9_v1(arr[n]);
    }
    clk1 += clock();
    clock_t clk2 = -clock();
    uint64_t sum2 = 0;
    for (size_t n = 0; n < N; ++n) {
        sum2 += mod9_v2(arr[n]);
    }
    clk2 += clock();
    clock_t clk3 = -clock();
    uint64_t sum3 = 0;
    for (size_t n = 0; n < N; ++n) {
        sum3 += mod9_v3(arr[n]);
    }
    clk3 += clock();
    printf("mod9_v1: sum=%"PRIu64", elapsed time: %.3f secs\n", sum1, clk1 / (double)CLOCKS_PER_SEC);
    printf("mod9_v2: sum=%"PRIu64", elapsed time: %.3f secs\n", sum2, clk2 / (double)CLOCKS_PER_SEC);
    printf("mod9_v3: sum=%"PRIu64", elapsed time: %.3f secs\n", sum3, clk3 / (double)CLOCKS_PER_SEC);
    free(arr);
    return 0;
}
Here are the timings on my Linux server (gcc):
mod9_v1: sum=400041273, elapsed time: 7.992 secs
mod9_v2: sum=400041273, elapsed time: 1.295 secs
mod9_v3: sum=400041273, elapsed time: 1.131 secs
The same code on my MacBook (clang):
mod9_v1: sum=399978071, elapsed time: 32.900 secs
mod9_v2: sum=399978071, elapsed time: 0.204 secs
mod9_v3: sum=399978071, elapsed time: 0.185 secs
In the meantime (while waiting for Bugzilla), you could let the preprocessor do the optimization for you. E.g. define a macro called MOD_INT128(n,d):
#define MODCALC0(n,d) ((65536*n)%d)
#define MODCALC1(n,d) MODCALC0(MODCALC0(n,d),d)
#define MODCALC2(n,d) MODCALC1(MODCALC1(n,d),d)
#define MODCALC3(n,d) MODCALC2(MODCALC1(n,d),d)
#define MODPARAM(n,d,a,b,c) \
((uint64_t)((uint32_t)(n) ) + \
(uint64_t)((uint32_t)(n >> 32) * (uint64_t)a) + \
(uint64_t)((uint32_t)(n >> 64) * (uint64_t)b) + \
(uint64_t)((uint32_t)(n >> 96) * (uint64_t)c) ) % d
#define MOD_INT128(n,d) MODPARAM(n,d,MODCALC1(1,d),MODCALC2(1,d),MODCALC3(1,d))
Now,

uint64_t mod9_v3(unsigned __int128 n)
{
    return MOD_INT128( n, 9 );
}
will generate assembly similar to that of the mod9_v2() function, and

uint64_t mod8_v3(unsigned __int128 n)
{
    return MOD_INT128( n, 8 );
}

works fine with the already existing optimization (GCC 10.2.0).