
Why is the modulo operator necessary?

I've read in a document that you can replace the mod operation with a bitwise AND, like this:

Instead of:

int Limit = Value % Range;

You do:

int Limit = Value & (Range-1);

But compilers still generate mod instructions, so my question is basically: why don't compilers use the more efficient approach if the two work the same?

Um, no... that only works when Range is a power of two.

For all other values, you still need the modulus % operator.
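To make the power-of-two restriction concrete, here is a minimal sketch. The helper name `mask_mod` is made up for illustration and is not from the question; it simply applies the `Value & (Range-1)` trick to unsigned operands.

```c
/* Hypothetical helper: applies the masking trick from the question.
 * The result equals value % range only when range is a power of two. */
unsigned mask_mod(unsigned value, unsigned range)
{
    return value & (range - 1);
}
```

With a range of 8 the result matches `25 % 8`, but with a range of 7 the mask is 6 and the result no longer matches `25 % 7`.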

There are also some subtle (possibly implementation-defined) differences when working with negative numbers.
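A small sketch of that difference, assuming a C99-or-later compiler (where % truncates toward zero) and a two's complement representation of int. Both function names are hypothetical and exist only for this illustration.

```c
/* Remainder as defined by C99 and later: the sign follows the dividend. */
int rem_signed(int value, int range)
{
    return value % range;
}

/* Masking keeps only the low bits, so on a two's complement machine
 * (an assumption here) the result is never negative. */
int mask_signed(int value, int range)
{
    return value & (range - 1);
}
```

Under those assumptions, `rem_signed(-7, 4)` gives -3 while `mask_signed(-7, 4)` gives 1, so the two forms disagree for negative values even when the range is a power of two.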


As a side note: using the % operator is probably more readable too.

You can replace modulo with that only if Range is a power of 2. In general, elementary math lets you compute the same result without a modulo operator:

a = b % c;

can be done with

x = b / c;
a = b - (x*c);

Let's check this with an example:

25 % 7 = 
25 / 7 = 3 (integer math)
25 - (3 * 7) =
25 - 21 = 4

Which is how I have to do it on my calculator anyway, as I don't have a modulo operator.
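The divide, multiply, and subtract steps from the worked example can be sketched as a C function. The name `mod_by_division` is made up for illustration, and the sketch assumes a non-zero divisor.

```c
/* Computes b % c without using the % operator (c must be non-zero). */
int mod_by_division(int b, int c)
{
    int x = b / c;      /* integer quotient, truncated toward zero */
    return b - x * c;   /* remove x whole multiples of c from b */
}
```

For the numbers above, `mod_by_division(25, 7)` computes 25 / 7 = 3 and then 25 - 21 = 4.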

Note that

25 & (7-1) = 
0x19 & 0x6 = 0x0

So your substitution does not work.

Not only do most processors not have a modulo instruction, many do not have a divide. Check out the book Hacker's Delight.

WHY would you want modulo in hardware? If you have already spent the hardware to make a divider, you might be willing to go the extra mile and add modulo as well. Most processors take your question to the next level: why would you implement a divide in hardware when it can be done in software? The answer to your question is that most processor families do not have a modulo, and many do not have a divide, because it is not worth the chip real estate, power consumed, etc. compared to the software solution. The software solution is less painful/costly/risky.

Now, I assume your question is not really what the winning answer addressed: it is about the cases where Range is a power of two and the identity does work. First off, if Range is not known at compile time, then you have to do a subtract and an AND: two operations, and maybe an intermediate variable, instead of a single modulo. The compiler also cannot know that Range will be a power of two, so it would be in error to optimize to a subtract and AND instead of a modulo. If the range is a power of two and is known at compile time, your better/fancier compilers will optimize. There are times, especially with a variable word length instruction set where a smaller instruction may be preferred over a larger one, when it might be less painful to load Range and do a modulo than to load Range-1, which has the larger number of non-zero bits, and do the AND (values of Range that match your identity have a single bit set and the other bits zero: 0x100, 0x40, 0x8000, etc.). The load immediate plus modulo might be cheaper than the load immediate plus AND, or the modulo immediate might be cheaper than the AND immediate. You have to examine the instruction set and how the compiler has implemented the solution.
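As a sketch of the compile-time case, the functions below (hypothetical names) use a constant power-of-two range. An optimizing compiler will typically turn the unsigned modulo into a single AND, while the signed version needs extra fix-up code because the result can be negative. What your compiler actually emits depends on the target and the optimization level, so it is worth inspecting the generated assembly.

```c
/* Constant power-of-two range with an unsigned operand:
 * x % 8u and x & 7u always produce the same value. */
unsigned mod8_unsigned(unsigned x)
{
    return x % 8u;
}

unsigned and7_unsigned(unsigned x)
{
    return x & 7u;
}

/* Signed operand: the remainder may be negative (C99 and later),
 * so a plain AND is not an exact replacement. */
int mod8_signed(int x)
{
    return x % 8;
}
```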

I suggest you post some examples of where the compiler is not doing the optimization, and I assume we can post many examples of where the compiler has done the optimization you were expecting.

As others have stated, the range has to be a power of two (2^n), so that the mask Range-1 is 2^n-1, and even then, if it's done at run-time, you have problems.

On recent architectures (let's say, anything after the P4 era) the latency of integer division instructions is between 26 and 50 or so cycles in the worst case. A multiply, in comparison, can take 1-3 cycles and can often be executed in parallel much better.

The x86 DIV instruction returns the quotient in EAX and the remainder in EDX. The "remainder" is free (the modulus is the remainder).
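The same "quotient and remainder together" idea is exposed in standard C through `div()` from `<stdlib.h>`. The wrapper below is a hypothetical helper for illustration; whether the call compiles down to a single divide instruction depends on the compiler and target.

```c
#include <stdlib.h>

/* Returns the quotient and stores the remainder through rem,
 * using the standard library's div() to obtain both at once. */
int quotient_and_remainder(int numer, int denom, int *rem)
{
    div_t result = div(numer, denom);
    *rem = result.rem;
    return result.quot;
}
```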

If you implement something where the range is variable at run-time and you wish to use &, you have to:

a) check if the range is a power of two (2^n); if so, use your & code path, which means a branch, a possible cache miss, etc., adding huge latency potential
b) if it is not a power of two, use a DIV instruction
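The two steps above can be sketched as follows. The function name is made up for illustration, and the range is assumed to be non-zero. The point is that the check itself introduces a branch before either path runs.

```c
/* Run-time dispatch between the & path and the % path.
 * range is assumed to be non-zero. */
unsigned mod_with_runtime_check(unsigned value, unsigned range)
{
    if ((range & (range - 1)) == 0) {
        /* a) exactly one bit set: range is a power of two, mask works */
        return value & (range - 1);
    }
    /* b) any other range: fall back to the divide-based remainder */
    return value % range;
}
```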

Using a DIV instead of adding a branch into the equation (which has the potential to cost hundreds or even thousands of cycles in bad cases with poor cache eviction) makes DIV the obvious best choice. On top of that, if you are using & with a signed data type, conversions will be necessary (there is no & for mixed data types, but there is for DIVs). In addition, if the DIV is only used to branch on the modulus and the rest of the results aren't used, speculative execution can perform nicely; performance penalties are also further mitigated by multiple pipelines that can execute instructions in parallel.

You have to remember that in real code, a lot of your cache will be filled with the data you are working on, plus other code and data you will be working with soon or have just worked on. You really don't want to be evicting cache lines and waiting for them to be reloaded because of branch mispredictions. In most cases with modulo, you are not just writing i = 7; d = i % 4; you're in larger code that often calls a subroutine, which itself was reached through a (predicted and cached) subroutine call just before. In addition, you're probably doing it in a loop, which itself is also using branch prediction; nested branch predictions with loops are handled pretty well in modern microprocessors, but it just ends up being plain stupid to add to the predicting the processor is already trying to do.

So to summarize, using DIV makes more sense on modern processors for the general use case; it is not really an "optimization" for a compiler to generate the power-of-two check and the & path, because of cache considerations and other factors. If you really, really need to fine-tune that integer divide, and your whole program depends on it, you will end up hard-coding the divisor to a power of two (2^n) and writing the bitwise & logic yourself.
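A minimal sketch of that hand-written approach, assuming a ring buffer whose size is hard-coded to a power of two. `RING_SIZE` and `ring_next` are hypothetical names chosen for this example.

```c
/* Size is fixed at compile time and must stay a power of two
 * for the mask below to be a valid replacement for % RING_SIZE. */
enum { RING_SIZE = 256 };

/* Advances an index and wraps it with a bitwise AND instead of a modulo. */
unsigned ring_next(unsigned index)
{
    return (index + 1u) & (RING_SIZE - 1);
}
```

Because the size is a constant, there is no run-time check and no branch; the trade-off is that the size can no longer be an arbitrary value.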

Finally, this is a bit of a rant: a dedicated ALU for integer divides can really reduce the latency to around 6-8 cycles. It just takes up a relatively large die area, because the data path ends up being about 128 bits wide, and nobody has the real estate for it when integer DIVs work just fine as they are.
