
Why are local variable length for-loops faster? Doesn't branch prediction reduce the effect of lookup times?

A while back, I was reading up on some Android performance tips when I came across this:

Foo[] mArray = ...

public void zero() {
    int sum = 0;
    for (int i = 0; i < mArray.length; ++i) {
        sum += mArray[i].mSplat;
    }
}

public void one() {
    int sum = 0;
    Foo[] localArray = mArray;
    int len = localArray.length;

    for (int i = 0; i < len; ++i) {
        sum += localArray[i].mSplat;
    }
}

Google says:

zero() is slowest, because the JIT can't yet optimize away the cost of getting the array length once for every iteration through the loop.

one() is faster. It pulls everything out into local variables, avoiding the lookups. Only the array length offers a performance benefit.

Which made total sense. But after thinking way too much about my computer architecture exam, I remembered Branch Predictors:

a branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure. The purpose of the branch predictor is to improve the flow in the instruction pipeline.

Isn't the computer assuming i < mArray.length is true and thus computing the loop condition and the body of the loop in parallel (only mispredicting the branch on the last iteration), effectively removing any performance losses?

I was also thinking about Speculative Execution:

Speculative execution is an optimization technique where a computer system performs some task that may not be actually needed... The objective is to provide more concurrency...

In this case, the computer would be executing the code both as if the loop had finished and as if it were still running, concurrently, once again effectively nullifying any computational cost associated with the condition (since the computer is already doing the future iterations' work while it evaluates the condition)?

Essentially, what I'm trying to get at is this: even if the condition in zero() takes a little longer to compute than in one(), the computer is usually going to compute the correct branch of code while it waits for the result of the conditional check anyway, so the performance loss from the lookup of mArray.length shouldn't matter (that's what I thought, anyway).

Is there something I'm not realizing here?


Sorry about the length of the question.

Thanks in advance.

The site you linked to notes:

zero() is slowest, because the JIT can't yet optimize away the cost of getting the array length once for every iteration through the loop.

I haven't tested on Android, but I'll assume that this is true for now. What this means is that on every iteration of the loop, the CPU has to execute code that loads the value of mArray.length from memory. The reason is that mArray is an instance field: the field could be made to refer to a different array (with a different length), so the compiler can't simply treat mArray.length as a constant.
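
To see why a compiler has to be careful here, consider a sketch like the following (the Container class and maybeShrink helper are hypothetical, made up for this answer). Because mArray is an instance field, anything reachable from the loop body could make it point at a different, shorter array, so the length has to be re-checked against whatever array the field refers to right now. In the question's exact zero() nothing in the body can reassign the field, so a sufficiently smart JIT could hoist the read; the docs quoted above simply say Android's JIT couldn't yet do that, and this kind of aliasing is one reason compilers are conservative about it:

class Container {
    static class Foo {          // same shape as the Foo in the question
        int mSplat;
    }

    Foo[] mArray = new Foo[1000];
    {
        for (int i = 0; i < mArray.length; i++) {
            mArray[i] = new Foo();
        }
    }

    int zeroWithCall() {
        int sum = 0;
        for (int i = 0; i < mArray.length; ++i) {   // must re-read the field and its length,
            sum += mArray[i].mSplat;                // because the loop body may have swapped
            maybeShrink(i);                         // the field to point at another array
        }
        return sum;
    }

    // Hypothetical helper: swaps in a smaller array partway through the loop.
    private void maybeShrink(int i) {
        if (i == 100) {
            mArray = java.util.Arrays.copyOf(mArray, mArray.length / 2);
        }
    }
}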

Whereas in the one() version, the programmer explicitly copies the length into the local variable len, based on the knowledge that the array won't change underneath the loop. Since len is a local variable, the compiler can keep it in a register rather than loading it from memory on every iteration. This reduces the number of instructions executed in the loop and makes the loop-condition check cheaper to evaluate.

You are right that branch prediction helps hide the overhead associated with the loop condition check: the loop branch is predicted correctly on almost every iteration, so the pipeline rarely stalls on it. But prediction only hides control-flow latency; the extra instructions still have to be fetched, decoded, and executed, and there is a limit to how far the processor can speculate ahead, so doing more work in each iteration still costs time. Also, many mobile processors have less sophisticated branch predictors and support less speculation than desktop CPUs.
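
To put rough, purely illustrative numbers on it (these are assumptions, not measurements): if the hoisted loop in one() does on the order of four operations per iteration (compare, load the element, add, increment), and zero() adds a field load plus an array-length read on top of that, you are doing roughly 50% more work per iteration. That extra work consumes fetch, decode, and execution slots no matter how accurately the branch is predicted, which is why it doesn't simply disappear behind speculation.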

My guess is that on a modern desktop processor, using an advanced Java JIT like HotSpot, you would not see a 3x performance difference. But I don't know for certain; it could be an interesting experiment to try.
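
If you want to run that experiment, here is a minimal, hand-rolled sketch (the class name and constants below are made up for this answer, and the methods return their sums so the JIT can't discard the work; a serious measurement should use a proper harness such as JMH or Caliper to deal with warm-up, dead-code elimination, and run-to-run variance):

public class LoopBench {
    static class Foo {
        int mSplat = 1;
    }

    Foo[] mArray = new Foo[10_000];

    LoopBench() {
        for (int i = 0; i < mArray.length; i++) {
            mArray[i] = new Foo();
        }
    }

    int zero() {
        int sum = 0;
        for (int i = 0; i < mArray.length; ++i) {
            sum += mArray[i].mSplat;
        }
        return sum;   // returning the sum keeps the loop from being optimized away
    }

    int one() {
        int sum = 0;
        Foo[] localArray = mArray;
        int len = localArray.length;
        for (int i = 0; i < len; ++i) {
            sum += localArray[i].mSplat;
        }
        return sum;
    }

    public static void main(String[] args) {
        LoopBench bench = new LoopBench();
        long sink = 0;

        // Warm-up so the JIT has a chance to compile both methods before timing.
        for (int i = 0; i < 50_000; i++) {
            sink += bench.zero() + bench.one();
        }

        long t0 = System.nanoTime();
        for (int i = 0; i < 50_000; i++) sink += bench.zero();
        long t1 = System.nanoTime();
        for (int i = 0; i < 50_000; i++) sink += bench.one();
        long t2 = System.nanoTime();

        System.out.printf("zero(): %d ms, one(): %d ms (sink=%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sink);
    }
}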
