
Prefetch instructions on ARM

Newer ARM processors include the PLD and PLI instructions.

I'm writing tight inner loops (in C++) which have a non-sequential memory access pattern, but a pattern that my code naturally understands in full. I would anticipate a substantial speedup if I could prefetch the next location while processing the current one, and trying this out seems quick enough to be worth the experiment!

I'm using new, expensive compilers from ARM, and they don't seem to emit PLD instructions anywhere, let alone in this particular loop that I care about.

How can I include explicit prefetch instructions in my C++ code?

There should be a compiler-specific feature for this; there is no standard way to do it in C/C++. Check your compiler's Compiler Reference Guide. For the RealView compiler, see its documentation on compiler intrinsics.

If you are trying to extract truly maximum performance from these loops, then I would recommend writing the entire looping construct in assembler. Depending on the data structures involved in your loop, you should be able to use inline assembly. Even better if you can unroll any piece of the loop (such as the parts that make the access non-sequential).
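For the inline-assembly route, the prefetch hint can be issued directly; this sketch uses GCC/Clang asm syntax and is guarded so the same code still builds on non-ARM hosts (the helper names are illustrative):

```cpp
// Issue a prefetch hint for `addr`. PLD is only a hint and does not fault,
// so prefetching slightly past the end of a buffer is harmless.
static inline void prefetch_line(const void* addr) {
#if defined(__arm__)
    asm volatile("pld [%0]" : : "r"(addr));             // A32/T32 PLD
#elif defined(__aarch64__)
    asm volatile("prfm pldl1keep, [%0]" : : "r"(addr)); // A64 equivalent
#else
    __builtin_prefetch(addr, 0, 3);                     // portable fallback
#endif
}

// Example use: walk an array, hinting at the next element each iteration.
long checked_sum(const int* a, int n) {
    long s = 0;
    for (int i = 0; i < n; ++i) {
        prefetch_line(a + i + 1);
        s += a[i];
    }
    return s;
}
```

Prefetching only one element ahead is rarely enough in practice; the effective prefetch distance should cover the memory latency, which you would tune experimentally for your processor.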

At the risk of asking the obvious: have you verified the compiler's target architecture? For example (humor me), if the compiler defaults to targeting ARM7, you're never going to see the PLD instruction.

It is not outside the realm of possibility that other optimizations like software pipelining and loop unrolling may achieve the same effect as your prefetching idea (hiding the latency of the loads by overlapping it with useful computation), but without the extra instruction-cache pressure caused by the extra instructions. I would even go so far as to say that this is the case more often than not for tight inner loops, which tend to have few instructions and little control flow. Is your compiler doing these types of traditional optimizations instead? If so, it may be worth looking at the pipeline diagram to develop a more detailed cost model of how your processor works, and to evaluate more quantitatively whether prefetching would help.
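To illustrate the unrolling idea: keeping several independent loads in flight at once lets the hardware overlap their latencies without any explicit prefetch hints. A sketch of a 4-way unrolled indexed gather (the names and data layout are illustrative, not from the original question):

```cpp
// 4-way unrolled gather-sum: the four loads per iteration have no
// dependencies on each other, so they can all be outstanding at once.
long gather_sum(const int* data, const int* idx, int n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[idx[i]];
        s1 += data[idx[i + 1]];
        s2 += data[idx[i + 2]];
        s3 += data[idx[i + 3]];
    }
    for (; i < n; ++i) // handle the leftover tail
        s0 += data[idx[i]];
    return s0 + s1 + s2 + s3;
}
```

The separate accumulators also break the serial dependency chain on the additions, which is often a bigger win than the prefetch itself on out-of-order cores.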
