
What makes Apple's PowerPC memcpy so fast?

I've written several copy functions in search of a good memory strategy on PowerPC. Using the Altivec or fp registers with cache hints (dcb*) doubles the performance over a simple byte copy loop for large data. Initially pleased with that, I threw in a regular memcpy to see how it compared... 10x faster than my best! I have no intention of rewriting memcpy, but I do hope to learn from it and accelerate several simple image filters that spend most of their time moving pixels to and from memory.

Shark analysis reveals that their inner loop uses dcbt to prefetch, with 4 vector reads, then 4 vector writes. After tweaking my best function to also haul 64 bytes per iteration, the performance advantage of memcpy is still embarrassing. I'm using dcbz to free up bandwidth, Apple uses nothing, but both codes tend to hesitate on stores.

prefetch
  dcbt future
  dcbt distant future
load stuff
  lvx image
  lvx image + 16
  lvx image + 32
  lvx image + 48
  image += 64
prepare to store
  dcbz filtered
  dcbz filtered + 32
store stuff
  stvxl filtered
  stvxl filtered + 16
  stvxl filtered + 32
  stvxl filtered + 48
  filtered += 64
repeat

Does anyone have some ideas on why very similar code has such a dramatic performance gap? I'd love to marinate the real image filters in whatever secret sauce memcpy is using!

Additional info: All data is vector aligned. I'm making filtered copies of the image, not replacing the original. The code runs on PowerPC G4, G5, and Cell PPU. The Cell SPU version is already insanely fast.

Shark analysis reveals that their inner loop uses dcbt to prefetch, with 4 vector reads, then 4 vector writes. After tweaking my best function to also haul 64 bytes per iteration

I may be stating the obvious, but since you don't mention the following at all in your question, it may be worth pointing it out:

I would bet that Apple's choice of 4 vector reads followed by 4 vector writes has as much to do with the G5's pipeline and its management of out-of-order instruction execution in "dispatch groups" as it has with a magical 64-byte perfect line size. Did you notice the line skips in Nick Bastin's linked bcopy.s? These mean that the developer thought about how the instruction stream would be consumed by the G5. If you want to reproduce the same performance, it's not enough to read data 64 bytes at a time; you must also make sure your instruction groups are well filled (basically, I remember that instructions can be grouped by up to five independent ones, with the first four being non-jump instructions and only the fifth being allowed to be a jump. The details are more complicated).

EDIT: you may also be interested by the following paragraph on the same page:

The dcbz instruction still zeros aligned 32 byte segments of memory as per the G4 and G3. However, since that is not a full cacheline on a G5 it will not have the performance benefits that you were likely hoping for. There is a dcbzl instruction newly introduced for the G5 that zeros a full 128-byte cacheline.

I don't know exactly what you're doing, since I can't see your code, but Apple's secret sauce is right there.

Maybe it's because of CPU caching. Try running Cachegrind:

Cachegrind is a cache profiler. It performs detailed simulation of the I1, D1 and L2 caches in your CPU and so can accurately pinpoint the sources of cache misses in your code. It identifies the number of cache misses, memory references and instructions executed for each line of source code, with per-function, per-module and whole-program summaries. It is useful with programs written in any language. Cachegrind runs programs about 20--100x slower than normal.
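A typical invocation looks like this (the binary and file names are illustrative; Valgrind support for your platform is assumed):

```shell
# Run the filter under the cachegrind tool; writes cachegrind.out.<pid>
valgrind --tool=cachegrind ./imagefilter input.raw output.raw

# Annotate the results with per-function and per-line miss counts
cg_annotate cachegrind.out.<pid>
```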

Still not an answer, but did you verify that memcpy is actually moving the data? Maybe it was just remapped as copy-on-write. You would still see the inner memcpy loop in Shark, as parts of the first and last pages are truly copied.

As mentioned in another answer, "dcbz", as defined by Apple on the G5, only operates on 32 bytes, so you will lose performance with this instruction on a G5, which has 128-byte cachelines. You need to use "dcbzl" to prevent the destination cacheline from being fetched from memory (and effectively reducing your useful read memory bandwidth by half).
