
What is the maximum theoretical speed-up due to SSE for a simple binary subtraction?

I'm trying to figure out whether my code's inner loop is hitting a hardware design barrier or a barrier caused by my own lack of understanding. There's a bit more to it, but the simplest question I can come up with is as follows:

If I have the following code:

float px[32768],py[32768],pz[32768];
float xref, yref, zref, deltx, delty, deltz;
int i, j;

initialize_with_random(px);
initialize_with_random(py);
initialize_with_random(pz);

for(i=0;i<32768-1;i++) {
  xref=px[i];
  yref=py[i];
  zref=pz[i];
  for(j=0;j<32768-1;j++) {
    deltx=xref-px[j];
    delty=yref-py[j];
    deltz=zref-pz[j];
  } }

What maximum theoretical speed-up could I expect from moving to SSE instructions, given that I have complete control over the code (assembly, intrinsics, whatever) but no control over the runtime environment other than the architecture? (That is, it's a multi-user environment, so I can't do anything about how the OS kernel assigns time to my particular process.)

Right now I'm seeing a 3x speed-up with my code, whereas I would have thought SSE would give me much more vector depth than a 3x speed-up indicates (presumably the 3x speed-up tells me I have a 4x maximum theoretical throughput). (I've tried things such as making deltx/delty/deltz arrays in case the compiler wasn't smart enough to auto-promote them, but I still see only a 3x speed-up.) I'm using the Intel C compiler with the appropriate compiler flags for vectorization, but obviously no intrinsics.

It depends on the CPU, but the theoretical maximum won't get above 4x. I don't know of a CPU that can execute more than one SSE instruction per clock cycle, which means it can compute at most 4 single-precision values per cycle.

Most CPUs can do at least one scalar floating-point instruction per cycle, so in this case you'd see a theoretical maximum of a 4x speed-up.

But you'll have to look up the specific instruction throughput for the CPU you're running on.

A practical speed-up of 3x is pretty good, though.
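To make the 4x bound concrete: an SSE register is 128 bits wide and a float is 32 bits, so one packed subtract handles 4 floats. The following is a minimal sketch using SSE intrinsics, not the asker's actual code. The helper name `sub_ref_sse` and the output array `out` are assumptions added for illustration (the loop in the question discards its results), and `n` is assumed to be a multiple of 4.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_sub_ps, ... */

/* Hypothetical helper: out[j] = ref - p[j] for j in [0, n).
   Assumes n is a multiple of 4. Unaligned loads/stores are used so the
   sketch works for any float array, aligned or not. */
void sub_ref_sse(float ref, const float *p, float *out, size_t n)
{
    __m128 vref = _mm_set1_ps(ref);           /* broadcast ref into all 4 lanes */
    for (size_t j = 0; j < n; j += 4) {
        __m128 v = _mm_loadu_ps(p + j);       /* load p[j..j+3] */
        __m128 d = _mm_sub_ps(vref, v);       /* 4 subtractions in one instruction */
        _mm_storeu_ps(out + j, d);            /* store the 4 deltas */
    }
}
```

Called once per component (x, y, z) for each `i`, this issues a quarter as many subtract instructions as the scalar loop, which is where the 4x ceiling comes from. Loads, stores, and loop overhead are the usual reasons a measured speed-up lands below that.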

I think you'd probably have to interleave the inner loop somehow. The 3-component vector is getting done at once, but that's only 3 operations at once. To get to 4, you'd do 3 components from the first vector and 1 from the next, then 2 and 2, and so on. If you set up some kind of queue that loads and processes the data 4 components at a time and then separates it afterwards, that might work.

Edit: You could unroll the inner loop to do 4 vectors per iteration (assuming the array size is always a multiple of 4). That would accomplish what I described above.
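A rough sketch of that unrolling in plain C, for one pass of the inner loop with a fixed `i`. The function name `deltas_unrolled4` and the output arrays `dx`, `dy`, `dz` are hypothetical, added so the results are observable. As in the edit above, `n` is assumed to be a multiple of 4.

```c
#include <stddef.h>

/* Hypothetical helper: one pass of the inner loop, unrolled by 4.
   Assumes n is a multiple of 4. Each group of four statements touches
   four consecutive floats of one array, which is the shape a vectorizing
   compiler can map to a single 4-wide subtract. */
void deltas_unrolled4(float xref, float yref, float zref,
                      const float *px, const float *py, const float *pz,
                      float *dx, float *dy, float *dz, size_t n)
{
    for (size_t j = 0; j < n; j += 4) {
        /* x components of points j..j+3 */
        dx[j]     = xref - px[j];
        dx[j + 1] = xref - px[j + 1];
        dx[j + 2] = xref - px[j + 2];
        dx[j + 3] = xref - px[j + 3];
        /* y components of points j..j+3 */
        dy[j]     = yref - py[j];
        dy[j + 1] = yref - py[j + 1];
        dy[j + 2] = yref - py[j + 2];
        dy[j + 3] = yref - py[j + 3];
        /* z components of points j..j+3 */
        dz[j]     = zref - pz[j];
        dz[j + 1] = zref - pz[j + 1];
        dz[j + 2] = zref - pz[j + 2];
        dz[j + 3] = zref - pz[j + 3];
    }
}
```

Because the question already stores x, y, and z in separate arrays, four consecutive points fill all four lanes for each component, so no lane is left idle the way it would be with one 3-component vector per register.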

Consider: How wide is a float? How wide is the SSEx instruction? The ratio should give you some kind of reasonable upper bound.

It's also worth noting that out-of-order pipelines play havoc with getting good estimates of speed-up.

You should consider loop tiling: the way you are accessing values in the inner loop is probably causing a lot of thrashing in the L1 data cache. It's not too bad, because the three arrays total 384 KB (3 × 32768 × 4 bytes) and so probably still fit in the L2, but an L2 cache hit typically costs several times the latency of an L1 cache hit, so this could make a big difference for you.
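A minimal sketch of what tiling the `j` loop could look like, under some loudly stated assumptions: the function name `delta_sums_tiled`, the accumulator array `acc`, and the choice to sum the three deltas are all hypothetical, added only so the loop has an observable result (the code in the question discards its deltas). The idea is to keep one block of `px`, `py`, `pz` resident in L1 while every `i` is processed against it.

```c
#include <stddef.h>

/* Hypothetical helper: for each i, accumulates
     acc[i] = sum over j of (px[i]-px[j]) + (py[i]-py[j]) + (pz[i]-pz[j])
   one tile of j at a time. Requires tile > 0. A tile of 2048 floats per
   array is 24 KB across the three arrays, which would fit a 32 KB L1
   data cache; the right size for a given CPU is an assumption to tune. */
void delta_sums_tiled(const float *px, const float *py, const float *pz,
                      float *acc, size_t n, size_t tile)
{
    for (size_t i = 0; i < n; i++) {
        acc[i] = 0.0f;
    }
    for (size_t j0 = 0; j0 < n; j0 += tile) {
        /* end of this tile, clamped to n for a short final tile */
        size_t j1 = (j0 + tile < n) ? j0 + tile : n;
        for (size_t i = 0; i < n; i++) {
            float xref = px[i], yref = py[i], zref = pz[i];
            float s = acc[i];
            /* this block of px/py/pz is reused for every i */
            for (size_t j = j0; j < j1; j++) {
                s += (xref - px[j]) + (yref - py[j]) + (zref - pz[j]);
            }
            acc[i] = s;
        }
    }
}
```

Compared with the original loop order, each tile is loaded from L2 once and then reused across all `i`, instead of all 384 KB being streamed through L1 for every `i`. Whether this helps in practice depends on how well the hardware prefetcher already hides the L2 latency, so it is worth measuring.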


 