How to vectorize a loop with many conditions?

I have the loop below. The goal is to combine all elements of an array tmp into a scalar b. The operation behaves like addition, so there is no required execution order. For example, a + b + c + d can be computed in any order, so (a+b) + (c+d) is valid as well, and the same applies to this operation. However, there are some special conditions under which the result is computed in different ways.

tmp.e and b.e are longs, while tmp.x and b.x are doubles.

Is there any way to compare all the tmp.e values, for example in pairs of 2 with SSE, and compute b.x correctly from the result? Every case can be viewed as an addMul: in the first case the multiplier is just 1, in the others it is 0 or BOUND. Is it possible to vectorize this? If so, how?

Thanks.

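void op(vec& tmp, scalar& b)
{
    // n (the element count) and BOUND are defined elsewhere.
    for (int i = 1; i < n; ++i)
    {
        if (b.e == tmp.e[i])
        {
            // Same e: plain addition (multiplier 1).
            b.x += tmp.x[i];
            b.normalize();
            continue;
        }
        else if (b.e > tmp.e[i])
        {
            if (b.e > tmp.e[i]+1)
            {
                // b.e is more than 1 above tmp.e[i]: skip the element (multiplier 0).
                continue;
            }
            // b.e == tmp.e[i]+1: add with multiplier BOUND.
            b.x += tmp.x[i] * BOUND;
            b.normalize();
        }
        else
        {
            if (tmp.e[i] > b.e+1)
            {
                // tmp.e[i] is more than 1 above b.e: the element replaces b.
                b.x = tmp.x[i];
                b.e = tmp.e[i];
                b.normalize();
                continue;
            }
            // tmp.e[i] == b.e+1: scale b.x by BOUND, then add the element.
            b.x = b.x * BOUND + tmp.x[i];
            b.e = tmp.e[i];
            b.normalize();
        }
    }
}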

Per-element conditionals in SIMD code are usually handled by using a packed-compare instruction to generate a mask of all-zero and all-one elements. You can use this mask to AND or OR vectors. For example, you can increment only the elements that pass a test: AND the mask with a vector of 1s to get a vector with 1 in the elements that should be incremented and 0 in the elements that shouldn't, then add it. This works because 0 is the identity value for addition (x + 0 = x).
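As a rough illustration of the compare-then-AND idea, here is a minimal SSE2 sketch. The function name increment_where_greater and the choice of 32-bit integer elements are assumptions made for the example; they are not taken from the question. The intrinsics come from <emmintrin.h>.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: counts[i] += 1 wherever values[i] > threshold.
void increment_where_greater(std::int32_t* counts, const std::int32_t* values,
                             std::int32_t threshold, std::size_t n)
{
    assert(counts != nullptr && values != nullptr);
    const __m128i thresh = _mm_set1_epi32(threshold);
    const __m128i ones   = _mm_set1_epi32(1);

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(values + i));
        __m128i c = _mm_loadu_si128(reinterpret_cast<const __m128i*>(counts + i));

        // Packed compare: each lane becomes all-ones if v > thresh, else all-zero.
        __m128i mask = _mm_cmpgt_epi32(v, thresh);

        // AND with 1 turns the mask into a per-lane increment of 1 or 0.
        __m128i inc = _mm_and_si128(mask, ones);

        // Adding 0 leaves the failing lanes unchanged.
        c = _mm_add_epi32(c, inc);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(counts + i), c);
    }

    // Scalar tail for the last n % 4 elements.
    for (; i < n; ++i) {
        if (values[i] > threshold) ++counts[i];
    }
}
```

No branch is taken per element inside the vector loop; the condition only exists as data in the mask.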

You can also compute two results and then blend them together according to a mask (using AND and OR, or using vector blend instructions).

This method of doing SIMD conditionals is like a cmov: you have to compute both sides of the branch, even if all the elements you're processing in a vector take the same side of the branch.
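A blend built from AND, ANDNOT and OR might look like the sketch below, which uses only SSE2. The helpers select_pd and scale_below are hypothetical names invented for this example, and the condition (x[i] < limit) is only a stand-in for whatever test the real code needs. Note that both candidate results are computed for every lane before the mask picks one.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// Per-lane select: if_set where the mask lane is all-ones, if_clear where it is all-zero.
static inline __m128d select_pd(__m128d mask, __m128d if_set, __m128d if_clear)
{
    // _mm_andnot_pd(a, b) computes (~a) & b.
    return _mm_or_pd(_mm_and_pd(mask, if_set), _mm_andnot_pd(mask, if_clear));
}

// Hypothetical example: out[i] = (x[i] < limit) ? x[i] * scale : x[i].
// For brevity this sketch assumes n is a multiple of 2.
void scale_below(double* out, const double* x, double limit, double scale,
                 std::size_t n)
{
    assert(n % 2 == 0);
    const __m128d vlimit = _mm_set1_pd(limit);
    const __m128d vscale = _mm_set1_pd(scale);

    for (std::size_t i = 0; i < n; i += 2) {
        __m128d v      = _mm_loadu_pd(x + i);
        __m128d scaled = _mm_mul_pd(v, vscale);    // "taken" side, always computed
        __m128d mask   = _mm_cmplt_pd(v, vlimit);  // all-ones where v < limit
        _mm_storeu_pd(out + i, select_pd(mask, scaled, v));
    }
}
```

With SSE4.1 available, the select could instead use a vector blend instruction, at the cost of requiring a newer instruction set.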


It looks like your data is in struct-of-arrays format already, so you could generate masks from operations on vectors of e values, for use with vectors of x values. If long is 32 bits, you could compare 4 elements at once, then unpack-low and unpack-high the result to get two masks with 64-bit elements that match your doubles. If the arrays are small (so they'd fit in cache even with .e[] taking as much space as .x[]), having the longs the same width as the doubles means less unpacking.
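The unpacking step can be sketched as follows, assuming the e values are stored as 32-bit integers. The function name sum_x_where_e_equals and the fixed reference value ref are assumptions for illustration: in the question's loop b.e can change between iterations, so this shows only the mask mechanics, not a vectorized version of that loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical helper: sums x[i] over all i where e[i] == ref.
// For brevity this sketch assumes n is a multiple of 4.
double sum_x_where_e_equals(const double* x, const std::int32_t* e,
                            std::int32_t ref, std::size_t n)
{
    assert(n % 4 == 0);
    const __m128i vref = _mm_set1_epi32(ref);
    __m128d acc = _mm_setzero_pd();

    for (std::size_t i = 0; i < n; i += 4) {
        __m128i ve  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(e + i));
        __m128i m32 = _mm_cmpeq_epi32(ve, vref);  // four 32-bit masks

        // Interleaving the mask with itself duplicates each 32-bit lane,
        // which widens it to a 64-bit mask: lanes (0,0,1,1) and (2,2,3,3).
        __m128d mlo = _mm_castsi128_pd(_mm_unpacklo_epi32(m32, m32));
        __m128d mhi = _mm_castsi128_pd(_mm_unpackhi_epi32(m32, m32));

        // Masked-out doubles become +0.0, the identity for addition.
        acc = _mm_add_pd(acc, _mm_and_pd(mlo, _mm_loadu_pd(x + i)));
        acc = _mm_add_pd(acc, _mm_and_pd(mhi, _mm_loadu_pd(x + i + 2)));
    }

    // Horizontal sum of the two accumulator lanes.
    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1];
}
```

The accumulator adds elements in a different order than a scalar loop would, which is acceptable only because the question states that the operation has no required execution order.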


Anyway, it doesn't look promising. There are too many conditions, and I have no idea what the whole thing is really trying to accomplish or what restrictions there might be on the input data. If I knew more about the problem, maybe I could think of a vectorized way to do some of it.


Oh, I think another fatal flaw is that each iteration depends on the previous iteration, because it might modify b. So you can't vectorize to do multiple iterations in parallel, unless you can work out a rule for updating b based on the last vector element.
