
How to clear all but the first non-zero lane in NEON?

I have a mask in a uint32x4_t NEON register. At least one of its four lanes is set (e.g. to 0xffffffff), but there may be cases where more than one lane is set. How can I ensure that only one lane stays set?

In C pseudocode:

uint32x4_t clearmask(uint32x4_t m)
{
         if (m[0]) { m[1] = m[2] = m[3] = 0; }
    else if (m[1]) { m[2] = m[3] = 0; }
    else if (m[2]) { m[3] = 0; }
    return m;
}

Basically, I want to clear all but one of the set lanes. An obvious, straightforward NEON implementation would be:

uint32x4_t cleanmask(uint32x4_t m)
{
    uint32x4_t mx;
    mx = vdupq_lane_u32(vget_low_u32(vmvnq_u32(m)), 0);
    mx = vsetq_lane_u32(0xffffffff, mx, 0);
    m = vandq_u32(m, mx);

    mx = vdupq_lane_u32(vget_low_u32(vmvnq_u32(m)), 1);
    mx = vsetq_lane_u32(0xffffffff, mx, 1);
    m = vandq_u32(m, mx);

    mx = vdupq_lane_u32(vget_high_u32(vmvnq_u32(m)), 0);
    mx = vsetq_lane_u32(0xffffffff, mx, 2);
    m = vandq_u32(m, mx);

    return m;
}

How can this be done more efficiently in ARM NEON?

Very simple:

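vceq.u32    q1, q0, #0          @ q1 = [!a !b !c !d] (lane 0 first), so d2 = [!a !b]
vmov.i8     d7, #0xff           @ d7 (upper half of q3) = all-ones
vext.8      q2, q3, q1, #12     @ q2 = [-1 !a !b !c], so d4 = [-1 !a]

vand        q0, q0, q2          @ q0 = [a  b&!a  c&!b  d&!c]
vand        d1, d1, d2          @ upper half: c &= !a, d &= !b
vand        d1, d1, d4          @ upper half: c &= -1, d &= !a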

6 instructions total, or 5 if you can keep the all-ones constant (d7, the upper half of q3) in a register.

The aarch64 version below should be easier to understand:

cmeq    v1.4s, v0.4s, #0
movi    v31.16b, #0xff

ext     v2.16b, v31.16b, v1.16b, #12
ext     v3.16b, v31.16b, v1.16b, #8
ext     v4.16b, v31.16b, v1.16b, #4

and     v0.16b, v0.16b, v2.16b
and     v0.16b, v0.16b, v3.16b
and     v0.16b, v0.16b, v4.16b

How this works

ext / vext extracts a window from the concatenation of two vectors, so here it builds the following masks (highest lane on the left):

v0 = [  d   c   b   a ]

v2 = [ !c  !b  !a  -1 ]
v3 = [ !b  !a  -1  -1 ]
v4 = [ !a  -1  -1  -1 ]

The highest element (d) is zeroed if any of the lower elements (a, b or c) is non-zero.

The second-highest element (c) is zeroed if either of the elements below it (a or b) is non-zero, and so on.
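As a rough illustration of the mask construction described above, here is one possible intrinsics rendering, together with a scalar model that runs on any machine. The names keep_first_set_lane_neon and keep_first_set_lane are made up for this sketch, and mapping ext #12 / #8 / #4 to vextq_u32 with element offsets 3 / 2 / 1 is my reading of the assembly rather than something the answer states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>

/* Hypothetical intrinsics version of the ext/and sequence.
   Comments list lane 0 first, i.e. [a b c d]. */
uint32x4_t keep_first_set_lane_neon(uint32x4_t m)
{
    uint32x4_t ones = vdupq_n_u32(UINT32_MAX);
    uint32x4_t zero = vceqq_u32(m, vdupq_n_u32(0)); /* -1 where the lane is 0 */
    m = vandq_u32(m, vextq_u32(ones, zero, 3));     /* [ -1  !a  !b  !c ] */
    m = vandq_u32(m, vextq_u32(ones, zero, 2));     /* [ -1  -1  !a  !b ] */
    m = vandq_u32(m, vextq_u32(ones, zero, 1));     /* [ -1  -1  -1  !a ] */
    return m;
}
#endif

/* Scalar model of the same dataflow, for checking on any machine. */
void keep_first_set_lane(uint32_t m[4])
{
    assert(m != NULL);

    /* Equivalent of the compare against zero: -1 where the lane is 0. */
    uint32_t zero[4];
    for (int i = 0; i < 4; i++)
        zero[i] = (m[i] == 0) ? UINT32_MAX : 0;

    /* shift = 1, 2, 3 plays the role of ext #12, #8, #4:
       lane i is ANDed with the "is zero" flag of lane i - shift,
       and lanes below shift are left alone (ANDed with -1). */
    for (int shift = 1; shift <= 3; shift++)
        for (int i = shift; i < 4; i++)
            m[i] &= zero[i - shift];
}
```

Because the masks come from a compare against zero, this formulation does not need the lanes to be exactly 0 or -1; any non-zero lane counts as set.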


If the elements are guaranteed to be 0 or -1, mvn also works in place of the compare against zero.

I had nearly the same idea as your uncommented code: broadcast the inverted element as an AND mask, which zeroes the later elements if that element is set and otherwise leaves the vector unmodified.

But if you're using this in a loop and have 3 spare vector registers, you can invert (NOT) all but one element with a single XOR, instead of MVN plus setting one element.

vdupq_lane_u32(vget_low_u32(m), 1); appears to compile efficiently to vdup.32 q9, d16[1], and that part of my code is the same as yours (but without the mvn).

Unfortunately this is a long serial dependency chain: we're creating the next mask from the AND result, so there's no instruction-level parallelism (ILP). I don't see a good way to lower the latency while still getting the desired result.

uint32x4_t cleanmask_xor(uint32x4_t m)
{
    //                 {  a    b    c   d }
    uint32x4_t maska = {  0, ~0U, ~0U, ~0U};
    uint32x4_t maskb = {~0U,   0, ~0U, ~0U};
    uint32x4_t maskc = {~0U, ~0U,   0, ~0U};

    uint32x4_t tmp = vdupq_lane_u32(vget_low_u32(m), 0);
    uint32x4_t aflip = tmp ^ maska;
    m &= aflip;  // if a was non-zero, the rest are zero

    tmp = vdupq_lane_u32(vget_low_u32(m), 1);
    uint32x4_t bflip = tmp ^ maskb;
    m &= bflip;  // if b was non-zero, the rest are zero

    tmp = vdupq_lane_u32(vget_high_u32(m), 0);
    uint32x4_t cflip = tmp ^ maskc;
    m &= cflip;  // if c was non-zero, d is zero

    return m;
}

(Godbolt)

/* design notes
  [ a   b   c   d ]
  [ a  ~a  ~a  ~a ] 

&:[ a   0   0   0 ]
or[ 0   b   c   d ]

= [ e   f   g   h  ]
  [ ~f  f   ~f  ~f ]  // not b, because f can be zero when b isn't

= [ i   j   k   l ]
  ...
*/

With the constant loads hoisted out of a loop, this is only 9 instructions vs. 12, because we skip the vmov.32 d1[0], r3 (or whatever) that inserts a -1 into each mask. (ANDing an element with itself is equivalent to ANDing it with -1U.) A veor with all-ones in the other elements replaces the vmvn.
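One way to sanity-check the XOR formulation is a scalar model that runs the same three broadcast / XOR / AND steps on a plain array and compares the result with the pseudocode from the question. This is only a sketch: clearmask_ref, cleanmask_xor_model and count_mismatches are hypothetical helper names, and the model assumes every lane is either 0 or all-ones, which the XOR trick relies on.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The pseudocode from the question, on a plain array. */
void clearmask_ref(uint32_t m[4])
{
    if (m[0])      { m[1] = m[2] = m[3] = 0; }
    else if (m[1]) { m[2] = m[3] = 0; }
    else if (m[2]) { m[3] = 0; }
}

/* Scalar model of cleanmask_xor. Assumes each lane is 0 or all-ones. */
void cleanmask_xor_model(uint32_t m[4])
{
    for (int i = 0; i < 4; i++)
        assert(m[i] == 0 || m[i] == UINT32_MAX);

    for (int k = 0; k < 3; k++) {
        uint32_t bcast = m[k];                      /* vdup of lane k */
        for (int i = 0; i < 4; i++) {
            /* Constant mask: 0 in lane k, all-ones elsewhere. */
            uint32_t mask = (i == k) ? 0 : UINT32_MAX;
            /* Lane k:  m[k] & m[k]   (unchanged).
               Others:  m[i] & ~m[k]  (cleared if lane k is set). */
            m[i] &= bcast ^ mask;
        }
    }
}

/* Number of 0 / all-ones input patterns (out of 16) where the two disagree. */
int count_mismatches(void)
{
    int bad = 0;
    for (unsigned bits = 0; bits < 16; bits++) {
        uint32_t a[4], b[4];
        for (int i = 0; i < 4; i++)
            a[i] = b[i] = ((bits >> i) & 1) ? UINT32_MAX : 0;
        clearmask_ref(a);
        cleanmask_xor_model(b);
        bad += (memcmp(a, b, sizeof a) != 0);
    }
    return bad;
}
```

With lanes that are non-zero but not all-ones, the inverted broadcast is no longer zero, so later lanes would not be fully cleared; a compare against zero avoids that restriction.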

clang seems to be inefficient at loading multiple vector constants: it sets up each address separately instead of storing the constants near each other, where they could all be reached from one base pointer. So you might want to consider alternate strategies for creating the 3 constants.

#if 1
    // clang sets up the address of each constant separately
    //                 {  a    b    c   d }
    uint32x4_t maska = {  0, ~0U, ~0U, ~0U};
    uint32x4_t maskb = {~0U,   0, ~0U, ~0U};
    uint32x4_t maskc = {~0U, ~0U,   0, ~0U};
#else
    static const uint32_t maskbuf[] = 
      { -1U, -1U, 0, -1U, -1U, -1U};
    // unaligned loads.
    // or load one + shuffle?
#endif
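To make the overlapping-window idea in the #else branch concrete, here is a small sketch. It assumes the three masks are read as 4-element windows of maskbuf starting at indices 2, 1 and 0, so that the single 0 lands in lane 0, 1 or 2. The helpers load_mask and load_mask_neon are hypothetical names, and whether this beats three separate constants would need to be checked in the compiler output.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same layout as maskbuf in the #else branch: the single 0 sits at index 2. */
static const uint32_t maskbuf[6] = { -1U, -1U, 0, -1U, -1U, -1U };

/* Portable model: the window starting at index 2 - k has its 0 in lane k.
   k = 0 gives maska, k = 1 gives maskb, k = 2 gives maskc. */
void load_mask(int k, uint32_t out[4])
{
    assert(k >= 0 && k <= 2);
    memcpy(out, maskbuf + (2 - k), 4 * sizeof(uint32_t));
}

#if defined(__ARM_NEON)
#include <arm_neon.h>

/* Same windows as vector loads from one base pointer. The pointer is only
   guaranteed to be element-aligned here, not 16-byte aligned. */
uint32x4_t load_mask_neon(int k)
{
    assert(k >= 0 && k <= 2);
    return vld1q_u32(maskbuf + (2 - k));
}
#endif
```

The appeal of this layout is that all three constants live in one 24-byte buffer, so a single address setup can serve all three loads.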
