How to clear all but the first non-zero lane in neon?
I have a mask in a uint32x4_t NEON register. In this mask, at least 1 of the 4 ints is set (e.g. 0xffffffff); however, more than one lane may be set in the register. How can I ensure that only one is set?
In C pseudo-code:
uint32x4_t clearmask(uint32x4_t m)
{
    if (m[0])      { m[1] = m[2] = m[3] = 0; }
    else if (m[1]) { m[2] = m[3] = 0; }
    else if (m[2]) { m[3] = 0; }
    return m;
}
Basically I want to clear all but one of the set lanes. An obvious straightforward implementation in NEON could be:
uint32x4_t cleanmask(uint32x4_t m)
{
    uint32x4_t mx;
    /* broadcast ~m[0] to all lanes, then restore lane 0 to all-ones,
       so the AND clears lanes 1-3 only when lane 0 is set */
    mx = vdupq_lane_u32(vget_low_u32(vmvnq_u32(m)), 0);
    mx = vsetq_lane_u32(0xffffffff, mx, 0);
    m = vandq_u32(m, mx);
    /* same with ~m[1], clearing lanes 2-3 when lane 1 is set */
    mx = vdupq_lane_u32(vget_low_u32(vmvnq_u32(m)), 1);
    mx = vsetq_lane_u32(0xffffffff, mx, 1);
    m = vandq_u32(m, mx);
    /* same with ~m[2], clearing lane 3 when lane 2 is set */
    mx = vdupq_lane_u32(vget_high_u32(vmvnq_u32(m)), 0);
    mx = vsetq_lane_u32(0xffffffff, mx, 2);
    m = vandq_u32(m, mx);
    return m;
}
How can this be done more efficiently in ARM NEON?
Very simple:
vceq.u32 q1, q0, #0
vmov.i8 d7, #0xff
vext.8 q2, q3, q1, #12
vand q0, q0, q2
vand d1, d1, d2
vand d1, d1, d4
6 instructions total, 5 if you can keep q3 as a constant.
The aarch64 version below should be easier to understand:
cmeq v1.4s, v0.4s, #0
movi v31.16b, #0xff
ext v2.16b, v31.16b, v1.16b, #12
ext v3.16b, v31.16b, v1.16b, #8
ext v4.16b, v31.16b, v1.16b, #4
and v0.16b, v0.16b, v2.16b
and v0.16b, v0.16b, v3.16b
and v0.16b, v0.16b, v4.16b
ext / vext takes a window from the concatenation of two vectors, so we're creating the masks:
v0 = [ d c b a ]
v2 = [ !c !b !a -1 ]
v3 = [ !b !a -1 -1 ]
v4 = [ !a -1 -1 -1 ]
The highest element (d) is zeroed if any of the previous elements is non-zero. The second-highest element (c) is zeroed if either of its preceding elements (a or b) is non-zero. And so on.
With elements guaranteed to be 0 or -1, mvn also works instead of a compare against zero.
I had nearly the same idea as your uncommented code: broadcast inverted elements as an AND mask to zero later elements if that one is set, otherwise leave the vector unmodified.
But if you're using this in a loop and have 3 spare vector registers, you can NOT (bitwise-invert) all but one element with XOR, instead of MVN + setting one element.
vdupq_lane_u32(vget_low_u32(m), 1) appears to compile efficiently to a vdup.32 q9, d16[1], and that part of my code is the same as yours (but without the mvn).
Unfortunately this is a long serial dependency chain; we're creating the next mask from the AND result, so there's no ILP. I don't see a good way to lower the latency while still getting the desired result.
uint32x4_t cleanmask_xor(uint32x4_t m)
{
    // m = { a, b, c, d }
    uint32x4_t maska = { 0, ~0U, ~0U, ~0U};
    uint32x4_t maskb = {~0U, 0, ~0U, ~0U};
    uint32x4_t maskc = {~0U, ~0U, 0, ~0U};
    uint32x4_t tmp = vdupq_lane_u32(vget_low_u32(m), 0);
    uint32x4_t aflip = tmp ^ maska;
    m &= aflip;             // if a was non-zero, the rest become zero
    tmp = vdupq_lane_u32(vget_low_u32(m), 1);
    uint32x4_t bflip = tmp ^ maskb;
    m &= bflip;             // if b was non-zero, the rest become zero
    tmp = vdupq_lane_u32(vget_high_u32(m), 0);
    uint32x4_t cflip = tmp ^ maskc;
    m &= cflip;             // if c was non-zero, d becomes zero
    return m;
}
/* design notes
[ a b c d ]
[ a ~a ~a ~a ]
&:[ a 0 0 0 ]
or[ 0 b c d ]
= [ e f g h ]
[ ~f f ~f ~f ] // not b, because f can be zero when b isn't
= [ i j k l ]
...
*/
With the loads hoisted out of a loop, this is only 9 instructions vs. 12, because we skip the vmov.32 d1[0], r3 or whatever to insert a -1 in each mask. (ANDing an element with itself is equivalent to ANDing with -1U.) veor with all-ones in the other elements replaces vmvn.
clang seems to be inefficient at loading multiple vector constants: it sets up each address separately instead of storing them near each other, where it could reach them all from one base pointer. So you might want to consider alternate strategies for creating the 3 constants.
#if 1
    // clang sets up the address of each constant separately
    // { a b c d }
    uint32x4_t maska = { 0, ~0U, ~0U, ~0U};
    uint32x4_t maskb = {~0U, 0, ~0U, ~0U};
    uint32x4_t maskc = {~0U, ~0U, 0, ~0U};
#else
    static const uint32_t maskbuf[] = { -1U, -1U, 0, -1U, -1U, -1U };
    // unaligned loads,
    // or load one + shuffle?
#endif