Trying to understand Intel Intrinsics Guide explanation for _mm256_permute2x128_si256
I'm trying to understand _mm256_permute2x128_si256. Are all 256 bits of register a read into the CASE first, and then all 256 bits of register b? Or are the two vectors read in interleaved, 32 bits at a time? Which 32 bits of which vector correspond to which bit of imm8, and in what order? Thanks!
DEFINE SELECT4(src1, src2, control) {
CASE(control[1:0]) OF
0: tmp[127:0] := src1[127:0]
1: tmp[127:0] := src1[255:128]
2: tmp[127:0] := src2[127:0]
3: tmp[127:0] := src2[255:128]
ESAC
IF control[3]
tmp[127:0] := 0
FI
RETURN tmp[127:0]
}
dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0])
dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4])
dst[MAX:256] := 0
Please see this website; it's more informative than Intel's documentation:
https://www.felixcloutier.com/x86/vperm2i128
It's a shuffle that selects two 128-bit lanes from the 4 total lanes of the 2 input vectors.
The control integer operand has two 2-bit fields that each index one of the 4 lanes. You could look at it as concatenating both input vectors and then indexing into that 4-lane array.
Or, if the high bit of an index nibble is set, it zeros that lane of the result.
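The "concatenate and index" view can be sketched as a portable scalar model. This is only an illustration of the lane selection, not the real intrinsic: the type vec256_model and the function permute2x128_model are made-up names, and each 128-bit lane is represented as two 64-bit integers.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative model of a 256-bit vector: lane[0] is bits 127:0,
   lane[1] is bits 255:128. Not a real Intel type. */
typedef struct { uint64_t lane[2][2]; } vec256_model;

/* Scalar sketch of the selection done by _mm256_permute2x128_si256. */
static vec256_model permute2x128_model(vec256_model a, vec256_model b,
                                       unsigned imm8)
{
    /* Treat a and b as one 4-lane array: {a.lo, a.hi, b.lo, b.hi}. */
    const uint64_t *lanes[4] = { a.lane[0], a.lane[1], b.lane[0], b.lane[1] };
    vec256_model dst;

    for (int i = 0; i < 2; i++) {
        /* Low nibble controls dst lane 0, high nibble controls dst lane 1. */
        unsigned ctrl = (imm8 >> (4 * i)) & 0xF;
        if (ctrl & 0x8) {
            /* High bit of the nibble set: zero this lane of the result. */
            memset(dst.lane[i], 0, sizeof dst.lane[i]);
        } else {
            /* Low 2 bits index the 4-lane array; bit 2 is ignored. */
            memcpy(dst.lane[i], lanes[ctrl & 0x3], sizeof dst.lane[i]);
        }
    }
    return dst;
}
```

With this model, imm8 = 0x20 gives {a.lo, b.lo} and imm8 = 0x31 gives {a.hi, b.hi}; in both cases a whole 128-bit lane is copied at once.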
There's nothing involving 32-bit granularity. The pseudo-code from the intrinsics guide defines a helper function and passes all 256 bits of each input to that helper function twice. All the [hi:lo] ranges are in bits, not bytes.
Intel's asm documentation for the corresponding instruction (vperm2i128) has more comprehensible pseudo-code that separates out the zeroing:
CASE IMM8[1:0] of
0: DEST[127:0]←SRC1[127:0]
1: DEST[127:0]←SRC1[255:128]
2: DEST[127:0]←SRC2[127:0]
3: DEST[127:0]←SRC2[255:128]
ESAC
CASE IMM8[5:4] of
0: DEST[255:128]←SRC1[127:0]
1: DEST[255:128]←SRC1[255:128]
2: DEST[255:128]←SRC2[127:0]
3: DEST[255:128]←SRC2[255:128]
ESAC
IF (imm8[3])
DEST[127:0] ← 0
FI
IF (imm8[7])
DEST[255:128] ← 0
FI
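Going the other way, the immediate can be built from two lane selectors instead of writing the hex value by hand. The names below (SEL_A_LO, PERM2X128_IMM, and so on) are hypothetical helpers for illustration and do not come from any Intel header; SRC1 and SRC2 in the pseudo-code correspond to the intrinsic's a and b.

```c
/* Hypothetical selector names; values follow the CASE tables above. */
enum lane_sel {
    SEL_A_LO = 0,  /* SRC1[127:0]   */
    SEL_A_HI = 1,  /* SRC1[255:128] */
    SEL_B_LO = 2,  /* SRC2[127:0]   */
    SEL_B_HI = 3,  /* SRC2[255:128] */
    SEL_ZERO = 8   /* bit 3 of the nibble set: zero that result lane */
};

/* Low nibble picks DEST[127:0], high nibble picks DEST[255:128].
   A macro keeps the result an integer constant expression. */
#define PERM2X128_IMM(dst_lo, dst_hi) (((dst_hi) << 4) | (dst_lo))
```

For example, PERM2X128_IMM(SEL_A_LO, SEL_B_LO) is 0x20 and PERM2X128_IMM(SEL_A_HI, SEL_B_HI) is 0x31. The macro form matters because the intrinsic's imm8 argument has to be a compile-time constant.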