Trying to understand Intel Intrinsics Guide explanation for _mm256_permute2x128_si256

I'm trying to understand _mm256_permute2x128_si256. Are all 256 bits of register a read into the CASE first, and then all 256 bits of register b? Or are 32-bit chunks read in interleaved between vector a and vector b? Which 32 bits of which vector correspond to which bit of imm8, and in what order? Thanks!

DEFINE SELECT4(src1, src2, control) {
    CASE(control[1:0]) OF
    0:  tmp[127:0] := src1[127:0]
    1:  tmp[127:0] := src1[255:128]
    2:  tmp[127:0] := src2[127:0]
    3:  tmp[127:0] := src2[255:128]
    ESAC
    IF control[3]
        tmp[127:0] := 0
    FI
    RETURN tmp[127:0]
}
dst[127:0] := SELECT4(a[255:0], b[255:0], imm8[3:0])
dst[255:128] := SELECT4(a[255:0], b[255:0], imm8[7:4])
dst[MAX:256] := 0

Please see this page; it's more informative than Intel's Intrinsics Guide entry:

https://www.felixcloutier.com/x86/vperm2i128

It's a shuffle that selects two 128-bit lanes from the 4 total lanes of the 2 input vectors.
The control integer operand has two 2-bit fields (imm8[1:0] and imm8[5:4]) that each index one of the 4 lanes. You can think of it as concatenating both input vectors and then indexing into that 4-lane array.

Alternatively, if the high bit of an index nibble is set (imm8[3] or imm8[7]), that lane of the result is zeroed.

Nothing here involves 32-bit granularity. The pseudo-code from the Intrinsics Guide defines a helper function and passes all 256 bits of each input to that helper twice, once per output lane. All the [hi:lo] ranges are in bits, not bytes.
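To make the "concatenate and index" view concrete, here is a minimal scalar sketch in plain C that models the behaviour described above. It is not Intel's implementation and does not use the real intrinsic; the names `vec256_model`, `select_lane` and `permute2x128_model` are hypothetical, and each 256-bit vector is assumed to be represented as four 64-bit words, lowest word first.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 256-bit vector: w[0..1] hold bits [127:0],
   w[2..3] hold bits [255:128]. Not an Intel type. */
typedef struct { uint64_t w[4]; } vec256_model;

/* Models SELECT4: pick one 128-bit lane out of {a.lo, a.hi, b.lo, b.hi},
   then zero it if bit 3 of the control nibble is set. */
static void select_lane(const vec256_model *a, const vec256_model *b,
                        unsigned control, uint64_t out[2])
{
    const vec256_model *src = (control & 2) ? b : a; /* bit 1: which vector */
    unsigned half = control & 1;                     /* bit 0: low or high lane */

    out[0] = src->w[2 * half];
    out[1] = src->w[2 * half + 1];

    if (control & 8) {                               /* bit 3: zero the lane */
        out[0] = 0;
        out[1] = 0;
    }
}

/* Models the whole operation: the helper is called twice, and each call
   sees both full inputs. Only the control nibble differs. */
static vec256_model permute2x128_model(vec256_model a, vec256_model b,
                                       unsigned imm8)
{
    vec256_model dst;
    select_lane(&a, &b, imm8 & 0x0F, &dst.w[0]);        /* imm8[3:0] -> dst[127:0]   */
    select_lane(&a, &b, (imm8 >> 4) & 0x0F, &dst.w[2]); /* imm8[7:4] -> dst[255:128] */
    return dst;
}
```

Under this model, imm8 = 0x20 yields {a.lo, b.lo}, imm8 = 0x31 yields {a.hi, b.hi}, and imm8 = 0x01 swaps the two halves of a. Bits 2 and 6 of imm8 are ignored, matching the pseudo-code, which only reads control[1:0] and control[3].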

Intel's asm documentation for the corresponding instruction (vperm2i128) has more comprehensible pseudo-code that separates out the zeroing:

CASE IMM8[1:0] of
    0: DEST[127:0]←SRC1[127:0]
    1: DEST[127:0]←SRC1[255:128]
    2: DEST[127:0]←SRC2[127:0]
    3: DEST[127:0]←SRC2[255:128]
ESAC

CASE IMM8[5:4] of
    0: DEST[255:128]←SRC1[127:0]
    1: DEST[255:128]←SRC1[255:128]
    2: DEST[255:128]←SRC2[127:0]
    3: DEST[255:128]←SRC2[255:128]
ESAC

IF (imm8[3])
    DEST[127:0] ← 0
FI
IF (imm8[7])
    DEST[255:128] ← 0
FI
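The bit layout in that pseudo-code can also be read in the other direction, as a recipe for building an imm8 value. The sketch below is only an illustration: the enum `lane_sel` and the function `make_perm2x128_imm8` are made-up names, not part of any Intel header.

```c
#include <stdint.h>

/* Index into the 4-lane array {a.lo, a.hi, b.lo, b.hi}.
   Illustrative names only, not from immintrin.h. */
enum lane_sel { A_LO = 0, A_HI = 1, B_LO = 2, B_HI = 3 };

/* Build the control byte:
     bits [1:0] select the source lane for dst[127:0]
     bits [5:4] select the source lane for dst[255:128]
     bit  3     zeroes dst[127:0]
     bit  7     zeroes dst[255:128] */
static uint8_t make_perm2x128_imm8(enum lane_sel lo, enum lane_sel hi,
                                   int zero_lo, int zero_hi)
{
    return (uint8_t)((lo & 3)
                   | ((hi & 3) << 4)
                   | (zero_lo ? 0x08 : 0)
                   | (zero_hi ? 0x80 : 0));
}
```

For example, `make_perm2x128_imm8(A_LO, B_LO, 0, 0)` gives 0x20 and `make_perm2x128_imm8(A_HI, B_HI, 0, 0)` gives 0x31. The real intrinsic requires imm8 to be a compile-time constant, so in practice a helper like this is more useful for documenting magic numbers (or rewritten as a macro) than for computing the control at run time.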
