
How can I set __m128i without using any SSE instructions?

I have many functions that use the same constant __m128i values. For example:

const __m128i K8 = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
const __m128i K16 = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8);
const __m128i K32 = _mm_setr_epi32(1, 2, 3, 4);

So I want to store all these constants in one place. But there is a problem: I check which CPU extensions are available at run time. If the CPU doesn't support, for example, SSE (or AVX), the program will crash during initialization of the constants.

So is it possible to initialize these constants without using SSE?

Initializing a __m128i vector without SSE instructions is possible, but how to do it depends on how the compiler defines __m128i.

For Microsoft Visual Studio you can define the following macros (MSVC defines __m128i as a union whose first member is a char[16] array, so a brace-enclosed list of 16 chars initializes it):

template <class T> inline char GetChar(T value, size_t index)
{
    return ((char*)&value)[index];
}

#define AS_CHAR(a) char(a)

#define AS_2CHARS(a) \
    GetChar(int16_t(a), 0), GetChar(int16_t(a), 1)

#define AS_4CHARS(a) \
    GetChar(int32_t(a), 0), GetChar(int32_t(a), 1), \
    GetChar(int32_t(a), 2), GetChar(int32_t(a), 3)

#define _MM_SETR_EPI8(a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, aa, ab, ac, ad, ae, af) \
    {AS_CHAR(a0), AS_CHAR(a1), AS_CHAR(a2), AS_CHAR(a3), \
     AS_CHAR(a4), AS_CHAR(a5), AS_CHAR(a6), AS_CHAR(a7), \
     AS_CHAR(a8), AS_CHAR(a9), AS_CHAR(aa), AS_CHAR(ab), \
     AS_CHAR(ac), AS_CHAR(ad), AS_CHAR(ae), AS_CHAR(af)}

#define _MM_SETR_EPI16(a0, a1, a2, a3, a4, a5, a6, a7) \
    {AS_2CHARS(a0), AS_2CHARS(a1), AS_2CHARS(a2), AS_2CHARS(a3), \
     AS_2CHARS(a4), AS_2CHARS(a5), AS_2CHARS(a6), AS_2CHARS(a7)}

#define _MM_SETR_EPI32(a0, a1, a2, a3) \
    {AS_4CHARS(a0), AS_4CHARS(a1), AS_4CHARS(a2), AS_4CHARS(a3)}       

For GCC the macros look like this (GCC defines __m128i as a vector of two long long values):

// Macro arguments and results are parenthesized so that expressions
// such as x + 1 expand correctly.
#define CHAR_AS_LONGLONG(a) (((long long)(a)) & 0xFF)

#define SHORT_AS_LONGLONG(a) (((long long)(a)) & 0xFFFF)

#define INT_AS_LONGLONG(a) (((long long)(a)) & 0xFFFFFFFF)

#define LL_SETR_EPI8(a, b, c, d, e, f, g, h) \
    (CHAR_AS_LONGLONG(a) | (CHAR_AS_LONGLONG(b) << 8) | \
    (CHAR_AS_LONGLONG(c) << 16) | (CHAR_AS_LONGLONG(d) << 24) | \
    (CHAR_AS_LONGLONG(e) << 32) | (CHAR_AS_LONGLONG(f) << 40) | \
    (CHAR_AS_LONGLONG(g) << 48) | (CHAR_AS_LONGLONG(h) << 56))

#define LL_SETR_EPI16(a, b, c, d) \
    (SHORT_AS_LONGLONG(a) | (SHORT_AS_LONGLONG(b) << 16) | \
    (SHORT_AS_LONGLONG(c) << 32) | (SHORT_AS_LONGLONG(d) << 48))

#define LL_SETR_EPI32(a, b) \
    (INT_AS_LONGLONG(a) | (INT_AS_LONGLONG(b) << 32))

#define _MM_SETR_EPI8(a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, aa, ab, ac, ad, ae, af) \
    {LL_SETR_EPI8(a0, a1, a2, a3, a4, a5, a6, a7), LL_SETR_EPI8(a8, a9, aa, ab, ac, ad, ae, af)}

#define _MM_SETR_EPI16(a0, a1, a2, a3, a4, a5, a6, a7) \
    {LL_SETR_EPI16(a0, a1, a2, a3), LL_SETR_EPI16(a4, a5, a6, a7)}

#define _MM_SETR_EPI32(a0, a1, a2, a3) \
    {LL_SETR_EPI32(a0, a1), LL_SETR_EPI32(a2, a3)}        

So in your code, the initialization of the __m128i constants will look like this:

const __m128i K8 = _MM_SETR_EPI8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
const __m128i K16 = _MM_SETR_EPI16(1, 2, 3, 4, 5, 6, 7, 8);
const __m128i K32 = _MM_SETR_EPI32(1, 2, 3, 4);
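A quick way to convince yourself that such a brace initializer produces the same bits as the intrinsic is to compare the two at run time. This is only a minimal sketch: the initializers are written out by hand instead of going through the macros above, the function name k32_matches_intrinsic is made up for the example, and it assumes a little-endian x86 target and the two compiler layouts described above.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics, used only for the comparison
#include <cassert>
#include <cstring>

#if defined(_MSC_VER)
// MSVC layout: the first union member is a 16-element char array.
static const __m128i K32 = {1, 0, 0, 0,  2, 0, 0, 0,  3, 0, 0, 0,  4, 0, 0, 0};
#else
// GCC/Clang layout: a vector of two long long values, each holding
// two little-endian 32-bit lanes.
static const __m128i K32 = {(2LL << 32) | 1LL, (4LL << 32) | 3LL};
#endif

// Hypothetical check: does the static initializer match _mm_setr_epi32?
bool k32_matches_intrinsic() {
    const __m128i expected = _mm_setr_epi32(1, 2, 3, 4);
    return std::memcmp(&K32, &expected, sizeof K32) == 0;
}
```

The comparison itself uses SSE2, so a check like this belongs in a test that only runs on machines known to support it.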

I suggest defining the initialisation data globally as scalar data and then loading it locally into a const __m128i:

static const uint8_t gK8[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

static inline void foo()
{
    const __m128i K8 = _mm_loadu_si128((const __m128i *)gK8);

    // ...
}
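A slightly fuller sketch of the same idea follows. The alignas(16) and the aligned load are my additions (the version above uses an unaligned load, which also works), and add_k8 is just a made-up example function.

```cpp
#include <emmintrin.h>
#include <cassert>
#include <cstdint>

// Plain scalar data: initializing it needs no SSE instruction at all.
// alignas(16) is an assumption added here so the aligned load is legal.
alignas(16) static const std::uint8_t gK8[16] =
    { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

// Hypothetical example: adds the constant to 16 input bytes.
void add_k8(const std::uint8_t* in, std::uint8_t* out) {
    // The vector constant only exists inside SSE code.
    const __m128i K8 = _mm_load_si128(reinterpret_cast<const __m128i*>(gK8));
    const __m128i v  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi8(v, K8));
}
```

Since the array is ordinary static data, nothing SSE-related runs before main; the load only executes in code you reach after your CPU check.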

You can use a union.

union M128 {
   char i8[16];   // array declarator goes after the name
   __m128i i128;
};

const M128 k8 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

If the M128 union is defined locally, where you use the loop, this should have no performance overhead (it will be loaded from memory once at the beginning of the loop). Because it contains a member of type __m128i, M128 inherits the correct alignment.

void foo()
{
   M128 k8 = ...;
   // use k8.i128 in your for loop
}

If it is defined somewhere else, copy it into a local variable before you start the loop; otherwise the compiler may not be able to optimize it.

void foo()
{
    __m128i tmp = k8.i128;
    // for loop here
}

This will load k8 into a CPU register and keep it there for the duration of the loop, as long as there are enough free registers to carry out the loop body.

Depending on which compiler you use, these unions may already be defined (Visual Studio does this), but the compiler-provided definitions may not be portable.
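Here is a minimal, self-contained sketch of the union approach with a check against the intrinsic. Note that reading i128 after initializing i8 is type punning through a union, which ISO C++ does not formally allow; this relies on GCC, Clang and MSVC supporting it in practice. The function name k8_matches_setr is made up for the example.

```cpp
#include <emmintrin.h>
#include <cassert>

union M128 {
    char    i8[16];   // first member: this is what the braces initialize
    __m128i i128;
};

// The union takes its alignment from the __m128i member.
static_assert(alignof(M128) == alignof(__m128i), "M128 should be 16-byte aligned");

// Statically initialized: no SSE instruction runs at start-up.
static const M128 k8 = {{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}};

// Hypothetical check that the union holds the same bits as _mm_setr_epi8.
bool k8_matches_setr() {
    const __m128i expected =
        _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
    const __m128i eq = _mm_cmpeq_epi8(k8.i128, expected);
    return _mm_movemask_epi8(eq) == 0xFFFF;  // all 16 bytes compare equal
}
```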

You usually don't need this. Compilers are very good at using the same storage for multiple functions that use the same constant. Just like merging multiple instances of the same string literal into one string constant, multiple instances of the same _mm_set* in different functions will all load from the same vector constant (or be generated on the fly, for _mm_setzero_si128() or _mm_set1_epi8(-1)).

Using Godbolt's binary output (disassembly) mode lets you see whether different functions are loading from the same block of memory or not. Look at the comment it adds, which resolves the RIP-relative addresses to absolute addresses.

  • gcc: all identical constants share the same storage, regardless of whether they're from auto-vectorization or _mm_set. 32B constants can't overlap with 16B constants, even if the 16B constant is a subset of the 32B one.

  • clang: identical constants share storage. 16B and 32B constants don't overlap, even when one is a subset of the other. Some functions using repetitive constants use an AVX2 vpbroadcastd broadcast-load (which doesn't even take an ALU uop on Intel SnB-family CPUs). For some reason, it chooses to do this based on the element size of the operation, not on how repetitive the constant is. Note that clang's asm output repeats the constant for each use, but the final binary doesn't.

  • MSVC: identical constants share storage. Pretty much the same as what gcc does. (The full asm output is hard to wade through; use search. I could only get the asm at all by having main find the path to the .exe, then work out the path to the asm output made with cl.exe -O2 /FAs, and run system("type .../foo.asm")).

Compilers are good at this, since it's not a new problem. It has existed with strings since the earliest days of compilers.
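To try this yourself, something like the following is enough. It's only a sketch with made-up function names (add_k32, sub_k32, lane0); whether the two functions really share one 16-byte constant depends on your compiler and options, so check the disassembly rather than taking it on faith.

```cpp
#include <emmintrin.h>
#include <cassert>

// Both functions spell out the same constant independently.
// With optimization enabled, look at the addresses the two loads use.
__m128i add_k32(__m128i v) {
    return _mm_add_epi32(v, _mm_setr_epi32(1, 2, 3, 4));
}

__m128i sub_k32(__m128i v) {
    return _mm_sub_epi32(v, _mm_setr_epi32(1, 2, 3, 4));
}

// Helper for inspecting the lowest 32-bit lane.
int lane0(__m128i v) {
    return _mm_cvtsi128_si32(v);
}
```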

I haven't checked if this works across source files (e.g. for an inline vector function used in multiple compilation units). If you do still want static / global vector constants, see below:


It appears there is no easy and portable way to statically initialize a static/global __m128. C compilers won't even accept _mm_set* as an initializer, because it works like a function. They don't take advantage of the fact that they could actually see through it to a compile-time-constant 16B.

const __m128i K32 = _mm_setr_epi32(1, 2, 3, 4);   // Illegal in C
// C++: generates a constructor that copies from .rodata to the BSS

Even though the constructor only requires SSE1 or SSE2, you don't want this anyway. It's horrible. DON'T DO THIS. You end up paying the memory cost of your constants twice.
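What you can do instead is keep the _mm_set* inside the function that is only reached after the run-time check, so nothing vector-related runs at start-up. A rough sketch, assuming GCC or Clang on x86 for the detection (__builtin_cpu_supports is a GNU builtin); cpu_has_sse2 and the add_k8* functions are made-up names:

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // assumes an x86 target with SSE2 enabled for this file

// Hypothetical helper: reports whether the running CPU supports SSE2.
static bool cpu_has_sse2() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;  // unknown toolchain: conservatively take the scalar path
#endif
}

// Scalar fallback: adds 1..16 to the 16 input bytes.
static void add_k8_scalar(const std::uint8_t* in, std::uint8_t* out) {
    for (int i = 0; i < 16; ++i)
        out[i] = static_cast<std::uint8_t>(in[i] + (i + 1));
}

// SSE2 path: the constant is built here, not by a global constructor.
static void add_k8_sse2(const std::uint8_t* in, std::uint8_t* out) {
    const __m128i K8 =
        _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi8(v, K8));
}

// Dispatcher: only touches SSE2 code after the run-time check.
void add_k8(const std::uint8_t* in, std::uint8_t* out) {
    if (cpu_has_sse2())
        add_k8_sse2(in, out);
    else
        add_k8_scalar(in, out);
}
```

In a real project the SSE2 function would normally live in its own source file built with the SSE2 option, so the compiler can't use those instructions in the scalar path.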


Fabio's union answer looks like the best portable way to statically initialize a vector constant, but it means you have to access the __m128i union member. It may help with grouping related constants near each other (hopefully in the same cache line) even if they're used by scattered functions. There are non-portable ways to accomplish that, too (e.g. put related constants in their own ELF section with GNU C __attribute__ ((section ("constants_for_task_A")))). Hopefully that can group them together in the .rodata section (which becomes part of the text segment).
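Combining the two ideas might look like the sketch below. The TASK_A_CONST macro, the constant names and task_a_low_byte are all made up for illustration, and the attribute is only applied for GNU-compatible compilers targeting ELF; elsewhere the macro expands to nothing. I haven't verified where the linker finally places such a section, so treat the grouping as a hope rather than a guarantee.

```cpp
#include <emmintrin.h>
#include <cassert>

#if defined(__GNUC__) && defined(__ELF__)
// Non-portable: ask for a dedicated ELF section for these constants.
#define TASK_A_CONST __attribute__((section("constants_for_task_A")))
#else
#define TASK_A_CONST  // no grouping on other toolchains
#endif

union M128 {
    char    i8[16];
    __m128i i128;   // read via union type punning (a compiler extension)
};

// Two related constants, statically initialized and placed together.
static const M128 kOnes TASK_A_CONST =
    {{1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1}};
static const M128 kTwos TASK_A_CONST =
    {{2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2}};

// Hypothetical user of both constants: returns the lowest byte of their sum.
int task_a_low_byte() {
    const __m128i sum = _mm_add_epi8(kOnes.i128, kTwos.i128);
    return _mm_cvtsi128_si32(sum) & 0xFF;
}
```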
