Trying to convert big endian to little endian with x86 asm SSSE3

Question

I have been writing ARM asm for a while and tried to optimize simple loops with x86 asm SSSE3. I cannot find a way to convert big endian to little endian.

ARM NEON has a single vector instruction to do exactly this, but SSSE3 does not. I tried using two shifts and an OR, but that requires going to 32 bits per slot instead of 16 if we are shifting left by 8 (the data gets saturated).

I looked into PSHUFB, but when I use it, the first half of each 16-bit word is always 0.

I am using inline asm on x86 for Android. Sorry for any incorrect syntax or other errors; please try to understand what I mean (it is hard to rip this out of my code).

# Data
uint16_t dataSrc[] = {0x7000, 0x4401, 0x3801, 0xf002, 0x4800, 0xb802, 0x1800, 
0x3c00, 0xd800.....
uint16_t* src = dataSrc;
uint8_t * dst = new uint8_t[16] = {0};
uint8_t * map = new uint8_t[16] = { 9,8, 11,10, 13,12, 15,14, 1,0,3,2,5,4,7,6,};

# I need to convert 0x7000 to 0x0070 by swapping the two bytes of each 16-bit word, vectorized.

asm volatile (
        "movdqu     (%0),%%xmm1\n"
        "pshufb     %2,%%xmm1\n"
        "movdqu     %%xmm1,(%1)\n"
:   "+r" (src),
"+r" (dst),
"+r" (map)
:
:   "memory", "cc", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4"
);

If I loop through the dataSrc variable, my output for the first 8 bytes is:

Only the last 4 are swapped, even if in the wrong order. Why are the first 4 all zeros? No matter how I change the map, the first is sometimes 0 and the next 3 are always zero. Why? Am I doing something wrong?

Edit

I figured out why it didn't work: the map was not passed into the inline asm correctly, and I had to free up an input variable for it.

As for the other question, about intrinsics vs. hand-written asm: the code below converts 16-byte video frame data from YUV42010BE to YUVP420 (8 bit). The problem is the shuffle; if I used little endian, I would not need that section.

static const char map[16] = { 9, 8, 11, 10, 13, 12, 15, 14, 1, 0, 3, 2, 5, 4, 7, 6 };
int dstStrideOffset = (dstStride - srcStride / 2);
asm volatile (
    "push       %%ebp\n"

    // All 0s for packing
    "xorps      %%xmm0, %%xmm0\n"

    "movdqu     (%5),%%xmm4\n"

    "yloop:\n"

    // Set the counter for the stride
    "mov %2,    %%ebp\n"

    "xloop:\n"

    // Load source data
    "movdqu     (%0),%%xmm1\n"
    "movdqu     16(%0),%%xmm2\n"
    "add        $32,%0\n"

    // The first 4 16-bytes are 0,0,0,0, this is the issue.
    "pshufb      %%xmm4, %%xmm1\n"
    "pshufb      %%xmm4, %%xmm2\n"

    // Shift each 16 bit to the right to convert
    "psrlw      $0x2,%%xmm1\n"
    "psrlw      $0x2,%%xmm2\n"

    // Merge both 16bit vectors into 1 8bit vector
    "packuswb   %%xmm0, %%xmm1\n"
    "packuswb   %%xmm0, %%xmm2\n"
    "unpcklpd   %%xmm2, %%xmm1\n"

    // Write the data
    "movdqu     %%xmm1,(%1)\n"
    "add        $16, %1\n"

    // End loop, x = srcStride; x >= 0 ; x -= 32
    "sub        $32, %%ebp\n"
    "jg         xloop\n"

    // End loop, y = height; y >= 0; --y
    "add %4,    %1\n"
    "sub $1,    %3\n"
    "jg         yloop\n"

    "pop        %%ebp\n"
:   "+r" (src),
    "+r" (dst),
    "+r" (srcStride),
    "+r" (height),
    "+r"(dstStrideOffset)
:   "x"(map)
:   "memory", "cc", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4"
);

I haven't gotten around to implementing the shuffle with intrinsics yet, so this version uses little endian:

const int dstStrideOffset = (dstStride - srcStride / 2);
__m128i mdata, mdata2;
const __m128i zeros = _mm_setzero_si128();
for (int y = height; y > 0; --y) {
    for (int x = srcStride; x > 0; x -= 32) {
        mdata = _mm_loadu_si128((const __m128i *)src);
        mdata2 = _mm_loadu_si128((const __m128i *)(src + 8));
        mdata = _mm_packus_epi16(_mm_srli_epi16(mdata, 2), zeros);
        mdata2 = _mm_packus_epi16(_mm_srli_epi16(mdata2, 2), zeros);
        _mm_storeu_si128( (__m128i *)dst, static_cast<__m128i>(_mm_unpacklo_pd(mdata, mdata2)));
        src += 16;
        dst += 16;
    }
    dst += dstStrideOffset;
}

This is probably not written correctly, but I benchmarked it on the Android emulator (API 27), x86 (i686, SSSE3 is the highest available), with default compiler settings plus added optimization flags (which made no difference in performance) -Ofast -O3 -funroll-loops -mssse3 -mfpmath=sse. On average:

Intrinsics: 1.9-2.1 ms. Hand-written: 0.7-1 ms.

Is there a way to speed this up? Maybe I wrote the intrinsics wrong. Is it possible to get closer to the speed of the hand-written version with intrinsics?

Answer 1

Your code doesn't work because you pass the address of map to pshufb. I'm not sure what code gcc generates for this; I can't imagine this compiles at all.
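For reference, the behaviour of the 128-bit pshufb can be modelled in plain scalar C, which makes it easier to see what a wrong control vector does. The sketch below is only an illustration: the function name `pshufb_model` is made up here, and whether a garbage control vector is what produced the zeros in the question is an assumption, not something the question confirms. The rule itself is the documented one: each output byte is zero if bit 7 of its control byte is set, otherwise it is the source byte selected by the low 4 bits of the control byte.

```c
#include <stdint.h>

/* Hypothetical scalar model of the 128-bit pshufb (not a library function).
   For each output byte i:
     - if bit 7 of ctrl[i] is set, the result byte is 0;
     - otherwise the result byte is src[ctrl[i] & 0x0f]. */
void pshufb_model(uint8_t dst[16], const uint8_t src[16], const uint8_t ctrl[16])
{
    uint8_t tmp[16];  /* temporary so that dst may alias src */

    for (int i = 0; i < 16; i++)
        tmp[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0f];

    for (int i = 0; i < 16; i++)
        dst[i] = tmp[i];
}
```

With this model, any control byte of 0x80 or above yields a zero output byte, so a control register filled with something other than the intended map can easily zero out parts of the result.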

It is usually not a good idea to use inline assembly for this sort of thing. Instead, use intrinsic functions:

#include <immintrin.h>

void byte_swap(char dst[16], const char src[16])
{
    __m128i msrc, map, mdst;

    msrc = _mm_loadu_si128((const __m128i *)src);
    map = _mm_setr_epi8(9, 8, 11, 10, 13, 12, 15, 14, 1, 0, 3, 2, 5, 4, 7, 6);
    mdst = _mm_shuffle_epi8(msrc, map);
    _mm_storeu_si128((__m128i *)dst, mdst);
}

Apart from being easier to maintain, this optimizes better because, unlike inline assembly, the compiler can introspect intrinsic functions and make informed decisions about which instructions to emit. For example, on an AVX target, it might emit the VEX-encoded vpshufb instead of pshufb to avoid a stall due to an AVX/SSE transition.
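As a rough illustration of the intrinsic approach applied to the original goal (swapping the two bytes of every 16-bit word), here is a minimal sketch. The function names `bswap16_ssse3` and `bswap16_sse2` are hypothetical, and the mask used is a plain per-word byte swap, whereas the map taken from the question also exchanges the two 64-bit halves of the vector. The second function shows that two shifts and an OR also work on 16-bit lanes with SSE2 alone: the 16-bit shifts discard the bits shifted out rather than saturating. The `target` attribute is a GCC/Clang feature used here so the SSSE3 function compiles without -mssse3; an x86-64 build (where SSE2 is baseline) is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3 intrinsics: _mm_shuffle_epi8 */

/* Hypothetical helper: byte-swap n 16-bit words with pshufb.
   Assumes n is a multiple of 8 (one 128-bit vector holds 8 words). */
__attribute__((target("ssse3")))
void bswap16_ssse3(uint16_t *dst, const uint16_t *src, size_t n)
{
    assert(n % 8 == 0);
    /* Output byte i takes input byte mask[i]; no index has bit 7 set,
       so no output byte is forced to zero. */
    const __m128i mask = _mm_setr_epi8(1, 0, 3, 2, 5, 4, 7, 6,
                                       9, 8, 11, 10, 13, 12, 15, 14);
    for (size_t i = 0; i < n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_shuffle_epi8(v, mask));
    }
}

/* Hypothetical SSE2-only alternative: (v << 8) | (v >> 8) per 16-bit lane. */
void bswap16_sse2(uint16_t *dst, const uint16_t *src, size_t n)
{
    assert(n % 8 == 0);
    for (size_t i = 0; i < n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i swapped = _mm_or_si128(_mm_slli_epi16(v, 8),
                                       _mm_srli_epi16(v, 8));
        _mm_storeu_si128((__m128i *)(dst + i), swapped);
    }
}
```

Called on the first eight values from the question, both functions turn 0x7000 into 0x0070 and 0x4401 into 0x0144.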

If for any reason you cannot use intrinsic functions, use inline assembly like this:

void byte_swap(char dst[16], const char src[16])
{
    typedef long long __m128i_u __attribute__ ((__vector_size__ (16), __may_alias__, __aligned__ (1)));
    static const char map[16] = { 9, 8, 11, 10, 13, 12, 15, 14, 1, 0, 3, 2, 5, 4, 7, 6 };
    __m128i_u data = *(const __m128i_u *)src;

    asm ("pshufb %1, %0" : "+x"(data) : "xm"(* (__m128i_u *)map));
    *(__m128i_u *)dst = data;
}
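Regarding the conversion loop from the edit, the following is a minimal sketch of how one row could be handled with intrinsics only, including the byte swap. It is not a drop-in replacement for the question's code: the function name `row_10be_to_8` and its parameters are hypothetical, the width is assumed to be a multiple of 16 samples, and stride handling is left to the caller. It relies on the fact that `_mm_packus_epi16(a, b)` already packs the eight words of `a` into the low half and the eight words of `b` into the high half of the result, so packing against zeros followed by an unpack is not needed. No claim is made here about how its speed compares with the hand-written loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3 intrinsics: _mm_shuffle_epi8 */

/* Hypothetical helper: convert one row of 10-bit big-endian samples
   (each stored in a 16-bit word) to 8-bit samples.
   width is the number of samples and is assumed to be a multiple of 16. */
__attribute__((target("ssse3")))
void row_10be_to_8(uint8_t *dst, const uint8_t *src, size_t width)
{
    assert(width % 16 == 0);
    /* Plain per-word byte swap: big endian to little endian. */
    const __m128i swap = _mm_setr_epi8(1, 0, 3, 2, 5, 4, 7, 6,
                                       9, 8, 11, 10, 13, 12, 15, 14);
    for (size_t x = 0; x < width; x += 16) {
        /* Two loads of 8 samples (16 bytes) each. */
        __m128i a = _mm_loadu_si128((const __m128i *)(src + 2 * x));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + 2 * x + 16));

        /* Byte swap, then drop the 2 low bits: 10-bit to 8-bit. */
        a = _mm_srli_epi16(_mm_shuffle_epi8(a, swap), 2);
        b = _mm_srli_epi16(_mm_shuffle_epi8(b, swap), 2);

        /* Pack both vectors of 16-bit words into one vector of 16 bytes. */
        _mm_storeu_si128((__m128i *)(dst + x), _mm_packus_epi16(a, b));
    }
}
```

A caller would invoke this once per row and advance `src` and `dst` by their respective strides between rows.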

Answer 1
2 2018-08-28 09:45:52