
`bit_cast` arrays to arrays

Is `bit_cast`ing arrays of one type to another needed to avoid UB? For example, I have a function:

#include <cstddef>
#include <cstdint>
#include <vector>

void func(std::vector<std::int32_t>& dest, std::vector<std::byte>& src, const long stride){
    // view dest's storage as raw bytes
    const auto ptr = reinterpret_cast<std::byte *>(dest.data());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const auto t = ptr + 4 * i;
        t[0] = src[i];
        t[1] = src[i + stride];
        t[2] = src[i + 2 * stride];
        t[3] = src[i + 3 * stride];
    }
}

Do I need to use `bit_cast` instead?

#include <array>
#include <bit>
#include <cstddef>
#include <cstdint>
#include <vector>

void func2(std::vector<std::int32_t>& dest, std::vector<std::byte>& src, const long stride){
    for (std::size_t i = 0; i < src.size(); ++i) {
        // gather four bytes into a suitably aligned buffer, then bit_cast
        alignas(std::int32_t) std::array<std::byte, 4> t;
        t[0] = src[i];
        t[1] = src[i + stride];
        t[2] = src[i + 2 * stride];
        t[3] = src[i + 3 * stride];
        dest[i] = std::bit_cast<std::int32_t>(t);
    }
}

Or use `memcpy`?

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

void func3(std::vector<std::int32_t>& dest, std::vector<std::byte>& src, const long stride){
    for (std::size_t i = 0; i < src.size(); ++i) {
        alignas(std::int32_t) std::byte t[4];
        t[0] = src[i];
        t[1] = src[i + stride];
        t[2] = src[i + 2 * stride];
        t[3] = src[i + 3 * stride];
        // copy the assembled bytes into the destination int
        std::memcpy(&dest[i], t, sizeof t);
    }
}

From my tests, the `bit_cast` and `memcpy` versions seem to have some overhead, and the generated asm is different, which we would expect to be the same for scalar types: https://godbolt.org/z/Y1W585EWY

I don't know whether there's UB in there, but if you can use an unsigned version, you can convert this part:

t[0] = src[i];
t[1] = src[i + stride];
t[2] = src[i + 2 * stride];
t[3] = src[i + 3 * stride];

to this:

dest[i] = src[i] +   
          src[i+stride]   * (uint32_t) 256    + 
          src[i+stride*2] * (uint32_t) 65536  +
          src[i+stride*3] * (uint32_t) 16777216;
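
Spelled out as a complete function (a minimal sketch; func4, the std::to_integer widening, and the tightened loop bound are my additions, not code from this answer), the multiplies above are just shifts by 8/16/24 applied to zero-extended bytes:

#include <cstddef>
#include <cstdint>
#include <vector>

void func4(std::vector<std::int32_t>& dest,
           const std::vector<std::byte>& src,
           const long stride){
    // stop before i + 3 * stride runs past the end of src
    for (std::size_t i = 0; i + 3 * static_cast<std::size_t>(stride) < src.size(); ++i) {
        // std::byte has no arithmetic operators, so widen each byte first
        const auto b0 = std::to_integer<std::uint32_t>(src[i]);
        const auto b1 = std::to_integer<std::uint32_t>(src[i + stride]);
        const auto b2 = std::to_integer<std::uint32_t>(src[i + 2 * stride]);
        const auto b3 = std::to_integer<std::uint32_t>(src[i + 3 * stride]);
        // little-endian byte order, matching the multiply version above
        dest[i] = static_cast<std::int32_t>(b0 | (b1 << 8) | (b2 << 16) | (b3 << 24));
    }
}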

If you need a speedup, you can vectorize the operation:

// for avx512
vector1 = src[i] to src[i+16]
vector2 = src[i+stride] to src[i+stride+16]
vector3 = src[i+stride*2] to src[i+stride*2+16]
vector4 = src[i+stride*3] to src[i+stride*3+16]

then combine them the same way, but in vectorized form:

// either a 16-element vector extension
destVect[i] = vector1 + vector2*256 + ....
// or just run over 16-elements at a time like tiled-computing
for(j from 0 to 15)
   destVect[i+j] = ...
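
As a concrete sketch of this tiled idea in plain C++ (no intrinsics; func_tiled, the raw-pointer signature, and the tile width W are my assumptions, not code from the links below):

#include <cstddef>
#include <cstdint>

void func_tiled(std::int32_t* dst, const unsigned char* src,
                std::size_t n, std::size_t stride)
{
    constexpr std::size_t W = 16; // 16 x int32 = one AVX512 ZMM register
    std::size_t i = 0;
    for (; i + W <= n; i += W) {
        // compilers typically auto-vectorize this inner loop at -O3
        for (std::size_t j = 0; j < W; ++j) {
            dst[i + j] = static_cast<std::int32_t>(
                std::uint32_t(src[i + j]) |
                (std::uint32_t(src[i + j + stride]) << 8) |
                (std::uint32_t(src[i + j + 2 * stride]) << 16) |
                (std::uint32_t(src[i + j + 3 * stride]) << 24));
        }
    }
    for (; i < n; ++i) { // scalar tail for the remainder
        dst[i] = static_cast<std::int32_t>(
            std::uint32_t(src[i]) |
            (std::uint32_t(src[i + stride]) << 8) |
            (std::uint32_t(src[i + 2 * stride]) << 16) |
            (std::uint32_t(src[i + 3 * stride]) << 24));
    }
}

(The caller must guarantee src is readable up to index n - 1 + 3 * stride.)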

You may not even need to use intrinsics explicitly. Just try simple loops working on arrays (or vectors) holding a simd-width number of elements; but encapsulation generally adds code bloat, so you may need to do it on plain bare arrays on the stack.

Some compilers have a default minimum number of loop iterations before they will vectorize, so you should test with different tile widths, or apply a compiler flag that lets them vectorize even small loops.
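
For example (a hedged illustration; the pragma is clang-specific, and gcc users would instead rely on -O3 plus a suitable -march flag), a hint like this asks the vectorizer to handle even a short fixed-trip-count loop:

#include <cstddef>
#include <cstdint>

// Illustrative only: nudge clang to vectorize a 16-iteration tile.
void combine_tile(std::int32_t* dst, const unsigned char* s0, const unsigned char* s1,
                  const unsigned char* s2, const unsigned char* s3)
{
#pragma clang loop vectorize(enable)
    for (std::size_t j = 0; j < 16; ++j)
        dst[j] = static_cast<std::int32_t>(
            std::uint32_t(s0[j]) |
            (std::uint32_t(s1[j]) << 8) |
            (std::uint32_t(s2[j]) << 16) |
            (std::uint32_t(s3[j]) << 24));
}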

Here is a sample solution from a toy auto-vectorized SIMD library: https://godbolt.org/z/qMWsbsrG8

Output:

1000 operations took 5518 nanoseconds

this is with all of the stack-array allocation + kernel-launch overheads.

For 10000 operations ( https://godbolt.org/z/Mz1K75Kj1 ), it takes 2.4 nanoseconds per operation.

Here are 1000 operations using only 16 work-items (a single ZMM register on AVX512): https://godbolt.org/z/r9GTfffG8

simd*1000 operations took 20551 nanoseconds

this is 1.25 nanoseconds per operation (at least on the godbolt.org server). On an FX8150 with a narrower simd value, it takes ~6.9 nanoseconds per operation. If you write a non-encapsulated version, it should produce less code bloat as a result and be faster.

Lastly, use multiple iterations for benchmarking: https://godbolt.org/z/dn9vj9seP

simd*1000 operations took 12367 nanoseconds
simd*1000 operations took 12420 nanoseconds
simd*1000 operations took 12118 nanoseconds
simd*1000 operations took 2753 nanoseconds
simd*1000 operations took 2694 nanoseconds
simd*1000 operations took 2691 nanoseconds
simd*1000 operations took 2839 nanoseconds
simd*1000 operations took 2698 nanoseconds
simd*1000 operations took 2702 nanoseconds
simd*1000 operations took 2711 nanoseconds
simd*1000 operations took 2718 nanoseconds
simd*1000 operations took 2710 nanoseconds

this is 0.17 nanoseconds per operation. 23 GB/s is not too bad for a client-shared server's RAM + a simple loop. Explicitly using AVX intrinsics with no encapsulation should get you the maximum bandwidth of the L1/L2/L3 caches or RAM (depending on dataset size). But beware: if you do this on a real production server with client-sharing, your neighbors will feel the turbo downclocking during the AVX512-accelerated computations (unless integer work doesn't count as a heavy load for the AVX512 pipelines).
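
For reference, here is a hedged sketch of what the explicit-intrinsics route could look like (AVX512F; the function name and the assumption that n is a multiple of 16 are mine, not from this answer). Each step widens 16 bytes to 16 dwords, shifts them into position, and ORs them together, mirroring the vpslld/vpor pattern in the clang output below:

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

void func_avx512(std::int32_t* dst, const unsigned char* src,
                 std::size_t n, std::size_t stride)
{
    for (std::size_t i = 0; i < n; i += 16) {
        // zero-extend 16 bytes from each strided source row to 16 dwords
        __m512i b0 = _mm512_cvtepu8_epi32(
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i)));
        __m512i b1 = _mm512_cvtepu8_epi32(
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i + stride)));
        __m512i b2 = _mm512_cvtepu8_epi32(
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i + 2 * stride)));
        __m512i b3 = _mm512_cvtepu8_epi32(
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i + 3 * stride)));
        // shift each byte lane into place and OR them together
        __m512i v = _mm512_or_si512(
            _mm512_or_si512(b0, _mm512_slli_epi32(b1, 8)),
            _mm512_or_si512(_mm512_slli_epi32(b2, 16),
                            _mm512_slli_epi32(b3, 24)));
        _mm512_storeu_si512(dst + i, v);
    }
}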

Compilers will behave differently. For example, clang produces this:

    .LBB0_4:
    vpslld  zmm1, zmm1, 8 // evil integer bit level hacking?
    vpslld  zmm2, zmm2, 16
    vpslld  zmm3, zmm3, 24
    vpord   zmm0, zmm1, zmm0
    vpternlogd      zmm3, zmm0, zmm2, 254 // what the code?
    vmovdqu64       zmmword ptr [r14 + 4*rax], zmm3
    add     rax, 16
    cmp     rax, 16000
    jne     .LBB0_4

while gcc produces much more code bloat. I don't know why.

(you also read past the end of the src vector by iterating up to the last element and then adding stride)
