Can't use __m128i in Cuda kernel

I am trying to compile a simple program that uses __m128i with CUDA, but when I compile it with nvcc (nvcc test.cu -o test) on Linux, I get the error "__m128i" is a vector, which is not supported in device code. This is the program I am trying to compile:

#include <stdio.h>
#include <emmintrin.h>

__global__ void hello(){
    printf("%d\n", threadIdx.x);
    __m128i x;  // nvcc rejects this line: vector types are not supported in device code
}

int main(){
    hello<<<3,3>>>();
    cudaDeviceSynchronize();  // wait for the kernel so its printf output appears
    return 0;
}

When I type nvcc --version, I get Cuda compilation tools, release 10.2, V10.2.89.

I actually ran into this problem on a larger scale while trying to implement some C++ code using CUDA; that C++ code uses __m128i, and what I have shown is a minimal version of the problem I am facing. So I am wondering if there is a way to use __m128i in a CUDA kernel, or some other alternative. Thanks.

I am wondering if there is a way to use __m128i in a CUDA kernel...

There is not. CUDA has native 128-bit integer types which meet the same alignment properties as __m128i, but a host vector type like __m128i is not supported in device code.
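For instance, the built-in vector types int4 and longlong2 from vector_types.h are 128 bits wide and declared with 16-byte alignment, matching __m128i on both counts. A small host-side check (a sketch, assuming a CUDA toolkit include path):

#include <vector_types.h>  // CUDA built-in vector types (int4, longlong2, ...)

// Both types are 128 bits wide and declared with __align__(16),
// matching the size and alignment guarantees of __m128i.
static_assert(sizeof(int4)       == 16, "int4 is 128 bits");
static_assert(alignof(int4)      == 16, "int4 is 16-byte aligned");
static_assert(sizeof(longlong2)  == 16, "longlong2 is 128 bits");
static_assert(alignof(longlong2) == 16, "longlong2 is 16-byte aligned");

int main(){ return 0; }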

or some other alternative

As noted above, there are 16-byte aligned types which can be used to load and store data, but there is no native 128-bit SIMD intrinsic support in NVIDIA GPUs. Those SIMD instructions which do exist are limited to 32-bit types.
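As a sketch of what that looks like in practice (the kernel name is illustrative; n4 counts int4 elements, and the buffers are assumed 16-byte aligned, which cudaMalloc guarantees), a kernel can move 128 bits per load and per store through int4 while the arithmetic stays scalar:

// One 128-bit load and one 128-bit store per thread; the math itself is
// plain 32-bit scalar code on the four lanes.
__global__ void add_one_vec4(const int4* in, int4* out, int n4){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4){
        int4 v = in[i];              // 128-bit load
        v.x += 1; v.y += 1;
        v.z += 1; v.w += 1;
        out[i] = v;                  // 128-bit store
    }
}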


CPU SIMD is done with short vectors like the 128-bit __m128i. GPU SIMD is done across warps, and is not usually software-visible in the same way as __m128i CPU SIMD; you just write it as scalar code.
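To illustrate, the CUDA counterpart of a vectorized CPU loop is ordinary scalar code per thread; the 32 threads of a warp executing it in lockstep are the GPU's "vector" (a minimal sketch with an illustrative kernel name):

// Each thread handles one element with plain scalar code.
// A warp of 32 threads running this together is where the SIMD happens.
__global__ void add_arrays(const int* a, const int* b, int* out, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)
        out[i] = a[i] + b[i];
}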

Code manually vectorized with __m128i can't be compiled for a GPU. If it has a scalar fallback version, use that, e.g. #undef __SSE2__.
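A common shape for such code is an #ifdef dispatch, in which case the scalar branch is the one a CUDA port can reuse. A hedged sketch (the function name is made up for illustration; it assumes the file is compiled with nvcc, so the __host__ __device__ qualifiers are understood):

#if defined(__SSE2__) && !defined(__CUDA_ARCH__)
#include <emmintrin.h>
// SIMD path: only valid in host code on an x86 target.
inline void add_16_bytes(unsigned char* a, const unsigned char* b){
    __m128i va = _mm_loadu_si128((const __m128i*)a);
    __m128i vb = _mm_loadu_si128((const __m128i*)b);
    _mm_storeu_si128((__m128i*)a, _mm_add_epi8(va, vb));  // 16 x 8-bit adds
}
#else
// Scalar fallback: this is the version that can also compile as device code.
__host__ __device__ inline void add_16_bytes(unsigned char* a, const unsigned char* b){
    for (int i = 0; i < 16; ++i) a[i] += b[i];
}
#endif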

(CUDA SIMD within 32-bit chunks lets you get more use out of the 32-bit-wide ALUs in each GPU execution unit if you have narrow data, like pairs of 16-bit integers or 4x 8-bit integers. So if your SSE intrinsics code uses _mm_add_epi8, you might still benefit from manual vectorization in CUDA, with its 4x 8-bit operations instead of 16x 8-bit.)
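For example, CUDA's byte-wise intrinsic __vadd4 adds four packed 8-bit lanes inside one 32-bit register, the closest device-side analogue of _mm_add_epi8 (a minimal sketch with an illustrative kernel name):

// Each 32-bit word holds four packed 8-bit values; __vadd4 adds them
// lane-wise, i.e. 4 x 8-bit per word vs. 16 x 8-bit per __m128i.
__global__ void add_packed_bytes(const unsigned int* a, const unsigned int* b,
                                 unsigned int* out, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __vadd4(a[i], b[i]);
}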
