
How can I load 128-bit data as fast as possible while staying compatible with both the GPU (CUDA C++) and the CPU (C++)?

I need to load 128 bits of data per thread in CUDA C++. Which approach is best for maximum performance while staying compatible with the CPU version of the code? Will the following ways of accessing the data perform equally?

1: Use two 64-bit loads:

// arr holds 64-bit elements; thread t reads elements 2t and 2t+1
unsigned __int64 src1 = arr[2*threadIdx.x];
unsigned __int64 src2 = arr[2*threadIdx.x + 1];

2: Use a struct of two 64-bit values:

struct T_src { unsigned __int64 src1, src2; };
T_src src = arr[threadIdx.x];

3: Use a CUDA built-in vector type:

ulong2 src = arr[threadIdx.x];

Accessing memory in the GPU's "native" terms, using CUDA-defined types and primitives, is the most likely way to maximize performance. That means option #3 in your question.
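One reason the built-in vector types tend to do well is alignment: a plain struct of two 64-bit integers is only aligned like a single 64-bit integer, so the compiler may have to split the access into two 64-bit loads, whereas a 16-byte-aligned type allows one 128-bit load. The following is a minimal host-side sketch of that difference in plain C++. The names `Plain128` and `Aligned128` are hypothetical, and `alignas(16)` is used here as the standard C++ counterpart of CUDA's `__align__(16)`.

```cpp
#include <cstdint>

// Hypothetical names, for illustration only.

// Option 2 as written in the question: two 64-bit members.
// Its alignment is just that of a single 64-bit integer.
struct Plain128 {
    std::uint64_t src1, src2;
};

// Same members, but forced to 16-byte alignment so that the whole
// struct can be moved with one 128-bit access where the target allows it.
struct alignas(16) Aligned128 {
    std::uint64_t src1, src2;
};

// Both types occupy exactly 128 bits.
static_assert(sizeof(Plain128) == 16, "Plain128 should be 16 bytes");
static_assert(sizeof(Aligned128) == 16, "Aligned128 should be 16 bytes");

// Only the second one guarantees 16-byte alignment.
static_assert(alignof(Plain128) == alignof(std::uint64_t),
              "Plain128 is aligned like its members");
static_assert(alignof(Aligned128) == 16, "Aligned128 is 16-byte aligned");
```

If option 2 is preferred for readability, adding the alignment specifier is likely to bring it close to option 3, though that is worth confirming by inspecting the generated code for the target GPU.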

If you intend to write code that runs on CUDA and can also run on a stand-alone CPU when recompiled, I'd suggest coding for CUDA performance first and then back-porting for host CPU execution. CUDA is pickier than most host CPU architectures about how things must be set up and structured, and the performance benefit of doing things "right" for CUDA will far exceed the cost of doing things slightly suboptimally on the host CPU.

I'd still use option #3 for the CUDA case and define a ulong2 structure for the host CPU case. Copying that structure around on the host CPU will still require two (or four) memory moves behind the scenes, but that is true no matter what you write in the source code. Use the simplest, most readable source style and let the compiler do the heavy lifting.
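A rough sketch of that approach is shown below: one header-style snippet that uses the CUDA built-in type when compiled by nvcc and falls back to a host-defined struct otherwise. The names `Vec128`, `HOST_DEVICE`, `load128`, and `xor_fold` are hypothetical. The sketch also assumes `ulonglong2` rather than `ulong2`, because `unsigned long` is only 32 bits wide on 64-bit Windows, which would make `ulong2` a 64-bit type there.

```cpp
#include <cassert>
#include <cstddef>

#ifdef __CUDACC__
  // Compiled by nvcc: use the built-in 128-bit vector type (members .x, .y).
  #include <vector_types.h>
  typedef ulonglong2 Vec128;
  #define HOST_DEVICE __host__ __device__
#else
  // Plain C++ build: hypothetical stand-in with the same size and alignment.
  struct alignas(16) Vec128 {
      unsigned long long x, y;
  };
  #define HOST_DEVICE
#endif

static_assert(sizeof(Vec128) == 16, "Vec128 must be exactly 128 bits");

// Loads one 128-bit element. The source is identical for host and device;
// the compiler decides how many machine-level moves it takes.
HOST_DEVICE inline Vec128 load128(const Vec128* arr, std::size_t i) {
    return arr[i];
}

// Host-side usage example: XOR all n elements together.
inline Vec128 xor_fold(const Vec128* arr, std::size_t n) {
    assert(arr != nullptr || n == 0);
    Vec128 acc = {0, 0};
    for (std::size_t i = 0; i < n; ++i) {
        Vec128 v = load128(arr, i);
        acc.x ^= v.x;
        acc.y ^= v.y;
    }
    return acc;
}
```

Inside a kernel, the same helper would be called as `load128(arr, threadIdx.x)`, so the per-thread access stays a single 128-bit element read in both builds.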
