
CUDA memory shared among all threads

I started my adventure with CUDA today. I'm trying to share an unsigned int among all the threads. All the threads modify this value. I copied this value to the device using cudaMemcpy, but at the end, when the calculations were finished, I found that the value was equal to 0.

Maybe several threads are writing to this variable at the same time? I'm not sure whether I should use a semaphore, or lock this variable when a thread starts writing, or something else.

EDIT:

It's hard to give more detail, because my question is about how to solve this in general. I'm not actually writing any algorithm, only testing CUDA.

But if you wish... I created a vector which contains some values (unsigned int). I tried to do something like searching for values bigger than a given shared value; when a value from the vector is bigger, I add 1 to it and save it as the shared value.

It looks like this:

__global__ void method(unsigned int *a, unsigned int *b, long long unsigned N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N && a[idx] > *b)   // bounds check first, to avoid reading past the array
        *b = a[idx] + 1;          // unsynchronized write: a data race
}

As I said, it's not useful code, only for testing, but I wonder how to do it...

If the value is in shared memory, it will only be local to the threads that run on a single multiprocessor (i.e., per thread block), NOT to every thread that runs for that kernel. You will definitely need to perform atomic operations (such as atomicAdd, etc.) if you expect the threads to write to the variable simultaneously. Be aware, though, that this will serialize all simultaneous thread requests to write to the variable.
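A minimal sketch of that distinction (kernel and variable names are illustrative, not from the question): a `__shared__` variable exists once per block, while a pointer into global memory is visible to the whole grid; concurrent writes to either need atomics.

```cuda
// Sketch: __shared__ is per-block; a global pointer is grid-wide.
__global__ void scopeDemo(unsigned int *globalCounter) {
    __shared__ unsigned int blockCounter;   // one copy PER BLOCK
    if (threadIdx.x == 0)
        blockCounter = 0;                   // initialize once per block
    __syncthreads();                        // wait until it is initialized

    atomicAdd(&blockCounter, 1u);           // safe among threads of this block
    atomicAdd(globalCounter, 1u);           // safe among all threads of the grid
}
```

After the kernel, `*globalCounter` holds the total thread count of the grid, while each block's `blockCounter` only ever counted that block's threads.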

"My question is in general how to use memory shared globally among all threads."

For reading, you don't need anything special. What you did works; it's faster on Fermi devices because they have a cache, and slower on the others.

If you read the value after other threads have changed it, you have no way to wait for all threads to finish their operations before reading the value you want, so it might not be what you expect. The only way to synchronize a value in global memory between all running threads is to use different kernels: after you change a value you want to share among all threads, the kernel finishes, and you launch a new one that will work with the shared value.
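A sketch of that two-kernel pattern (names are hypothetical): launches issued on the same stream run in order, so the second kernel sees every write the first one made.

```cuda
// Sketch: the kernel boundary acts as a grid-wide synchronization point.
__global__ void updateValue(unsigned int *value) {
    atomicAdd(value, 1u);              // every thread updates the shared value
}

__global__ void useValue(const unsigned int *value, unsigned int *out) {
    out[threadIdx.x] = *value;         // every thread reads the final value
}

void run(unsigned int *d_value, unsigned int *d_out) {
    updateValue<<<64, 256>>>(d_value);
    // No explicit synchronization needed between the launches: kernels on
    // the same (default) stream execute in order.
    useValue<<<1, 256>>>(d_value, d_out);
    cudaDeviceSynchronize();           // wait before using results on the host
}
```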

To make every thread write to the same memory location you must use atomic operations, but keep in mind that you should keep atomic operations to a minimum, as they effectively serialize execution.
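One common way to keep global atomics to a minimum is to accumulate a per-block partial result in shared memory first and issue a single global atomic per block. A sketch (the counting task and names are illustrative, not the question's exact code):

```cuda
// Sketch: one global atomic per BLOCK instead of one per THREAD.
__global__ void countAboveThreshold(const unsigned int *a, unsigned int N,
                                    unsigned int threshold, unsigned int *count) {
    __shared__ unsigned int blockCount;
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N && a[idx] > threshold)
        atomicAdd(&blockCount, 1u);     // contention only within the block
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(count, blockCount);   // single global atomic per block
}
```

The shared-memory atomics still serialize, but only among the threads of one block, which is far cheaper than all blocks contending on one global address.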

For the available atomic functions, read section B.11 of the CUDA C Programming Guide.

What you asked for would be:

__global__ void method(unsigned int *a, unsigned int *b, long long unsigned N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N && a[idx] > *b)    // bounds check first, to avoid reading past the array
        //*b = a[idx]+1;
        atomicAdd(b, a[idx] + 1);  // atomic read-modify-write on the shared value
}

edit - deleted error

Although ideally you don't want to do this - unless you can be sure all the threads are going to take about the same time. See the CUDA threads tutorial.
