CUDA array sorting with Thrust, not enough memory

I'm trying to sort an array using Thrust, but it fails when the array is too big. (I have a GTX 460 with 1 GB of memory.)

I'm using CUDA with C++ integration in VS2012. Here is my code:

my .cpp

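#include <cstdint>   // uint32_t
#include <cstdlib>   // rand, srand
#include <ctime>     // time

// Implemented in the .cu file.
extern "C" void thrust_sort(uint32_t *data, int n);

int main(int argc, char **argv){
    int n = 2<<26;  // 2^27 elements, i.e. 512 MiB of uint32_t
    uint32_t * v = new uint32_t[n];
    srand(static_cast<unsigned>(time(NULL)));
    // Fill the array with pseudo-random keys.
    for (int i = 0; i < n; ++i) {
        v[i] = rand()%n;
    }

    thrust_sort(v, n);

    delete [] v;
    return 0;
}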

my .cu

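#include <cstdint>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>

extern "C"
void thrust_sort(uint32_t *data, int n){
    // Copy to the device, sort there, then copy the result back to the host.
    thrust::device_vector<uint32_t> d_data(data, data + n);
    thrust::stable_sort(d_data.begin(), d_data.end());
    thrust::copy(d_data.begin(), d_data.end(), data);
}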

The program stops working at the start of stable_sort().


  1. How much more memory does stable_sort() need?
  2. Is there a way to fix this (even if it makes the sort a bit slower)?
  3. Is there another sorting algorithm that doesn't require more memory than the original array?

Thanks for your help :)

The literature describes techniques for sorting data that is too big to fit in RAM, such as writing partially sorted runs to files. An example: Sorting a million 32-bit integers in 2MB of RAM using Python.

Your problem is less complicated, since your input fits in RAM but is too much for your GPU. You can solve it with the parallel sorting by regular sampling strategy. This technique has been applied to quicksort, for example.
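To see why the input is too much for the GPU, it helps to work out the sizes involved. The sketch below is only arithmetic: the helper names are made up for illustration, and the factor of two is an assumption (a sort that allocates one scratch buffer as large as its input), not a measured figure for Thrust.

```cpp
#include <cstdint>

// Bytes needed to hold n keys of type uint32_t.
constexpr std::uint64_t input_bytes(std::uint64_t n) {
    return n * sizeof(std::uint32_t);
}

// ASSUMPTION: the sort allocates one scratch buffer as large as the input,
// so the peak device footprint is twice the input size.
constexpr std::uint64_t sort_bytes(std::uint64_t n) {
    return 2 * input_bytes(n);
}

// Largest chunk (in elements) whose input plus scratch buffer fits in
// budget_bytes, under the same assumption.
constexpr std::uint64_t max_chunk_elems(std::uint64_t budget_bytes) {
    return budget_bytes / (2 * sizeof(std::uint32_t));
}
```

For the question's n = 2<<26 (that is, 2^27 keys), input_bytes(n) is 512 MiB and sort_bytes(n) is 1 GiB, which is the entire nominal memory of a 1 GB card before anything else is allocated. With a hypothetical 512 MiB budget, max_chunk_elems gives 2^26 keys per chunk.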

Long story short, you divide the array into smaller sub-arrays that fit in the GPU's memory. Then you sort each of the sub-arrays and, at the end, merge the results based on the premises of the regular sampling approach.
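A simplified, sequential sketch of the regular sampling idea, under some assumptions: `sort_chunk` is a hypothetical stand-in (here just `std::sort`) for whatever sorts one sub-array, such as the question's `thrust_sort`; the pivot choice is a simplified variant; and each bucket is re-sorted rather than merged, to keep the code short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the routine that sorts one sub-array.
void sort_chunk(std::uint32_t* first, std::uint32_t* last) {
    std::sort(first, last);
}

std::vector<std::uint32_t> sort_by_regular_sampling(std::vector<std::uint32_t> v,
                                                    std::size_t p) {
    assert(p > 0);
    const std::size_t n = v.size();
    if (n < p * p) {              // too small to sample; sort directly
        std::sort(v.begin(), v.end());
        return v;
    }

    // 1. Split into p chunks and sort each chunk independently.
    std::vector<std::size_t> bounds(p + 1);
    for (std::size_t i = 0; i <= p; ++i) bounds[i] = i * n / p;
    for (std::size_t i = 0; i < p; ++i)
        sort_chunk(v.data() + bounds[i], v.data() + bounds[i + 1]);

    // 2. Take p regularly spaced samples from each sorted chunk.
    std::vector<std::uint32_t> samples;
    for (std::size_t i = 0; i < p; ++i) {
        const std::size_t len = bounds[i + 1] - bounds[i];
        for (std::size_t k = 0; k < p; ++k)
            samples.push_back(v[bounds[i] + k * len / p]);
    }
    std::sort(samples.begin(), samples.end());

    // 3. Pick p - 1 pivots at regular intervals among the sorted samples.
    std::vector<std::uint32_t> pivots;
    for (std::size_t k = 1; k < p; ++k) pivots.push_back(samples[k * p]);

    // 4. Bucket j gathers, from every chunk, the keys in (pivot[j-1], pivot[j]].
    //    Every key in bucket j is smaller than every key in bucket j + 1, so
    //    sorting each bucket and concatenating yields the sorted array.
    std::vector<std::uint32_t> out;
    out.reserve(n);
    std::vector<const std::uint32_t*> cut(p);   // next unconsumed key per chunk
    for (std::size_t i = 0; i < p; ++i) cut[i] = v.data() + bounds[i];
    for (std::size_t j = 0; j < p; ++j) {
        const std::size_t bucket_start = out.size();
        for (std::size_t i = 0; i < p; ++i) {
            const std::uint32_t* end = v.data() + bounds[i + 1];
            const std::uint32_t* hi =
                (j + 1 < p) ? std::upper_bound(cut[i], end, pivots[j]) : end;
            out.insert(out.end(), cut[i], hi);
            cut[i] = hi;
        }
        sort_chunk(out.data() + bucket_start, out.data() + out.size());
    }
    return out;
}
```

The point of the sampling step is that the buckets tend to come out close to n/p keys each, so each one can again be sorted in limited memory. That balance is not guaranteed, for instance when the input contains many duplicate keys.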

You can use a hybrid approach: sort some of the sub-arrays on the CPU, assigning each one to a different core (using multi-threading), while sending the other sub-arrays to the GPU. You can even distribute this work across different processors using a message passing interface such as MPI. Or you can simply sort the sub-arrays one by one on the GPU and do the final merge step on the CPU, with or without taking advantage of multiple cores.
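The last option, sorting the sub-arrays one by one and merging on the CPU, could look roughly like the sketch below. It uses a plain k-way merge with a min-heap rather than regular sampling. `sort_chunk` is a hypothetical stand-in (here `std::sort`) for the per-chunk sort, which in the question's setup would be a call to `thrust_sort`, and `chunk_elems` is an assumed parameter chosen so that one chunk fits on the device.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for the routine that sorts one sub-array.
void sort_chunk(std::uint32_t* first, std::uint32_t* last) {
    std::sort(first, last);
}

std::vector<std::uint32_t> chunked_sort(std::vector<std::uint32_t> v,
                                        std::size_t chunk_elems) {
    assert(chunk_elems > 0);
    const std::size_t n = v.size();

    // Sort each chunk on its own; chunks start at multiples of chunk_elems.
    for (std::size_t b = 0; b < n; b += chunk_elems)
        sort_chunk(v.data() + b, v.data() + std::min(n, b + chunk_elems));

    // Min-heap of (key, index of that key in v), seeded with the first
    // (smallest) key of every sorted chunk.
    using Item = std::pair<std::uint32_t, std::size_t>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    for (std::size_t b = 0; b < n; b += chunk_elems) heap.push({v[b], b});

    // Repeatedly emit the smallest remaining key and advance within its chunk.
    std::vector<std::uint32_t> out;
    out.reserve(n);
    while (!heap.empty()) {
        const Item top = heap.top();
        heap.pop();
        out.push_back(top.first);
        const std::size_t next = top.second + 1;
        // A multiple of chunk_elems marks the start of the next chunk.
        if (next < n && next % chunk_elems != 0) heap.push({v[next], next});
    }
    return out;
}
```

Because the chunks are independent until the merge, the first loop is also where the hybrid variant would split the work, handing some chunks to CPU threads and others to the GPU. Note that the merge writes into a second host array, so this sketch trades extra host memory for a smaller device footprint.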


 