
How can I compare lots of vectors to each other in CUDA (efficiently)

Intro

I'm trying to write a program which compares vectors to each other. I need to compare each vector to every other vector and, for each pair (a, b), return a vector c where c[i] = a[i] / b[i]. So I need a result vector c for each pair of vectors in the set.

Code -- Simplified

__global__
void compare_vectors(const float *a, const float *b, float *c, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len)
        c[i] = a[i] / b[i];
}

int main()
{
    for (/* ... every pair of vectors ... */) {
        // copy the two vectors to the device, launch, copy the result back
        compare_vectors<<<grid, block>>>(d_a, d_b, d_result, len);
    }
}

Problem

My problem is that doing it this way is slower than doing it on the CPU. Each time I iterate through the for loop, the two comparison vectors are copied to Device memory, and then the result vector is copied back to Host memory.

I want to be able to compare every vector to every other vector, but do it efficiently, and then copy all the results back at once. How can I structure this so that there aren't so many calls to cudaMemcpy ?

Info
I'm new to CUDA, so please have grace if this is super obvious.

I've gone through a number of tutorials and searched around, but all the examples I can find compare two very long vectors, not lots of smaller ones. I can't find a way to do this.

I have around 2,000 vectors to compare, and each vector is compared with every other vector, so roughly 2,000² ≈ 4,000,000 comparisons. Each vector is 100-200 floats long.

Thank you @MartinBonner and @platinum95. Drawing it out on a grid really made things clearer.

You should copy all vectors from the CPU to device memory with a single cudaMemcpy call, and then compute all the divisions in a single kernel launch. In the kernel you can launch one thread per vector and have that thread iterate over all the other vectors, computing the division results. Since a GPU runs efficiently with many more than 2,000 threads, you should split the work further: instead of one thread walking all the other vectors, have ten threads each handle a tenth of them.

UPDATE: you don't need to transfer every pair from CPU to GPU. Just create an array big enough to hold all N vectors of M items each, copy the N*M items into it back-to-back on the CPU, then call cudaMemcpy once to move the whole array to the GPU.
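The batching described above might look like the following sketch. The names (`compare_all_pairs`, `d_all`) and the one-block-per-pair launch layout are my illustrative assumptions, not code from the question:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One block per ordered pair (a, b); the block's threads cover the M
// elements of the vectors. All N vectors live back-to-back in `all`,
// so vector v occupies all[v*M .. v*M + M - 1].
__global__ void compare_all_pairs(const float *all, float *results, int N, int M)
{
    int a = blockIdx.x;   // first vector of the pair
    int b = blockIdx.y;   // second vector of the pair
    int i = threadIdx.x;  // element index within the vectors
    if (i < M)
        results[((size_t)a * N + b) * M + i] = all[a * M + i] / all[b * M + i];
}

int main()
{
    const int N = 2000, M = 150;  // sizes from the question
    size_t in_bytes  = (size_t)N * M * sizeof(float);
    size_t out_bytes = (size_t)N * N * M * sizeof(float);

    float *h_all = (float *)malloc(in_bytes);
    float *h_results = (float *)malloc(out_bytes);
    for (size_t k = 0; k < (size_t)N * M; ++k)
        h_all[k] = 1.0f + (float)(k % 7);  // dummy data, nonzero divisors

    float *d_all, *d_results;
    cudaMalloc(&d_all, in_bytes);
    cudaMalloc(&d_results, out_bytes);

    // One copy in, one kernel launch, one copy out -- no per-pair traffic.
    cudaMemcpy(d_all, h_all, in_bytes, cudaMemcpyHostToDevice);
    dim3 grid(N, N);                           // one block per (a, b) pair
    compare_all_pairs<<<grid, M>>>(d_all, d_results, N, M);  // M <= 1024
    cudaMemcpy(h_results, d_results, out_bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_all); cudaFree(d_results);
    free(h_all); free(h_results);
    return 0;
}
```

Note that the full N×N result array is about 2.4 GB for these sizes; in practice you would likely keep only the a &lt; b pairs or process the output in chunks so it fits in device memory.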

tl;dr: Don't do this on a (discrete) GPU

As @talonmies suggests, this problem is not suitable for using a GPU as a coprocessor.

You see, on Intel platforms, a GPU card does not have the same kind of access to main memory that the CPU does; data must be sent to it over the PCIe bus, whose bandwidth is much lower (typical values: ~12 GB/sec, versus 30-40 GB/sec for the CPU's access to main memory). Thus, while the GPU may perform computations faster than the CPU, you only start seeing a benefit if their "density" relative to the amount of data you're moving (often called arithmetic intensity) is high enough.

In your case, you would transfer two vectors for every pair you compare, and copy a result vector back. Even if the GPU performed all of its computations instantaneously, in zero time, it would still be slower than the CPU on this problem because of the time spent copying the results back.

(Also, I really doubt you need all n*(n-1)/2 result vectors; that in itself seems like a sign the design should be rethought.)
