concurrency::parallel_sort overhead and performance hit (rule of thumb)?

Recently I stumbled across a very large performance improvement, roughly 4x, from a one-line code change. I just changed a std::sort call to concurrency::parallel_buffered_sort:

// Get a contiguous vector copy of the pixels from the image.
std::vector<float> vals = image.copyPixels();

// New, fast way. Takes 7 seconds on a test image.
concurrency::parallel_buffered_sort(vals.begin(), vals.end());

// Old, slow way. Takes 30 seconds on a test image.
// std::sort(vals.begin(), vals.end());

This was for a large image and dropped my processing time from 30 seconds to 7 seconds. However, some cases will involve small images, and I don't know if I can or should just do this blindly.

I would like to make some judicious use of parallel_sort, parallel_for and the like but I'm wondering about what threshold needs to be crossed (in terms of number of elements to be sorted/iterated through) before it becomes a help and not a hindrance.

I will eventually go through some lengthy performance testing, but at the moment I don't have a lot of time to do that. I would like to get this working better "most" of the time and not hurting performance any of the time (or at least only rarely).

Can someone out there with some experience in this area give me a reasonable rule of thumb that will help me in "most" cases? Does one exist?

The requirement of random-access iterators, and the presence of overloads with a const size_t _Chunk_size = 2048 parameter that controls the threshold below which sorting is done serially, imply that the library authors are aware of this concern. So simply using the concurrency::parallel_* algorithms as drop-in replacements for their std:: counterparts will probably do fine.
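To make the idea of a serial cutoff concrete, here is a minimal portable sketch. It is not how PPL implements its sorts; cutoff_merge_sort, its depth parameter and the use of std::async are assumptions made purely for illustration. The chunk_size default simply mirrors the 2048 mentioned above.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <iterator>
#include <system_error>

// Hypothetical helper, not part of PPL: a merge sort that only spawns a
// thread while a range holds more than chunk_size elements and the depth
// budget (which bounds the number of threads) is not used up.
template <typename RandomIt>
void cutoff_merge_sort(RandomIt first, RandomIt last,
                       std::size_t chunk_size = 2048, int depth = 3) {
    const auto n = std::distance(first, last);
    if (static_cast<std::size_t>(n) <= chunk_size || depth <= 0) {
        std::sort(first, last);  // small range: serial sort, no thread overhead
        return;
    }
    RandomIt mid = first + n / 2;

    std::future<void> left;
    try {
        // Sort the left half on another thread.
        left = std::async(std::launch::async, [=] {
            cutoff_merge_sort(first, mid, chunk_size, depth - 1);
        });
    } catch (const std::system_error&) {
        std::sort(first, mid);  // no thread available: sort the left half here
    }

    // Sort the right half on the current thread, then wait for the left half.
    cutoff_merge_sort(mid, last, chunk_size, depth - 1);
    if (left.valid()) left.get();

    std::inplace_merge(first, mid, last);  // combine the two sorted halves
}
```

With the defaults, a 1,000-element range never leaves the calling thread, while a multi-million-element range is split across at most eight leaf tasks.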

Here is how I think about it: the Windows thread-scheduling time quantum is roughly 20-60 ms on workstation editions and 120 ms on server editions, so anything that can be done within that much time doesn't need concurrency.

So my guess is that up to about 1k-10k elements you are fine with std::sort, because the latency of launching multiple threads would be overkill. From 10k elements onwards there is a distinct advantage in using parallel_sort or parallel_buffered_sort (if you can afford the extra memory it needs), and parallel_radixsort would probably be great for very large inputs.

Caveats apply. :o)
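If you want to apply that guess without committing to it everywhere, a small dispatcher keeps the cutoff in one place. This is only a sketch: sort_with_cutoff and kParallelCutoff are hypothetical names, and the 10,000 figure is the guess above, not a measured value. The parallel sort is passed in as a callable so the sketch itself does not depend on any particular library.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical cutoff taken from the rule of thumb above; tune after measuring.
constexpr std::size_t kParallelCutoff = 10000;

// Sorts serially below the cutoff and hands larger inputs to parallel_sorter,
// a caller-supplied callable taking (begin, end).
// Returns true if the parallel path was taken.
template <typename ParallelSorter>
bool sort_with_cutoff(std::vector<float>& vals, ParallelSorter parallel_sorter,
                      std::size_t cutoff = kParallelCutoff) {
    if (vals.size() < cutoff) {
        std::sort(vals.begin(), vals.end());  // small input: stay serial
        return false;
    }
    parallel_sorter(vals.begin(), vals.end());  // large input: go parallel
    return true;
}
```

On MSVC with <ppl.h> included, the callable could be a lambda such as [](auto first, auto last) { concurrency::parallel_buffered_sort(first, last); }.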

I don't know about that concurrency namespace, but any sane implementation of a parallel algorithm will adapt appropriately to the size of the input. You shouldn't have to worry about the details of the underlying thread implementation. Just do it.
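Whether a given implementation really adapts well to small inputs is easy to spot-check before relying on it. A minimal sketch, assuming a hypothetical time_sort_ms helper (the name, the fixed seed and the uniform random data are all illustrative choices, not something from the question):

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: milliseconds taken by sorter on n pseudo-random floats.
// sorter is any callable taking (begin, end).
template <typename Sorter>
double time_sort_ms(std::size_t n, Sorter sorter) {
    std::mt19937 rng(12345);  // fixed seed so every sorter sees the same data
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> vals(n);
    for (float& v : vals) v = dist(rng);

    // Time only the sort itself, not the data generation.
    const auto start = std::chrono::steady_clock::now();
    sorter(vals.begin(), vals.end());
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling it with std::sort and with the parallel sort for a handful of sizes (say 1,000 up to 10,000,000 elements) shows where, if anywhere, the parallel version starts to lose on your machine. A single run per size is noisy, so repeat each measurement a few times.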
