
Parallelize output using OpenMP

I've written a C++ app that has to process a lot of data. Using OpenMP I parallelized the processing phase quite well and, embarrassingly, found that the output writing is now the bottleneck. I decided to use a parallel for there as well, as the order in which I output items is irrelevant; they just need to be output as coherent chunks.

Below is a simplified version of the output code, showing all the variables except for two custom iterators in the "collect data in related" loop. My question is: is this the correct and optimal way to solve this problem? I read about the barrier pragma, do I need that?

long i, n = nrows();

#pragma omp parallel for
for (i=0; i<n; i++) {
    std::vector<MyData> related;
    for (size_t j=0; j < data[i].size(); j++)
        related.push_back(data[i][j]);
    std::sort(related.rbegin(), related.rend());

    #pragma omp critical
    {
        std::cout << data[i].label << "\n";
        for (size_t j=0; j<related.size(); j++)
            std::cout << "    " << related[j].label << "\n";
    }
}

(I labeled this question c as I imagine OpenMP is very similar in C and C++. Please correct me if I'm wrong.)

One way to get around output contention is to write the thread-local output to a string stream (which can be done in parallel) and then push the contents to cout (which requires synchronization).

Something like this:

long n = nrows();

#pragma omp parallel for
for (long i = 0; i < n; i++) {
    std::vector<MyData> related;
    for (size_t j=0; j < data[i].size(); j++)
        related.push_back(data[i][j]);
    std::sort(related.rbegin(), related.rend());

    std::stringstream buf;
    buf << data[i].label << "\n";
    for (size_t j=0; j<related.size(); j++)
        buf << "    " << related[j].label << "\n";

    #pragma omp critical
    std::cout << buf.rdbuf();
}

This gives much finer-grained locking, and performance should improve accordingly. It still relies on locking, though. An alternative is to use an array of stream buffers, one per thread, and push them to cout sequentially after the parallel loop. That avoids costly locks entirely, and the output to cout has to be serialized anyway.

You could even try to omit the critical section in the code above. In my experience this can work, since the underlying streams have their own way of controlling concurrency. Since C++11, concurrent insertion into cout is guaranteed to be free of data races, but characters from different threads may interleave, so your chunks would no longer come out coherent; relying on anything stronger is implementation-defined and not portable.

cout contention is still going to be a problem here. Why not write the results to some thread-local storage and collate them centrally into the desired location, so there is no contention? For example, each thread in the parallel code could write to a separate file stream or memory stream, and you simply concatenate them afterwards, since ordering is not important. Or postprocess the results from multiple places instead of one: no contention, and only a single write is required.
