
Parallelize output using OpenMP

I've written a C++ app that has to process a lot of data. Using OpenMP I parallelized the processing phase quite well and, embarrassingly, found that the output writing is now the bottleneck. I decided to use a parallel for there as well, as the order in which I output items is irrelevant; they just need to be output as coherent chunks.

Below is a simplified version of the output code, showing all the variables except for two custom iterators in the "collect data in related" loop. My question is: is this the correct and optimal way to solve this problem? I read about the barrier pragma, do I need that?

long i, n = nrows();

#pragma omp parallel for
for (i=0; i<n; i++) {
    std::vector<MyData> related;
    for (size_t j=0; j < data[i].size(); j++)
        related.push_back(data[i][j]);
    sort(related.rbegin(), related.rend());

    #pragma omp critical
    {
        std::cout << data[i].label << "\n";
        for (size_t j=0; j<related.size(); j++)
            std::cout << "    " << related[j].label << "\n";
    }
}

(I labeled this question c as I imagine OpenMP is very similar in C and C++. Please correct me if I'm wrong.)

One way to get around output contention is to write the thread-local output to a string stream (this can be done in parallel) and then push the contents to cout (which requires synchronization).

Something like this:

long i, n = nrows();

#pragma omp parallel for
for (i=0; i<n; i++) {
    std::vector<MyData> related;
    for (size_t j=0; j < data[i].size(); j++)
        related.push_back(data[i][j]);
    sort(related.rbegin(), related.rend());

    std::stringstream buf;
    buf << data[i].label << "\n";
    for (size_t j=0; j<related.size(); j++)
        buf << "    " << related[j].label << "\n";

    #pragma omp critical
    std::cout << buf.rdbuf();
}

This offers much more fine-grained locking, and performance should increase accordingly. On the other hand, this still uses locking. So another way would be to use an array of stream buffers, one for each thread, and push them to cout sequentially after the parallel loop. This has the advantage of avoiding costly locks, and the output to cout must be serialized anyway.

On the other hand, you can even try to omit the critical section in the above code. In my experience, this works since the underlying streams have their own way of controlling concurrency. But I believe that this behaviour is strictly implementation-defined and not portable.

cout contention is still going to be a problem here. Why not write the results to some thread-local storage and collate them centrally at the desired location, meaning no contention? For example, you could have each thread of the parallel code write to a separate filestream or memory stream and just concatenate them afterwards, since ordering is not important. Or postprocess the results from multiple places instead of one: no contention, and only a single write is required.
