
Parallel fill std::vector with zero

I want to fill a std::vector<int> with zeros using OpenMP. How can I do that quickly?

I heard that looping over the vector to set each element to zero was slow, and std::fill was much faster. Is that still true now?

Fastest way to reset every value of std::vector<int> to 0
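
For reference, the two single-threaded variants being compared look roughly like this (a minimal sketch; the function names are mine):

#include <algorithm>
#include <cstddef>
#include <vector>

void zero_loop(std::vector<int>& v) {
    // Plain element-by-element loop; optimizing compilers often turn
    // this into a memset-like operation anyway.
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = 0;
}

void zero_fill(std::vector<int>& v) {
    // std::fill with value 0 on ints is typically lowered to memset.
    std::fill(v.begin(), v.end(), 0);
}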

Do I have to manually divide the std::vector<int> into regions, use a #pragma omp for loop over the regions on each thread, and then call std::fill inside the loop?

You can split the vector into chunks, one per thread, each of which is filled with std::fill:

#pragma omp parallel
{
    auto tid = omp_get_thread_num();
    auto chunksize = v.size() / omp_get_num_threads();
    auto begin = v.begin() + chunksize * tid;
    // The last thread also takes the remainder when the size is not evenly divisible.
    auto end = (tid == omp_get_num_threads() - 1) ? v.end() : begin + chunksize;
    std::fill(begin, end, 0);
}

You can further improve it by rounding chunksize up to the nearest cache-line / memory-word size (128 bytes = 32 ints), assuming that v.data() is aligned similarly. That way, you avoid any false-sharing issues.
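
A sketch of that rounding, inside the same parallel region as above (the 128-byte figure follows the text; the clamping with std::min is my own addition so the rounded-up chunks never run past the end):

#pragma omp parallel
{
    std::size_t tid = omp_get_thread_num();
    std::size_t nthreads = omp_get_num_threads();

    // 128 bytes = 32 ints: make every chunk a whole number of cache lines
    // so neighbouring threads never write into the same line
    // (assuming v.data() itself is suitably aligned).
    constexpr std::size_t kAlignInts = 128 / sizeof(int);
    std::size_t per_thread = (v.size() + nthreads - 1) / nthreads;
    std::size_t chunksize = (per_thread + kAlignInts - 1) / kAlignInts * kAlignInts;

    // Clamp to the vector size; the last thread(s) may get a shorter
    // or even empty range because of the rounding up.
    auto begin = v.begin() + std::min(chunksize * tid, v.size());
    auto end   = v.begin() + std::min(chunksize * (tid + 1), v.size());
    std::fill(begin, end, 0);
}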

On a dual-socket 24-core Haswell system, I get a speedup of roughly 9x: from 3.6 s for 1 thread to 0.4 s for 24 threads with 4.8 billion ints, i.e. ~48 GB/s. The results vary a bit and this is not a scientific analysis, but it is not too far off the memory bandwidth of the system.
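
A rough way to reproduce that kind of measurement is to time the parallel fill with omp_get_wtime() and divide the bytes written by the elapsed time (a sketch; the vector size here is illustrative, not the 4.8 billion ints used above):

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    std::vector<int> v(1'000'000'000);  // ~4 GB of ints; pick a size that fits your RAM

    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        std::size_t tid = omp_get_thread_num();
        std::size_t nthreads = omp_get_num_threads();
        std::size_t chunksize = v.size() / nthreads;
        auto begin = v.begin() + chunksize * tid;
        auto end = (tid == nthreads - 1) ? v.end() : begin + chunksize;
        std::fill(begin, end, 0);
    }
    double t1 = omp_get_wtime();

    double gb = v.size() * sizeof(int) / 1e9;
    std::printf("%.2f s, %.1f GB/s\n", t1 - t0, gb / (t1 - t0));
}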

For general performance, you should be concerned about dividing your vector the same way not only for this operation, but also for further operations on it (be they reads or writes), if possible. That way, you increase the chance that the data you need is actually in cache, or at least on the same NUMA node.
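
A sketch of what that looks like: compute the per-thread range once with a small helper and use the same partitioning both for the zero-fill and for a later pass over the data (the helper name and the later increment pass are illustrative):

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>
#include <omp.h>

// Illustrative helper: the [begin, end) index range owned by one thread.
static std::pair<std::size_t, std::size_t>
chunk_of(std::size_t n, int tid, int nthreads) {
    std::size_t chunksize = n / nthreads;
    std::size_t begin = chunksize * tid;
    std::size_t end = (tid == nthreads - 1) ? n : begin + chunksize;
    return {begin, end};
}

void zero_then_increment(std::vector<int>& v) {
    #pragma omp parallel
    {
        auto r = chunk_of(v.size(), omp_get_thread_num(), omp_get_num_threads());

        // Zero this thread's chunk ...
        std::fill(v.begin() + r.first, v.begin() + r.second, 0);

        #pragma omp barrier

        // ... and keep working on the same chunk later: the data is more
        // likely to still be in this core's caches or on its NUMA node.
        for (std::size_t i = r.first; i < r.second; ++i)
            v[i] += 1;
    }
}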

Oddly enough, on my system std::fill(..., 1); is faster than std::fill(..., 0) for a single thread, but slower for 24 threads, both with gcc 6.1.0 and icc 17.0.1. I guess I'll post that as a separate question.
