
Why is std::fill(0) slower than std::fill(1)?

I have observed on a system that std::fill on a large std::vector<int> was significantly and consistently slower when setting a constant value 0 compared to a constant value 1 or a dynamic value:

5.8 GiB/s vs 7.5 GiB/s

However, the results are different for smaller data sizes, where fill(0) is faster:

[Figure: single-threaded performance for different data sizes]

With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):

[Figure: performance for various thread counts at large data size]

This raises the secondary question of why the peak bandwidth of fill(1) is so much lower.

The test system was a dual-socket Intel Xeon CPU E5-2680 v3 set at 2.5 GHz (via /sys/cpufreq) with 8x16 GiB DDR4-2133. I tested with GCC 6.1.0 (-O3) and Intel compiler 17.0.1 (-fast); both give identical results. GOMP_CPU_AFFINITY=0,12,1,13,2,14,3,15,4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 was set. Stream/add with 24 threads gets 85 GiB/s on the system.

I was able to reproduce this effect on a different Haswell dual-socket server system, but not on any other architecture. For example, on Sandy Bridge EP memory performance is identical, while in cache fill(0) is much faster.

Here is the code to reproduce:

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <vector>

using value = int;
using vector = std::vector<value>;

constexpr size_t write_size = 8ll * 1024 * 1024 * 1024;
constexpr size_t max_data_size = 4ll * 1024 * 1024 * 1024;

void __attribute__((noinline)) fill0(vector& v) {
    std::fill(v.begin(), v.end(), 0);
}

void __attribute__((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}

void bench(size_t data_size, int nthreads) {
#pragma omp parallel num_threads(nthreads)
    {
        vector v(data_size / (sizeof(value) * nthreads));
        auto repeat = write_size / data_size;
#pragma omp barrier
        auto t0 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill0(v);
#pragma omp barrier
        auto t1 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill1(v);
#pragma omp barrier
        auto t2 = omp_get_wtime();
#pragma omp master
        std::cout << data_size << ", " << nthreads << ", " << write_size / (t1 - t0) << ", "
                  << write_size / (t2 - t1) << "\n";
    }
}

int main(int argc, const char* argv[]) {
    std::cout << "size,nthreads,fill0,fill1\n";
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, 1);
    }
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, omp_get_max_threads());
    }
    for (int nthreads = 1; nthreads <= omp_get_max_threads(); nthreads++) {
        bench(max_data_size, nthreads);
    }
}

Presented results were compiled with g++ fillbench.cpp -O3 -o fillbench_gcc -fopenmp.

From your question + the compiler-generated asm from your answer:

  • fill(0) is an ERMSB rep stosb, which will use 256b stores in an optimized microcoded loop. (Works best if the buffer is aligned, probably to at least 32B or maybe 64B.)
  • fill(1) is a simple 128-bit movaps vector store loop. Only one store can execute per core clock cycle regardless of width, up to 256b AVX. So 128b stores can only fill half of Haswell's L1D cache write bandwidth. This is why fill(0) is about 2x as fast for buffers up to ~32 kiB. Compile with -march=haswell or -march=native to fix that (see the sketch after this list).

    Haswell can just barely keep up with the loop overhead, but it can still run 1 store per clock even though it's not unrolled at all. But at 4 fused-domain uops per clock, that's a lot of filler taking up space in the out-of-order window. Some unrolling would maybe let TLB misses start resolving farther ahead of where stores are happening, since there is more throughput for store-address uops than for store-data. Unrolling might help make up the rest of the difference between ERMSB and this vector loop for buffers that fit in L1D. (A comment on the question says that -march=native only helped fill(1) for L1.)
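For illustration, here is a minimal sketch (my code, not from the original post) of what a 256-bit store loop looks like with AVX2 intrinsics, roughly the shape that -march=haswell allows the compiler to emit for fill(1), with a 2x unroll along the lines discussed above. It assumes n is a multiple of 16 ints; the function name is hypothetical:

#include <immintrin.h>
#include <cstddef>

void fill1_avx2(int* p, size_t n) {
    const __m256i ones = _mm256_set1_epi32(1);  // 8 x int32 per 32B store
    for (size_t i = 0; i < n; i += 16) {        // two YMM stores per iteration
        _mm256_storeu_si256((__m256i*)(p + i), ones);
        _mm256_storeu_si256((__m256i*)(p + i + 8), ones);
    }
}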

Note that rep stosd (which could be used to implement fill(1) for int elements) will probably perform the same as rep stosb on Haswell. Although the official documentation only guarantees that ERMSB gives fast rep stosb (but not rep stosd), actual CPUs that support ERMSB use similarly efficient microcode for rep stosd. There is some doubt about IvyBridge, where maybe only the b version is fast. See @BeeOnRope's excellent ERMSB answer for updates on this.

gcc has some x86 tuning options for string ops (like -mstringop-strategy=alg and -mmemset-strategy=strategy), but IDK if any of them will get it to actually emit rep stosd for fill(1). Probably not, since I assume the code starts out as a loop rather than as a memset.


With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):

A normal movaps store to a cold cache line triggers a Read For Ownership (RFO). A lot of real DRAM bandwidth is spent on reading cache lines from memory when movaps writes the first 16 bytes. ERMSB uses a no-RFO protocol for its stores, so the memory controllers are only writing. (Except for miscellaneous reads, like page tables if any page walks miss even in L3 cache, and maybe some load misses in interrupt handlers or whatever.)
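As a rough sanity check (my arithmetic, not from the original post): with RFO stores, every 64 B cache line written costs a 64 B read plus a 64 B write at the memory controller, so at most about half of the raw DRAM bandwidth is usable as fill bandwidth, while a no-RFO protocol spends essentially all of it on writes. That simple 2:1 model lines up reasonably well with the observed multi-threaded peaks of 90 GiB/s (rep stosb, no RFO) vs 51 GiB/s (movaps, RFO).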

@BeeOnRope explains in comments that the difference between regular RFO stores and the RFO-avoiding protocol used by ERMSB has downsides for some ranges of buffer sizes on server CPUs, where there's high latency in the uncore/L3 cache. See also the linked ERMSB answer for more about RFO vs non-RFO, and about the high latency of the uncore (L3/memory) in many-core Intel CPUs being a problem for single-core bandwidth.


movntps (_mm_stream_ps()) stores are weakly ordered, so they can bypass the cache and go straight to memory a whole cache line at a time without ever reading the cache line into L1D. movntps avoids RFOs, like rep stos does. (rep stos stores can reorder with each other, but not outside the boundaries of the instruction.)
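To make the movntps mechanics concrete, here is a minimal sketch (my code; the function name is mine, and it assumes p is 16-byte aligned and n is a multiple of 4 floats). The _mm_sfence() at the end matters because NT stores are weakly ordered and need to be fenced before other threads can rely on seeing the data:

#include <xmmintrin.h>
#include <cstddef>

void fill_nt(float* p, size_t n, float value) {
    const __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(p + i, v);  // movntps: write-combining, no RFO
    _mm_sfence();                 // order NT stores before publication
}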

Your movntps results in your updated answer are surprising. For a single thread with large buffers, your results are movnt >> regular RFO > ERMSB. So it's really weird that the two non-RFO methods are on opposite sides of the plain old stores, and that ERMSB is so far from optimal. I don't currently have an explanation for that. (Edits welcome with an explanation + good evidence.)

As we expected, movnt allows multiple threads to achieve high aggregate store bandwidth, like ERMSB. movnt always goes straight into line-fill buffers and then to memory, so it is much slower for buffer sizes that fit in cache. One 128b vector per clock is enough to easily saturate a single core's no-RFO bandwidth to DRAM. Probably vmovntps ymm (256b) is only a measurable advantage over vmovntps xmm (128b) when storing the results of a CPU-bound AVX 256b-vectorized computation (i.e. only when it saves the trouble of unpacking to 128b).

movnti bandwidth is low because storing in 4B chunks bottlenecks on 1 store uop per clock adding data to the line-fill buffers, not on sending those line-full buffers to DRAM (until you have enough threads to saturate memory bandwidth).


@osgx posted some interesting links in comments:

See also other stuff in the tag wiki.

I'll share my preliminary findings, in the hope of encouraging more detailed answers. I just felt this would be too much as part of the question itself.

The compiler optimizes fill(0) to an internal memset. It cannot do the same for fill(1), since memset only works on bytes.
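The byte-level reason can be made concrete with a small example (mine, not from the original post): an int of 0 is four 0x00 bytes, so memset can produce it, but an int of 1 is not a repetition of any single byte, so memset with 1 produces 0x01010101 instead:

#include <cassert>
#include <cstring>
#include <vector>

int main() {
    std::vector<int> v(4);

    std::memset(v.data(), 0, v.size() * sizeof(int));  // equivalent to fill(0)
    assert(v[0] == 0);

    std::memset(v.data(), 1, v.size() * sizeof(int));  // NOT equivalent to fill(1)
    assert(v[0] == 0x01010101);                        // 16843009, not 1
}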

Specifically, both glibc's __memset_avx2 and __intel_avx_rep_memset are implemented with a single hot instruction:

rep    stos %al,%es:(%rdi)

Whereas the manual loop compiles down to an actual 128-bit instruction:

add    $0x1,%rax
add    $0x10,%rdx
movaps %xmm0,-0x10(%rdx)
cmp    %rax,%r8
ja     400f41

Interestingly, while there is a template/header optimization to implement std::fill via memset for byte types, in this case it is a compiler optimization that transforms the actual loop. Strangely, for a std::vector<char>, gcc begins to optimize fill(1) as well. The Intel compiler does not, despite the memset template specialization.

Since this happens only when the code is actually working in memory rather than in cache, it would appear that the Haswell-EP architecture fails to efficiently consolidate the single-byte writes.

I would appreciate any further insight into the issue and the related micro-architecture details. In particular, it is unclear to me why this behaves so differently for four or more threads and why memset is so much faster in cache.

Update:

Here are the results in comparison with:

  • fill(1) using -march=native (AVX2 vmovdq %ymm0) - it works better in L1, but similar to the movaps %xmm0 version for the other memory levels.
  • Variants of 32, 128 and 256 bit non-temporal stores. They perform consistently with the same performance regardless of the data size. All outperform the other variants in memory, especially for small numbers of threads. 128 bit and 256 bit perform nearly identically; for low numbers of threads, 32 bit performs significantly worse.

For <= 6 threads, vmovnt has a 2x advantage over rep stos when operating in memory.

Single-threaded bandwidth:

[Figure: single-threaded performance by data size]

Aggregate bandwidth in memory:

[Figure: in-memory performance by thread count]

Here is the code used for the additional tests, each with its respective hot loop (in addition to the headers listed earlier, these need <immintrin.h> for the intrinsics and <cassert> for assert):

void __attribute__ ((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}
┌─→add    $0x1,%rax
│  vmovdq %ymm0,(%rdx)
│  add    $0x20,%rdx
│  cmp    %rdi,%rax
└──jb     e0


void __attribute__ ((noinline)) fill1_nt_si32(vector& v) {
    for (auto& elem : v) {
       _mm_stream_si32(&elem, 1);
    }
}
┌─→movnti %ecx,(%rax)
│  add    $0x4,%rax
│  cmp    %rdx,%rax
└──jne    18


void __attribute__ ((noinline)) fill1_nt_si128(vector& v) {
    assert((long)v.data() % 32 == 0); // alignment
    const __m128i buf = _mm_set1_epi32(1);
    size_t i;
    int* data;
    int* end4 = &v[v.size() - (v.size() % 4)];
    int* end = &v[v.size()];
    for (data = v.data(); data < end4; data += 4) {
        _mm_stream_si128((__m128i*)data, buf);
    }
    for (; data < end; data++) {
        *data = 1;
    }
}
┌─→vmovnt %xmm0,(%rdx)
│  add    $0x10,%rdx
│  cmp    %rcx,%rdx
└──jb     40


void __attribute__ ((noinline)) fill1_nt_si256(vector& v) {
    assert((long)v.data() % 32 == 0); // alignment
    const __m256i buf = _mm256_set1_epi32(1);
    size_t i;
    int* data;
    int* end8 = &v[v.size() - (v.size() % 8)];
    int* end = &v[v.size()];
    for (data = v.data(); data < end8; data += 8) {
        _mm256_stream_si256((__m256i*)data, buf);
    }
    for (; data < end; data++) {
        *data = 1;
    }
}
┌─→vmovnt %ymm0,(%rdx)
│  add    $0x20,%rdx
│  cmp    %rcx,%rdx
└──jb     40

Note: I had to do manual pointer calculation in order to get the loops so compact. Otherwise it would do vector indexing within the loop, probably due to the intrinsics confusing the optimizer.
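For contrast, here is my reconstruction (hypothetical, not the original code) of the indexed form that note refers to; recomputing &v[i] inside the loop tended to leave extra address arithmetic in the hot loop. It assumes v.data() is at least 16-byte aligned, as asserted in the original tests, and ignores the scalar tail for brevity:

#include <immintrin.h>
#include <cstddef>
#include <vector>

void fill1_nt_si128_indexed(std::vector<int>& v) {
    const __m128i buf = _mm_set1_epi32(1);
    // indexed access: the optimizer kept per-iteration address computation
    for (std::size_t i = 0; i + 4 <= v.size(); i += 4)
        _mm_stream_si128((__m128i*)&v[i], buf);
}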
