_mm512_storenr_pd and _mm512_storenrngo_pd

Asked 2017-08-16 13:50:16. Tags: intel, intrinsics, xeon-phi, avx512

What is the difference between _mm512_storenrngo_pd and _mm512_storenr_pd?

_mm512_storenr_pd(void * mt, __m512d v):

Stores packed double-precision (64-bit) floating-point elements from v to memory address mt with a no-read hint to the processor.

It is not clear to me what "no-read hint" means. Does it mean that the write is not cache coherent? Does it mean that reusing the data is more expensive, or not coherent?

_mm512_storenrngo_pd(void * mt, __m512d v):

Stores packed double-precision (64-bit) floating-point elements from v to memory address mt with a no-read hint and using a weakly-ordered memory consistency model (stores performed with this function are not globally ordered, and subsequent stores from the same thread can be observed before them).

This is basically the same as storenr_pd, but because it uses a weak consistency model, a processor can see its own writes before any other processor does. But is access by another processor non-coherent, or just more expensive?

1 Answer

Quote from Intel® Xeon Phi™ Coprocessor Vector Microarchitecture:

In general, in order to write to a cache line, the Xeon Phi™ coprocessor needs to read in a cache line before writing to it. This is known as read for ownership (RFO). One problem with this implementation is that the written data is not reused; we unnecessarily take up the BW for reading non-temporal data. The Intel® Xeon Phi™ coprocessor supports instructions that do not read in data if the data is a streaming store. These instructions, VMOVNRAP*, VMOVNRNGOAP* allow one to indicate that the data needs to be written without reading the data first. In the Xeon Phi ISA the VMOVNRAPS/VMOVNRPD instructions are able to optimize the memory BW in case of a cache miss by not going through the unnecessary read step.

The VMOVNRNGOAP* instructions are useful when the programmer tolerates weak write-ordering of the application data; that is, the stores performed by these instructions are not globally ordered. This means that the subsequent write by the same thread can be observed before the VMOVNRNGOAP instructions are executed. A memory-fencing operation should be used in conjunction with this operation if multiple threads are reading and writing to the same location.
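As a minimal sketch of how the two intrinsics from the question might be used for a streaming write, the function below fills a buffer one 512-bit vector (8 doubles) at a time. The names `stream_fill` and `relaxed_order` are made up for illustration, and `__MIC__` is assumed to be the macro the Intel compiler defines for KNC targets; on any other target the code falls back to plain scalar stores, so the no-read behaviour only applies on KNC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__MIC__)          /* assumption: defined when targeting KNC */
#include <immintrin.h>
#endif

/* Fill dst[0..n) with `value`.
 * dst must be 64-byte aligned and n a multiple of 8 (one 512-bit vector).
 * relaxed_order != 0 selects the non-globally-ordered (NGO) variant. */
void stream_fill(double *dst, size_t n, double value, int relaxed_order)
{
    assert(((uintptr_t)dst % 64) == 0 && n % 8 == 0);
#if defined(__MIC__)
    __m512d v = _mm512_set1_pd(value);
    for (size_t i = 0; i < n; i += 8) {
        if (relaxed_order)
            _mm512_storenrngo_pd(dst + i, v);  /* no-read hint, weakly ordered */
        else
            _mm512_storenr_pd(dst + i, v);     /* no-read hint, globally ordered */
    }
#else
    /* Portable fallback: ordinary stores, no no-read hint. */
    (void)relaxed_order;
    for (size_t i = 0; i < n; ++i)
        dst[i] = value;
#endif
}
```

When the NGO variant is used and other threads will read the buffer, the caller is still responsible for issuing a fence afterwards, as the quoted text says.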

It seems that "no-read hint", "streaming store", and "non-temporal stream/store" are used interchangeably in several resources.

So yes, it is a non-cache-coherent write, although on Knights Corner (KNC, the architecture to which both vmovnrap* and vmovnrngoap* belong) the stores go to the L2 cache; they do not bypass all levels of cache.

As explained in the quote above, vmovnrngoap* differs from vmovnrap* in that its weakly-ordered memory consistency model means that the "subsequent write by the same thread can be observed before the VMOVNRNGOAP instructions are executed". So yes, access by another thread or processor is non-coherent, and a fencing operation should be used. Although CPUID can be used as the fencing operation, better options are "LOCK ADD [RSP],0" (a dummy atomic add) or XCHG (which combines a store and a fence).
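The fencing advice can be sketched roughly as follows. `full_fence` and `publish` are hypothetical names; the inline assembly is the GCC-style spelling of the dummy "LOCK ADD [RSP],0", and the C11 `atomic_thread_fence` fallback for other targets is an assumption added for portability, not something the quoted sources describe. The plain store loop stands in for the NGO stores.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Full memory fence. */
static inline void full_fence(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__INTEL_COMPILER))
    /* Dummy locked add of 0 to the top of the stack: "LOCK ADD [RSP],0". */
    __asm__ __volatile__("lock; addl $0, (%%rsp)" ::: "memory", "cc");
#else
    /* Assumption: a sequentially consistent C11 fence as a portable stand-in. */
    atomic_thread_fence(memory_order_seq_cst);
#endif
}

/* Write a payload, fence, then raise a flag that consumers poll. */
void publish(double *data, size_t n, double value, atomic_int *ready)
{
    for (size_t i = 0; i < n; ++i)
        data[i] = value;   /* on KNC this would be the NGO store, 8 doubles per call */
    full_fence();          /* make the payload visible before the flag */
    atomic_store_explicit(ready, 1, memory_order_release);
}
```

A consumer thread would wait until `*ready` reads as 1 (with an acquire load) before touching `data`; without the fence, the weak ordering described above would allow it to see the flag before the payload.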

A few more details:

On KNC, if you use the compiler switch (-opt-streaming-stores always) or the pragma (#pragma vector nontemporal), the generated code will by default be VMOVNRNGOAP*, starting with Composer XE 2013 Update 1.
More quotes, from COMPILER-BASED MEMORY OPTIMIZATIONS FOR HIGH PERFORMANCE COMPUTING SYSTEMS:

NR Stores. The NR store instruction (vmovnr) is a standard vector store instruction that can always be used safely. An NR store instruction that misses in the local cache causes all potential copies of the cache line in remote caches to be invalidated, the cache line to be allocated (but not initialized) at the local cache in exclusive state, and the write-data in the instruction to be written to the cache line. There is no data transfer from main memory, which is what saves memory bandwidth. An NR store instruction and other load and/or store instructions from the same thread are globally ordered, which means that all observers of this sequence of instructions always see the same fixed execution order.

The NR.NGO (non-globally ordered) store instruction (vmovnrngo) relaxes the global ordering constraint of the NR store instruction. This relaxation makes the NR.NGO instruction have a lower latency than the NR instruction, which can be used to achieve higher performance in streaming store-intensive applications. However, removing this restriction means that an NR.NGO store instruction and other load and/or store instructions from the same thread can be observed by two observers to have two different orderings. The use of NR.NGO store instructions is safe only when reordering the order of these instructions is verified not to change the outcome. Otherwise, using NR.NGO stores may lead to incorrect execution. Our compiler can generate NR.NGO store instructions for store instructions that it identifies to have non-temporal behavior. For instance, a parallel loop that is detected to be non-temporal by our compiler can make use of NR.NGO instructions. At the end of such a loop, to ensure all outstanding non-globally ordered stores are completed and all threads have a consistent view of memory, our compiler generates a fence (a lock instruction) after the loop. This fence is needed before continuing execution of the subsequent code fragment to ensure all threads have exactly the same view of memory.
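For the compiler-driven route, a loop marked with the pragma mentioned earlier might look like the sketch below. `scale_stream` is a made-up name, and the `__INTEL_COMPILER` guard is an assumption added so that other compilers simply skip the pragma and emit ordinary stores.

```c
#include <stddef.h>

/* out[i] = a * in[i]; out is written once and not read back soon,
 * which is the non-temporal pattern the quoted text describes. */
void scale_stream(double *restrict out, const double *restrict in,
                  size_t n, double a)
{
#if defined(__INTEL_COMPILER)
#pragma vector nontemporal   /* ask the Intel compiler for streaming stores */
#endif
    for (size_t i = 0; i < n; ++i)
        out[i] = a * in[i];
}
```

Per the quote above, when the compiler chooses NR.NGO stores for such a loop it is also the compiler that emits the fence after the loop, so no explicit fence appears in the source here.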

A general rule of thumb is that non-temporal stores benefit memory blocks that are not reused in the immediate future. So yes, reuse will be expensive in both cases.