
C/C++ intrinsics for non-temporal loads of 32- and 64-bit values on x86_64?

Are there C/C++ intrinsics for non-temporal loads (i.e. loads that bypass the cache and read directly from DRAM) of 32- and 64-bit values on x86_64?

My compiler is MSVC++ 2017, toolset v141, but intrinsics for other compilers are welcome, as are references to the underlying assembly instructions.

At the time of writing (August 2017) there are no non-temporal loads to GP registers.


The only available non-temporal instructions are:

Integer domain

(v)movntdqa (load) despite the name, this instruction loads 128/256/512 bits from a naturally aligned memory location into an xmm/ymm/zmm register, respectively.
(v)movntdq (store) despite the name, this instruction stores an xmm/ymm/zmm register into a naturally aligned 128/256/512-bit memory location, respectively.
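Since the only non-temporal load is the vector one, the closest thing to a 32/64-bit non-temporal load is to load the enclosing 16 bytes and extract the value. The following is only a sketch under assumptions: the helper names and the NT_TARGET_SSE41 macro are made up for illustration, the address must be 16-byte aligned, all 16 bytes are read, and SSE4.1 is required. On ordinary write-back memory the processor is free to treat the hint as a normal load.

```cpp
#include <immintrin.h>  // _mm_stream_load_si128 (SSE4.1), _mm_cvtsi128_si64 (SSE2)
#include <cstdint>

// Hypothetical macro: GCC/Clang need SSE4.1 enabled for the function,
// MSVC accepts the intrinsic without any per-function attribute.
#if defined(__GNUC__) || defined(__clang__)
#define NT_TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define NT_TARGET_SSE41
#endif

// Hypothetical helper: (v)movntdqa load of 16 aligned bytes, then move the
// low 64 bits into a GP register. p must be 16-byte aligned.
NT_TARGET_SSE41
inline std::int64_t nt_load_low64(const void* p) {
    __m128i v = _mm_stream_load_si128(
        const_cast<__m128i*>(static_cast<const __m128i*>(p)));
    return _mm_cvtsi128_si64(v);  // x86_64 only
}

// Same idea for the low 32 bits.
NT_TARGET_SSE41
inline std::int32_t nt_load_low32(const void* p) {
    __m128i v = _mm_stream_load_si128(
        const_cast<__m128i*>(static_cast<const __m128i*>(p)));
    return _mm_cvtsi128_si32(v);
}
```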

GP registers

movnti (store) store a 32/64-bit GP register into a DWORD/QWORD in memory.
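For the store direction, movnti is reachable from C/C++ through _mm_stream_si32 and _mm_stream_si64. A minimal sketch, assuming an x86_64 target; the function name nt_store_pair is hypothetical, and the trailing _mm_sfence is there because non-temporal stores are weakly ordered:

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si32, _mm_stream_si64; pulls in _mm_sfence

// Hypothetical helper: one 32-bit and one 64-bit non-temporal store.
inline void nt_store_pair(int* dst32, long long* dst64, int a, long long b) {
    _mm_stream_si32(dst32, a);  // movnti with a DWORD destination
    _mm_stream_si64(dst64, b);  // movnti with a QWORD destination (x86_64 only)
    _mm_sfence();               // make the NT stores globally visible before later stores
}
```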

MMX registers

movntq (store) store an MMX register into a QWORD in memory.

Floating point domain

(v)movntpd/s (store) (legacy and VEX encoded) stores an xmm/ymm/zmm register into an aligned 128/256/512-bit memory location. Like movntdq, but in the FP domain.

(v)movntpd/s (store) (EVEX encoded) stores an xmm/ymm/zmm register into an aligned 512-bit memory location, clearing the upper unused bits. Like movntdq, but in the FP domain.
(The Intel manuals are contradictory on this.)
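The 128-bit FP-domain stores map to _mm_stream_pd and _mm_stream_ps. A short sketch, assuming 16-byte-aligned destinations; the helper names are hypothetical:

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_pd; also pulls in _mm_stream_ps and _mm_sfence

// Hypothetical helper: movntpd store of two doubles; dst must be 16-byte aligned.
inline void nt_store_2d(double* dst, double lo, double hi) {
    _mm_stream_pd(dst, _mm_set_pd(hi, lo));  // _mm_set_pd takes the high element first
    _mm_sfence();
}

// Hypothetical helper: movntps store of four copies of v; dst must be 16-byte aligned.
inline void nt_store_4f(float* dst, float v) {
    _mm_stream_ps(dst, _mm_set1_ps(v));
    _mm_sfence();
}
```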

Masked movs

(v)maskmovdqu (store) stores the bytes of an xmm register according to the mask in another xmm register.

maskmovq (store) stores the bytes of an MMX register according to the mask in another MMX register (there is no VEX-encoded form).
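The xmm masked store is exposed as _mm_maskmoveu_si128; a byte is written only when the most significant bit of the corresponding mask byte is set. A small sketch, with the hypothetical helper name nt_masked_store16:

```cpp
#include <emmintrin.h>  // SSE2: _mm_maskmoveu_si128

// Hypothetical helper: maskmovdqu store of the selected bytes of data to dst.
// dst does not need to be aligned.
inline void nt_masked_store16(char* dst, __m128i data, __m128i mask) {
    _mm_maskmoveu_si128(data, mask, dst);  // bytes with a clear mask MSB are left untouched
    _mm_sfence();
}
```

For example, a mask of _mm_set_epi32(0, 0, 0, -1) selects only the lowest four bytes.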

Take a look here: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=temporal

void _mm_stream_pi (__m64* mem_addr, __m64 a)
void _mm_stream_si32 (int* mem_addr, int a)

and some others

and

https://msdn.microsoft.com/en-us/library/hh977023.aspx

It is actually the VS2015 documentation, but the VS2017 documentation (at least for me) is strange and disorganised, and I can't find anything there :).

For loads, as far as I know,

void _mm_prefetch (char const* p, int i) is used.

Loads this small only hint to the processor that it should not evict other data from the cache, and they do so without a performance penalty (so even with a non-temporal load, the data will be cached if there is room in the cache, but it will not evict any other data).
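A sketch of how that prefetch might be used, assuming the hint meant here is _MM_HINT_NTA (the non-temporal hint constant from xmmintrin.h). The function name and the prefetch distance are hypothetical and would need tuning:

```cpp
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_NTA
#include <cstddef>
#include <cstdint>

// Hypothetical helper: sums an array while issuing one non-temporal
// prefetch per 64-byte cache line, a fixed distance ahead of the reads.
inline std::uint64_t sum_with_nta_prefetch(const std::uint64_t* data, std::size_t n) {
    const std::size_t kAhead = 64;  // assumed prefetch distance, in elements
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i % 8 == 0 && i + kAhead < n) {
            _mm_prefetch(reinterpret_cast<const char*>(data + i + kAhead), _MM_HINT_NTA);
        }
        sum += data[i];  // ordinary 64-bit load; only the prefetch carries the hint
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same whether or not the processor honours it.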
