
SSE: unaligned load and store that crosses page boundary

2016-06-09 21:27:48 | Tags: c, linux, x86-64, sse, memory-alignment

I read somewhere that before performing an unaligned load or store next to a page boundary (e.g. using the _mm_loadu_si128 / _mm_storeu_si128 intrinsics), code should first check whether the whole vector (in this case 16 bytes) belongs to the same page, and switch to non-vector instructions if it does not. I understand that this is needed to prevent a core dump if the next page does not belong to the process.

But what if both pages belong to the process (e.g. they are part of one buffer, and I know the size of that buffer)? I wrote a small test program that performed an unaligned load and store crossing a page boundary, and it did not crash. Do I always have to check for a page boundary in such a case, or is it enough to make sure I will not overflow the buffer?

Env: Linux, x86_64, gcc

1 answer

Page splits (accesses that cross a page boundary) are bad for performance, but they don't affect the correctness of unaligned accesses. When you know the length ahead of time, it is enough to make sure you don't read past the end of the buffer.
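As a minimal sketch of the known-length case (count_byte is a hypothetical helper, not something from the question), the loop below uses only unaligned loads and never checks for page boundaries. The only guard is the bounds check i + 16 <= len. It assumes GCC on x86-64, where SSE2 is always available, and uses the GCC builtin __builtin_popcount.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: count how many bytes in buf[0..len) equal `value`. */
size_t count_byte(const uint8_t *buf, size_t len, uint8_t value)
{
    assert(buf != NULL || len == 0);

    const __m128i needle = _mm_set1_epi8((char)value);
    size_t count = 0;
    size_t i = 0;

    /* Full vectors. Because i + 16 <= len, every load stays inside the
       buffer, whether or not it straddles a page boundary. */
    for (; i + 16 <= len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, needle));
        count += (size_t)__builtin_popcount((unsigned)mask);
    }

    /* Scalar tail for the last len % 16 bytes, so nothing past the end
       of the buffer is ever read. */
    for (; i < len; i++)
        count += (buf[i] == value);

    return count;
}
```

A load in this loop may well cross a page boundary inside the buffer. That only costs time; it cannot fault as long as the whole buffer is mapped.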

For correctness, you often do need to worry about it when implementing something like strlen, where the loop stops when it finds a sentinel value. That value could be at any position within your vector, so just doing 16B unaligned loads will read past the end of the array. If the terminating 0 is in the last byte of one page, the next page is not readable, and your current-position pointer is unaligned, then a load that includes the 0 byte will also include bytes from the unreadable page, so it will fault.

One solution is to process bytes with scalar code until your pointer is aligned, then load aligned vectors. An aligned load always comes entirely from one page, and also from one cache line. So even though you will read some bytes past the end of the string, you are guaranteed not to fault. Valgrind might be unhappy about it, but standard library strlen implementations use this technique.
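The scalar-then-aligned approach might look roughly like the sketch below. strlen_aligned_sse2 is a hypothetical name, and the code assumes GCC on x86-64 (it uses __builtin_ctz). Note that reading past the terminator is outside what ISO C defines, even though the hardware cannot fault on it. That is one reason real library implementations are usually written in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical sketch: strlen using a scalar prologue and aligned loads. */
size_t strlen_aligned_sse2(const char *s)
{
    assert(s != NULL);
    const char *p = s;

    /* Scalar prologue: check one byte at a time until p is 16-byte aligned. */
    while (((uintptr_t)p & 15) != 0) {
        if (*p == '\0')
            return (size_t)(p - s);
        p++;
    }

    /* Aligned loop: each 16-byte aligned load lies within a single page,
       so it cannot fault if it contains at least one valid byte. */
    const __m128i zero = _mm_setzero_si128();
    for (;;) {
        __m128i v = _mm_load_si128((const __m128i *)p);
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask != 0)  /* lowest set bit = index of the first zero byte */
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
}
```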

Instead of going scalar until the pointer is aligned, you could do one unaligned vector load from the start of the string (as long as it won't cross a page boundary), and then do aligned loads. The first aligned load will overlap the first unaligned load, but that's totally fine for a function like strlen that doesn't care if it sees the same data twice.
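A possible sketch of this variant follows. The names vec16_crosses_page and strlen_overlap_sse2 are hypothetical, and the 4096-byte page size is an assumption (it is the usual base page size on x86-64 Linux, and any larger page size is a multiple of it). As before, this relies on GCC builtins and on hardware behaviour rather than on ISO C guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

#define ASSUMED_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Nonzero if the 16 bytes starting at p span two pages. */
int vec16_crosses_page(const void *p)
{
    return ((uintptr_t)p & (ASSUMED_PAGE_SIZE - 1)) > ASSUMED_PAGE_SIZE - 16;
}

/* Hypothetical sketch: one unaligned load first, then aligned loads. */
size_t strlen_overlap_sse2(const char *s)
{
    assert(s != NULL);
    const __m128i zero = _mm_setzero_si128();
    const char *p;

    if (!vec16_crosses_page(s)) {
        /* The whole vector is in the page that holds s[0], so it cannot fault. */
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask != 0)
            return (size_t)__builtin_ctz(mask);
        /* First 16-byte boundary after s. The next load may re-check bytes
           we already know are nonzero, which is harmless. */
        p = (const char *)(((uintptr_t)s + 16) & ~(uintptr_t)15);
    } else {
        /* Rare case: fall back to scalar code until p is aligned. */
        p = s;
        while (((uintptr_t)p & 15) != 0) {
            if (*p == '\0')
                return (size_t)(p - s);
            p++;
        }
    }

    for (;;) {
        __m128i v = _mm_load_si128((const __m128i *)p);
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        if (mask != 0)
            return (size_t)(p - s) + (size_t)__builtin_ctz(mask);
        p += 16;
    }
}
```

The page check costs one AND and one compare on entry. That is cheap next to a byte-at-a-time prologue of up to 15 iterations.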

It might be worth avoiding page splits for performance reasons. Even if you know your src pointer is misaligned, it's often faster to let the hardware handle cache-line splits. But before Skylake, page splits have an extra ~100 cycles of latency (down to ~5 cycles in Skylake). If you have multiple pointers that can be aligned differently relative to each other, you can't always just use a prologue to align your src (e.g. c[i] = a[i] + b[i], where c is aligned but b isn't).

In that case, it might be worth using a branch to do aligned loads from before and after the page split, and combine them with palignr.
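One way this could look in C is sketched below. load16_avoid_page_split is a hypothetical name, and the 4096-byte page size is an assumption. palignr is an SSSE3 instruction and the _mm_alignr_epi8 intrinsic needs a compile-time constant shift count, hence the switch. The target("ssse3") attribute is a GCC/Clang extension that enables SSSE3 for this one function without requiring -mssse3 for the whole file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 (pulls in the SSE2 header too) */

#define ASSUMED_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Hypothetical sketch: load 16 bytes from p, replacing a page-split load
   with two aligned loads combined by palignr. */
__attribute__((target("ssse3")))
__m128i load16_avoid_page_split(const uint8_t *p)
{
    assert(p != NULL);
    uintptr_t addr = (uintptr_t)p;

    /* Common case: the vector stays within one page, so let the hardware
       handle any misalignment or cache-line split. */
    if ((addr & (ASSUMED_PAGE_SIZE - 1)) <= ASSUMED_PAGE_SIZE - 16)
        return _mm_loadu_si128((const __m128i *)p);

    /* Page-split case: a pointer that crosses a page is never 16-byte
       aligned, so off is in 1..15 here. */
    unsigned off = (unsigned)(addr & 15);
    const __m128i *base = (const __m128i *)(p - off);
    __m128i lo = _mm_load_si128(base);      /* last 16 bytes of the first page  */
    __m128i hi = _mm_load_si128(base + 1);  /* first 16 bytes of the next page */

    /* palignr concatenates hi:lo and shifts right by `off` bytes. */
    switch (off) {
#define SPLIT_CASE(n) case n: return _mm_alignr_epi8(hi, lo, n)
    SPLIT_CASE(1);  SPLIT_CASE(2);  SPLIT_CASE(3);  SPLIT_CASE(4);
    SPLIT_CASE(5);  SPLIT_CASE(6);  SPLIT_CASE(7);  SPLIT_CASE(8);
    SPLIT_CASE(9);  SPLIT_CASE(10); SPLIT_CASE(11); SPLIT_CASE(12);
    SPLIT_CASE(13); SPLIT_CASE(14); SPLIT_CASE(15);
#undef SPLIT_CASE
    default: return lo;  /* not reached */
    }
}
```

In a real loop the misalignment is normally the same on every iteration, so the shift count can be chosen once outside the loop. The per-call switch here only keeps the sketch self-contained.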

A branch mispredict (~15 cycles) is cheaper than the page-split latency, but it delays everything (not just the load). So it might also not be worth it, depending on the hardware and the ratio of computation to memory access.

If you're writing a function that is usually called with aligned pointers, it makes sense to just use unaligned load/store instructions. Any prologue to detect misalignment is just extra overhead for the already-aligned case, and on modern hardware (Nehalem and newer), unaligned load instructions on addresses that turn out to be aligned at runtime perform identically to aligned load instructions. (But you need AVX for unaligned loads to fold into other instructions as memory operands, e.g. vpxor xmm0, xmm1, [rsi].)

By adding code to handle misaligned inputs, you're slowing down the common aligned case to speed up the uncommon misaligned case. Fast hardware support for unaligned loads/stores lets software just leave that to the hardware for the few cases where it does happen.
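For the c[i] = a[i] + b[i] example, "just use unaligned load/store instructions" could look like the sketch below. add_i32 is a hypothetical name, and the code assumes SSE2 on x86-64. There is no alignment prologue at all: if the pointers happen to be aligned the loads run at full speed, and if they are not the hardware handles the splits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical sketch: c[i] = a[i] + b[i] for i in [0, n), with no
   assumptions about the alignment of a, b or c. */
void add_i32(int32_t *c, const int32_t *a, const int32_t *b, size_t n)
{
    assert((c != NULL && a != NULL && b != NULL) || n == 0);

    size_t i = 0;
    /* Four 32-bit elements per iteration, staying within the first n elements. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi32(va, vb));
    }
    /* Scalar tail; unsigned arithmetic wraps the same way paddd does. */
    for (; i < n; i++)
        c[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
}
```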

(If misaligned inputs are common, then it is worth using a prologue to align your input pointer, especially if you're using AVX. With sequential 32B AVX loads from a misaligned pointer, every other load is a cache-line split.)

See Agner Fog's Optimizing Assembly guide for more info, and the other links in the x86 tag wiki.