G++ SSE memory alignment on the stack

Question

I am attempting to rewrite a raytracer using Streaming SIMD Extensions. My original raytracer used inline assembly and movups instructions to load data into the xmm registers. I have read that compiler intrinsics are not significantly slower than inline assembly (I suspect I may even gain speed by avoiding unaligned memory accesses) and are much more portable, so I am attempting to migrate my SSE code to use the intrinsics in xmmintrin.h. The primary class affected is vector, which looks something like this:

#include "xmmintrin.h"
union vector {
    __m128 simd;
    float raw[4];
    //some constructors
    //a bunch of functions and operators
} __attribute__ ((aligned (16)));

I have read previously that the g++ compiler will automatically allocate structs along memory boundaries equal to the size of the largest member variable, but this does not seem to be occurring, and the aligned attribute isn't helping. My research indicates that this is likely because I am allocating a whole bunch of function-local vectors on the stack, and that alignment on the stack is not guaranteed in x86. Is there any way to force this alignment? I should mention that this is running under native x86 Linux on a 32-bit machine, not Cygwin. I intend to implement multithreading in this application further down the line, so declaring the offending vector instances to be static isn't an option. I'm willing to increase the size of my vector data structure, if needed.

Answer 1

The simplest way is std::aligned_storage, which takes the alignment as its second parameter.

If you don't have it yet, you might want to check Boost's version.

Then you can build your union:

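#include <type_traits>  // std::aligned_storage
#include <xmmintrin.h>  // __m128
union vector {
  __m128 simd;
  // aligned_storage is a trait; the aligned POD type is its nested ::type
  std::aligned_storage<16, 16>::type alignment_only;
};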

Finally, if that does not work, you can always create your own little class:

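#include <cstdint>  // std::uintptr_t

// Note: not safe to copy, because the aligned offset depends on the object's own address.
template <typename Type, std::uintptr_t Align> // Align must be a power of 2
class RawStorage
{
public:
  Type* operator->() {
    return reinterpret_cast<Type*>(aligned());
  }

  Type const* operator->() const {
    return reinterpret_cast<Type const*>(aligned());
  }

  Type& operator*() { return *(operator->()); }
  Type const& operator*() const { return *(operator->()); }

private:
  // Returns the first address inside data that is a multiple of Align.
  unsigned char* aligned() const {
    std::uintptr_t const address = reinterpret_cast<std::uintptr_t>(data);
    std::uintptr_t const padding = (Align - address % Align) % Align;
    return const_cast<unsigned char*>(data) + padding;
  }

  unsigned char data[sizeof(Type) + Align - 1];
};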

It will allocate a bit more storage than necessary, but this way alignment is guaranteed.

int main(int argc, char* argv[])
{
  RawStorage<__m128, 16> simd;
  *simd = /* ... */;

  return 0;
}

With luck, the compiler might be able to optimize away the pointer alignment stuff if it detects that the alignment is necessarily right.
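
To make the "a bit more storage than necessary" point concrete, here is a minimal sketch of the round-up arithmetic on plain integers. The helper name align_up is hypothetical (it is not part of the answer's class) and the addresses are made-up values chosen only to show the best and worst cases.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: smallest multiple of align that is >= address.
// Assumes align is a power of 2, as the answer's class does.
constexpr std::uintptr_t align_up(std::uintptr_t address, std::uintptr_t align) {
    return (address + align - 1) & ~(align - 1);
}

// An already aligned address needs no padding.
static_assert(align_up(0x1000, 16) == 0x1000, "no padding expected");
// One byte past a boundary is the worst case: 15 bytes of padding.
static_assert(align_up(0x1001, 16) == 0x1010, "15 bytes of padding expected");
// One byte before a boundary needs a single byte of padding.
static_assert(align_up(0x100F, 16) == 0x1010, "1 byte of padding expected");
```

Since the padding is never more than Align - 1 bytes, a buffer of sizeof(Type) + Align - 1 bytes always has room for one aligned object.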

Answer 2

A few weeks ago, I rewrote an old ray tracing assignment from my university days, updating it to run on 64-bit Linux and to make use of the SIMD instructions. (The old version incidentally ran under DOS on a 486, to give you an idea of when I last did anything with it.)

There may very well be better ways of doing it, but here is what I did ...

typedef float    v4f_t __attribute__((vector_size (16)));

class Vector {
    ...
    union {
        v4f_t     simd;
        float     f[4];
    } __attribute__ ((aligned (16)));

    ...
};

Disassembling my compiled binary showed that it was indeed making use of the movaps instruction.

Hope this helps.
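
For readers who want to try this layout, the following is a small sketch of how such a class might be filled in. It assumes GCC or Clang (the vector_size attribute and the arithmetic and subscript operators on vector types are compiler extensions). The constructor, operator+, operator[] and is_aligned members, and the member name data, are illustrative guesses, not the answerer's actual code.

```cpp
#include <cassert>
#include <cstdint>

// GCC/Clang vector extension, as in the answer: four packed floats, 16 bytes.
typedef float v4f_t __attribute__((vector_size(16)));

class Vector {
public:
    Vector(float x, float y, float z, float w) {
        v4f_t const packed = {x, y, z, w};
        data.simd = packed;
    }

    // Element-wise addition; the compiler is free to emit a single SSE add.
    Vector operator+(Vector const& other) const {
        Vector sum(0.0f, 0.0f, 0.0f, 0.0f);
        sum.data.simd = data.simd + other.data.simd;
        return sum;
    }

    // Reads one lane through the vector extension's subscript operator.
    float operator[](int i) const { return data.simd[i]; }

    // True if this object sits on a 16-byte boundary.
    bool is_aligned() const {
        return reinterpret_cast<std::uintptr_t>(this) % 16 == 0;
    }

private:
    union {
        v4f_t simd;
        float f[4];
    } data __attribute__((aligned(16)));  // hypothetical member name
};
```

Creating two such objects as function locals, adding them, and calling is_aligned() on the result is a quick way to see whether a given compiler honours the alignment on the stack.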

Answer 3

Normally all you should need is:

union vector {
    __m128 simd;
    float raw[4];
};

i.e. no additional __attribute__ ((aligned (16))) is required for the union itself.

This works as expected on pretty much every compiler I've ever used, with the notable exception of gcc 2.95.2 back in the day, which used to screw up stack alignment in some cases.
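
One way to check this claim on a particular compiler is sketched below, assuming an x86 target with xmmintrin.h and a C++11 compiler for alignof and static_assert. The helper names is_aligned_16 and stack_locals_are_aligned are hypothetical, introduced only for this check.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>

union vector {
    __m128 simd;
    float raw[4];
};

// The union takes the strictest alignment among its members, here that of __m128.
static_assert(alignof(vector) == 16, "vector should be 16-byte aligned");
static_assert(sizeof(vector) == 16, "vector should be one SSE register wide");

// Hypothetical helper: true if p sits on a 16-byte boundary.
inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical check: puts a few locals on the stack, uses them with
// intrinsics, and reports whether every one of them is 16-byte aligned.
inline bool stack_locals_are_aligned() {
    vector a, b, c;
    a.simd = _mm_set1_ps(1.0f);
    b.simd = _mm_set1_ps(2.0f);
    c.simd = _mm_add_ps(a.simd, b.simd);
    bool const sum_ok = _mm_cvtss_f32(c.simd) == 3.0f;  // lowest lane of c
    return sum_ok && is_aligned_16(&a) && is_aligned_16(&b) && is_aligned_16(&c);
}
```

If stack_locals_are_aligned() returns false, or the program crashes on an aligned load, the compiler is not keeping the stack aligned for these locals and one of the workarounds from the other answers may be needed.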

Answer 4

I use this union trick all the time with __m128 and it works with GCC on Mac and Visual C++ on Windows, so this must be a bug in the compiler that you use.

The other answers contain good workarounds though.

Answer 5

If you need an array of N of these objects, allocate vector raw[N+1], and use vector* const array = reinterpret_cast<vector*>(reinterpret_cast<intptr_t>(raw+1) & ~15) as the base address of your array. This will always be aligned.
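
The expression above can be wrapped in a small function, sketched here. Vec4 is a hypothetical 16-byte stand-in for the question's vector union, used so the sketch needs no SSE header, and aligned_base is an illustrative name. Like the answer's one-liner, it relies on the usual flat-address behaviour of pointer-to-integer casts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the question's union: 16 bytes, only 4-byte aligned.
struct Vec4 {
    float raw[4];
};
static_assert(sizeof(Vec4) == 16, "the trick assumes 16-byte elements");

// Rounds the address of storage + 1 down to a 16-byte boundary.
// storage must hold N + 1 elements for an aligned array of N elements:
// the result is never below storage, and N elements from it still fit.
inline Vec4* aligned_base(Vec4* storage) {
    std::intptr_t const address = reinterpret_cast<std::intptr_t>(storage + 1);
    return reinterpret_cast<Vec4*>(address & ~static_cast<std::intptr_t>(15));
}
```

The spare element is the price of the adjustment: rounding down moves the base back by at most 15 bytes from raw + 1, so it cannot fall before the start of the storage.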

5 answers

Answer 1: 6, accepted, 2011-02-11 08:33:48
Answer 2: 3, 2011-02-11 19:33:35
Answer 3: 2, 2011-02-11 08:40:42
Answer 4: 1, 2011-02-11 08:37:17
Answer 5: 0, 2011-02-11 05:20:17
