SSE in C++: memory-operand instructions

Question

SSE assembly has the SQRTPS instruction.

The SQRTPS instruction has two forms:

SQRTPS xmm1, xmm2
SQRTPS xmm1, m128

All of the gcc/clang/VS compilers provide the intrinsic _mm_sqrt_ps.

But _mm_sqrt_ps seems to work only on an xmm value that has already been loaded (with _mm_set_ps / _mm_load_ps).

From Visual Studio, for example: http://msdn.microsoft.com/en-us/library/vstudio/8z67bwwk%28v=vs.100%29.aspx

What I expect:

__attribute__((aligned(16))) float data[4];
__attribute__((aligned(16))) float result[4];
asm{
    sqrtps xmm0, data             // DIRECTLY FROM MEMORY
    movaps result, xmm0
}

What I have (in C++):

__attribute__((aligned(16))) float data[4];
__attribute__((aligned(16))) float result[4];
auto xmm = _mm_load_ps(data);   // or _mm_set_ps
xmm = _mm_sqrt_ps(xmm);
_mm_store_ps(&result[0], xmm);

(in asm):

movaps xmm1, data
sqrtps xmm0, xmm1               // FROM REGISTER
movaps result, xmm0

In other words, I would like to see something like this:

__attribute__((aligned(16))) float data[4];
__attribute__((aligned(16))) float result[4];
auto xmm = _mm_sqrt_ps(data);                  // DIRECTLY FROM MEMORY, no need to load (because there is such instruction)
_mm_store_ps(&result[0], xmm);

Answer 1

Quick research: I made the following file, called mysqrt.cpp:

#include <pmmintrin.h>

extern "C" __m128 MySqrt(__m128* a) {
    return _mm_sqrt_ps(a[1]);
}

Trying gcc, namely g++4.8 -msse3 -O3 -S mysqrt.cpp && cat mysqrt.s:

_MySqrt:
LFB526:
    sqrtps  16(%rdi), %xmm0
    ret

Clang (clang++3.6 -msse3 -O3 -S mysqrt.cpp && cat mysqrt.s):

_MySqrt:                                ## @MySqrt
    .cfi_startproc
## BB#0:                                ## %entry
    pushq   %rbp
Ltmp0:
    .cfi_def_cfa_offset 16
Ltmp1:
    .cfi_offset %rbp, -16
    movq    %rsp, %rbp
Ltmp2:
    .cfi_def_cfa_register %rbp
    sqrtps  16(%rdi), %xmm0
    popq    %rbp
    retq

I don't know about VS, but at least both gcc and clang seem to produce the memory form of sqrtps when it fits.

UPDATE: Example of function usage:

#include <iostream>
#include <pmmintrin.h>

extern "C" __m128 MySqrt(__m128* a);

int main() {
    __m128 x[2];
    x[1] = _mm_set_ps1(4);
    __m128 y = MySqrt(x);
    std::cout << y[0] << std::endl;
}

// output:
2

UPDATE 2: Regarding your code, you should just do:

auto xmm = _mm_sqrt_ps(*reinterpret_cast<__m128*>(data));

Of course this is at your own risk: you must guarantee that data contains a valid __m128 and is properly aligned.
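A variant that avoids the cast is to keep the explicit aligned load and let the optimizer fold it. The following is only a minimal sketch: sqrt4 is a hypothetical helper name, not something from the question, and whether the load is actually merged into sqrtps xmm, m128 is still the compiler's decision.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_load_ps, _mm_sqrt_ps, _mm_store_ps
#include <cassert>
#include <cstdint>

// Hypothetical helper (name is an assumption): element-wise square root of
// four floats. Both pointers must be 16-byte aligned.
inline void sqrt4(const float* in, float* out) {
    assert(reinterpret_cast<std::uintptr_t>(in) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);
    // With optimization enabled the compiler may fold this aligned load into
    // the memory form of sqrtps; it is allowed to, but not required to.
    _mm_store_ps(out, _mm_sqrt_ps(_mm_load_ps(in)));
}
```

Calling it with two alignas(16) float[4] arrays and inspecting the -O3 -S output is a quick way to see which form a given compiler picks.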

Answer 2

I think you misunderstood the interface provided by the intrinsic _mm_sqrt_ps(__m128). The argument can be a variable held in memory or in a register. The extension type __m128 acts like any normal built-in type, e.g. double: it is not bound to an xmm register and can also be stored in memory.

EDIT: Unless you use asm, the compiler determines if and when a variable is loaded into a register or left in memory. So, in the following code snippet

__m128 foo(const __m128 x, const __m128*y, std::size_t n)
{
  __m128 result = _mm_set1_ps(1.0f);
  while(n--)
    result = _mm_mul_ps(result,_mm_add_ps(x,_mm_sqrt_ps(*y++)));
  return result;
}

it's up to the compiler which variables are kept in registers. I would expect the compiler to put x and result into xmm registers, but to read *y directly from memory.
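To illustrate that __m128 behaves like an ordinary value type that can live in memory, here is a small sketch. The function name sum_of_sqrts is made up for this example, and the comments about register allocation describe what a compiler is likely to do, not a guarantee.

```cpp
#include <xmmintrin.h>  // SSE: __m128 and the _mm_* intrinsics
#include <cassert>
#include <cstddef>

// __m128 is a 16-byte object with 16-byte alignment, so it can sit in
// arrays, structs, or on the stack like any other type.
static_assert(sizeof(__m128) == 16, "__m128 holds four floats");
static_assert(alignof(__m128) == 16, "__m128 is 16-byte aligned");

// Hypothetical helper: adds up the element-wise square roots of n vectors
// stored in memory.
inline __m128 sum_of_sqrts(const __m128* v, std::size_t n) {
    assert(v != nullptr || n == 0);
    __m128 acc = _mm_setzero_ps();  // likely kept in an xmm register
    for (std::size_t i = 0; i < n; ++i) {
        // v[i] is a memory operand; the compiler may use it directly
        // in sqrtps or load it first.
        acc = _mm_add_ps(acc, _mm_sqrt_ps(v[i]));
    }
    return acc;
}
```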

Answer 3

The answer to your question is that you can't control this, at least for aligned loads, with intrinsics. It's up to the compiler to decide whether it uses SQRTPS xmm1, xmm2 or SQRTPS xmm1, m128. If you want to be 100% certain then you have to write it in assembly. This is one of the deficiencies of intrinsics (at least as they are currently implemented) in my opinion.

Some code can help explain this.

We can get GCC (64-bit with -O3) to generate both versions by using aligned and unaligned loads

float x[4], y[4];
__m128 x4 = _mm_loadu_ps(x);
__m128 y4 = _mm_sqrt_ps(x4);
_mm_storeu_ps(y,y4);

This gives (with Intel syntax)

movups  xmm0, XMMWORD PTR [rdx]
sqrtps  xmm0, xmm0

However, if we do an aligned load we get the other form

alignas(16) float x[4], y[4];   // _mm_load_ps requires 16-byte alignment
__m128 x4 = _mm_load_ps(x);
__m128 y4 = _mm_sqrt_ps(x4);
_mm_storeu_ps(y,y4);

This combines the load and square root into one instruction

sqrtps  xmm0, XMMWORD PTR [rax]

Most people would say "trust the compiler." I disagree. If you're using intrinsics then it should be assumed that YOU know what you're doing and NOT the compiler. Here is an example, difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp, where GCC chose one form and MSVC chose the other form (for multiplication instead of the sqrt) and it made a difference in performance.

So once again, if you're using aligned loads, you can only pray that the compiler does what you want. And then maybe on the next version of the compiler it does something different...
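For completeness, here is roughly what "write it in assembly" could look like with GCC/Clang extended inline asm. This is a sketch under assumptions: sqrt_from_memory is a made-up name, the code relies on the default AT&T asm syntax, targets x86/x86-64 only, and does not apply to MSVC, which has no equivalent syntax on x64.

```cpp
#include <xmmintrin.h>  // SSE: __m128
#include <cassert>

// Hypothetical helper: forces the "sqrtps xmm, m128" form.
// GCC/Clang extended asm, AT&T syntax (source operand first).
inline __m128 sqrt_from_memory(const __m128* src) {
    assert(src != nullptr);
    __m128 dst;
    // "m"  : the input must be a memory operand (16-byte aligned, as
    //        guaranteed by the __m128 type).
    // "=x" : the output goes to an xmm register of the compiler's choice.
    __asm__("sqrtps %1, %0" : "=x"(dst) : "m"(*src));
    return dst;
}
```

The trade-off is that the compiler can no longer schedule or fuse across the asm statement the way it can with intrinsics, so this is only worth it when the instruction form has been measured to matter.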

Answer 1
2 2014-09-01 08:46:11

Answer 2
1 2014-09-01 08:54:51

Answer 3
0 Accepted 2014-09-01 12:47:40
