
SSE instruction MOVSD (extended: floating point scalar & vector operations on x86, x86-64)

I am somehow confused by the MOVSD assembly instruction. I wrote some numerical code computing a matrix multiplication, using ordinary C with no SSE intrinsics. I do not even include the header file for SSE2 intrinsics in the compilation. But when I check the assembler output, I see that:
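Since the original post omitted any code, here is a hypothetical minimal example (the function name and layout are my own, not from the original program) of the kind of plain C — no intrinsics, no SSE headers — that still makes GCC emit MOVSD/MULSD/ADDSD in its x86-64 output at -O2:

```c
/* Naive matrix multiply in plain C; on x86-64, GCC compiles the scalar
 * double arithmetic in the inner loop to movsd/mulsd/addsd on XMM
 * registers even with no vectorization and no intrinsics. */
void matmul(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}
```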

1) the 128-bit vector registers XMM are used; 2) the SSE2 instruction MOVSD is invoked.

I understand that MOVSD essentially operates on a single double-precision floating-point value. It uses only the lower 64 bits of an XMM register and, when loading from memory, sets the upper 64 bits to zero. But I just don't understand two things:

1) I never give the compiler any hint to use SSE2. Besides, I am using GCC, not the Intel compiler. As far as I know, the Intel compiler automatically seeks opportunities for vectorization, but GCC does not. So how does GCC know to use MOVSD? Or, has this x86 instruction been around since long before the SSE instruction set, and is the _mm_load_sd() intrinsic in SSE2 just there to provide backward compatibility for using XMM registers for scalar computation?

2) Why does the compiler not use other floating point registers, either the 80-bit floating point stack or 64-bit floating point registers? Why must it pay the toll of using an XMM register (setting the upper 64 bits to zero and essentially wasting that storage)? Does XMM provide faster access?


By the way, I have another question regarding SSE2. I just can't see the difference between _mm_store_sd() and _mm_storel_sd(). Both store the lower 64-bit value to an address. What is the difference? A performance difference? An alignment difference?
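For reference, here is a small sketch of my own (not from the original post) using _mm_store_sd together with _mm_storel_pd — the latter being the spelling listed in Intel's intrinsics guide — showing that both simply store the low double of an XMM register, with no alignment requirement:

```c
#include <emmintrin.h>

/* Both intrinsics write the low 64-bit double of v to memory; neither
 * requires alignment, and compilers typically emit the same store
 * instruction for both. */
void store_low(__m128d v, double *a, double *b) {
    _mm_store_sd(a, v);    /* store low double */
    _mm_storel_pd(b, v);   /* store low double, same effect */
}
```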

Thank you.


Update 1:

Okay, obviously when I first asked this question I lacked some basic knowledge of how a CPU handles floating point operations, so experts tended to think my question was nonsense. Since I did not include even the shortest sample of C code, people might have found the question vague as well. Here I provide a review as an answer, which will hopefully be useful to anyone unclear about floating point operations on modern CPUs.

A review of floating point scalar/vector processing on modern CPUs

The idea of vector processing dates back to the old vector processors, but those machines have been superseded by modern architectures with cache systems. So we focus on modern CPUs, especially x86 and x86-64. These architectures are the mainstream in high performance scientific computing.

Starting with the 8087 coprocessor (integrated on-die since the 486), Intel provided a floating point stack that can hold floating point numbers up to 80 bits wide. This stack is commonly known as the x87 or 387 floating point "registers", with an associated set of x87 FPU instructions. The x87 stack registers are not real, directly addressable registers like the general purpose registers, because they sit on a stack. Register st(i) is accessed as an offset from the stack top register %st(0), or simply %st. With the help of the FXCH instruction, which swaps the contents of the current stack top %st and some offset register %st(i), random access can be achieved, though FXCH imposes some performance penalty, even if a minimized one. The x87 stack provides high precision by computing intermediate results with 80 bits of precision by default, to minimize roundoff error in numerically unstable algorithms. However, x87 instructions are completely scalar.

The first effort toward vectorization was the MMX instruction set, which implemented integer vector operations. The vector registers under MMX are the 64-bit wide registers MMX0, MMX1, ..., MMX7. Each can hold either one 64-bit integer or multiple smaller integers in a "packed" format: a single instruction can then be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once. So there were the legacy general purpose registers for scalar integer operations, plus the new MMX registers for integer vector operations, with no shared execution resources between them. But MMX did share execution resources with scalar x87 FPU operations: each MMX register corresponded to the lower 64 bits of an x87 register, and the upper 16 bits of the x87 registers went unused. The MMX registers were each directly addressable, but this aliasing made it difficult to mix floating point and integer vector operations in the same application. To maximize performance, programmers often used the processor exclusively in one mode or the other, deferring the relatively slow switch between them as long as possible.

Later, SSE created a separate set of 128-bit wide registers, XMM0–XMM7, alongside the x87 stack. SSE instructions focused exclusively on single-precision (32-bit) floating-point operations; integer vector operations were still performed using the MMX registers and MMX instruction set. But now the two kinds of operations could proceed at the same time, as they shared no execution resources. It is important to know that SSE does not only do floating point vector operations, but also floating point scalar operations. Essentially it provides a new place where floating point operations take place, and the x87 stack is no longer the preferred choice for them. Using XMM registers for scalar floating point operations is faster than using the x87 stack, as all XMM registers are directly accessible, while the x87 stack cannot be randomly accessed without FXCH. When I posted my question, I was clearly unaware of this fact. The other concept I was not clear about is that the general purpose registers are integer/address registers. Even though they are 64 bits wide on x86-64, they cannot hold 64-bit floating point values: the execution unit attached to the general purpose registers is the ALU (arithmetic & logic unit), which does not perform floating point computation.
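As a sketch of this point (my own example, assuming SSE2 is available), the same scalar double addition can be written explicitly with the scalar SSE2 intrinsics — which is essentially what the compiler emits for plain double arithmetic when it targets XMM registers:

```c
#include <emmintrin.h>

/* Scalar double addition carried out entirely in XMM registers:
 * _mm_load_sd is the MOVSD load (upper 64 bits zeroed),
 * _mm_add_sd is ADDSD, operating only on the low doubles. */
double add_scalar(double x, double y) {
    __m128d a = _mm_load_sd(&x);
    __m128d b = _mm_load_sd(&y);
    return _mm_cvtsd_f64(_mm_add_sd(a, b));
}
```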

SSE2 was a major step forward, as it extended the vector data types, so SSE2 instructions, whether scalar or vector, can work with all C standard data types. This extension in fact made MMX obsolete. Also, the x87 stack is no longer as important as it once was. Since there are now two alternative places where floating point operations can take place, you can tell the compiler which one you want. For example for GCC, compiling with the flag

-mfpmath=387

will schedule floating point operations on the legacy x87 stack. Note that this seems to be the default for 32-bit x86, even when SSE is already available. For example, I have an Intel Core 2 Duo laptop made in 2007, which already supported SSE releases up to SSE4, yet GCC by default still used the x87 stack, making scientific computations unnecessarily slower. In this case, we need to compile with the flag

-mfpmath=sse

and GCC will schedule floating point operations on the XMM registers. 64-bit x86-64 users need not worry about this configuration, as it is the default on x86-64. This flag only affects scalar floating point operations. If we have written code using vector instructions and compile the code with the flag

-msse2

then the XMM registers will be the only place where that computation can take place. (Note that on 32-bit x86, -msse2 alone enables the SSE2 instructions but does not by itself change the scalar default; -mfpmath=sse is still needed there.) For more information see GCC's configuration of x86, x86-64. For examples of writing SSE2 C code, see my other post How to ask GCC to completely unroll this loop (ie, peel this loop)?.

The SSE family of instructions, though very useful, is not the latest vector extension. AVX, the Advanced Vector Extensions, enhances SSE by widening the vector registers to 256 bits and providing non-destructive three-operand (and some four-operand) instructions. See number of operands in instruction set if you are unclear what this means. Together with the related FMA extensions, the three-operand form benefits the fused multiply-add (FMA) operation that is ubiquitous in scientific computing by 1) using one fewer register; 2) reducing the explicit data movement between registers; 3) speeding up the FMA computation itself. For an example of using AVX, see @Nominal Animal's answer to my post.
