"invalid instruction operands" on mov ah, word_variable, and using imul on 16-bit numbers

Question

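This is what I'm trying to compute: a_x*b_x + a_y*b_y + a_z*b_z

I'm trying to write a MACRO in assembly that does the computation above.

I'm using WORDs for all of my numbers. Here is my code: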

dotProduct   MACRO  A_X,A_Y,A_Z,B_X,B_Y,B_Z ;a.b (a dot b) = a_x*b_x + a_y*b_y + a_z*b_z
    mov ah, A_X
    mov al, B_X
    imul ax
    mov answer, ax
    mov ah, A_Y
    mov al, B_Y
    imul ax
    add answer, ax
    mov ah, A_Z
    mov al, B_Z
    imul ax
    mov answer, ax

    output answer

ENDM

answer BYTE 40 DUP (0)

But I get the following errors:

Assembling: plane_line.asm
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(1): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(2): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(4): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(5): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(6): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(8): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(9): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(10): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(12): Macro Called From
  plane_line.asm(101): Main Line Code

I believe it has to do with the way I'm handling the registers.

How should I do this instead?

Answer 1

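Both operands of MOV have to be the same size. AL and AH are byte registers.

MASM-style assemblers infer the size of a memory location from the DW you used after the symbol name. That's why it complains about an operand-size mismatch (with a generic, unhelpful error message that also applies to a lot of other problems).

If you actually wanted to load the first byte of A_X into AL, you'd use an override: mov al, BYTE PTR A_X.

But that's not what you want, since you really do want to load 16-bit numbers. The product of two 16-bit numbers can be up to 32 bits wide (e.g. 0xffff^2 is 0xfffe0001). So it's probably a good idea to just do 32-bit math.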
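To check the width claim outside of assembly, here is a minimal C sketch of a widening 16x16 multiply. The function names are made up for illustration and are not part of the question's code.

```c
#include <stdint.h>

/* Unsigned 16x16 -> 32 multiply: widen first, so the full product is kept. */
uint32_t mul_u16_full(uint16_t a, uint16_t b) {
    return (uint32_t)a * (uint32_t)b;
}

/* Signed 16x16 -> 32 multiply: widening here plays the role of sign extension. */
int32_t mul_i16_full(int16_t a, int16_t b) {
    return (int32_t)a * (int32_t)b;
}
```

mul_u16_full(0xffff, 0xffff) returns 0xfffe0001, which needs all 32 bits, so a 16-bit destination cannot hold it.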

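You're also using imul incorrectly: imul ax sets DX:AX = AX * AX (producing a 32-bit result in a pair of registers). To multiply AH * AL and get the result in AX, you would have used imul ah. See the IMUL entry in the instruction-set reference manual. Also see the other links to docs and guides in the x86 tag wiki.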
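A rough C model of the two one-operand forms may make the difference easier to see. The struct and function names are hypothetical, and the casts from unsigned to signed assume the usual two's-complement wraparound.

```c
#include <stdint.h>

/* Register pair DX:AX as written by the 16-bit one-operand form. */
typedef struct { uint16_t dx, ax; } dxax_t;

/* Model of `imul r/m16`: DX:AX = AX * src, both treated as signed. */
dxax_t imul16_one_operand(uint16_t ax, uint16_t src) {
    int32_t product = (int32_t)(int16_t)ax * (int32_t)(int16_t)src;
    dxax_t r;
    r.ax = (uint16_t)((uint32_t)product & 0xffffu);  /* low half  */
    r.dx = (uint16_t)((uint32_t)product >> 16);      /* high half */
    return r;
}

/* Model of `imul r/m8`: AX = AL * src, both treated as signed bytes. */
uint16_t imul8_one_operand(uint8_t al, uint8_t src) {
    int16_t product = (int16_t)((int8_t)al * (int8_t)src);
    return (uint16_t)product;
}
```

With AH = 3 and AL = 5, AX holds 0x0305. imul16_one_operand(0x0305, 0x0305) squares the whole register and yields DX = 9, AX = 0x1E19, while imul8_one_operand(5, 3) yields 15, the AH * AL product the macro was presumably after.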

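The two-operand form of IMUL is easier to use. It works exactly like ADD, with a destination and a source, producing one result. (It doesn't store the high half of the full-multiply result anywhere, but that's fine for this use case.)

To set up for a 32-bit IMUL, use MOVSX to sign-extend from the DW 16-bit memory locations into 32-bit registers.

Anyway, here's what you should do: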

movsx   eax, A_X       ; sign-extend A_X into a 32-bit register
movsx   ecx, B_X       ; use a different register so EAX isn't overwritten
imul    eax, ecx       ; eax = A_X * B_X  (as a 32-bit signed integer)

movsx   edx, A_Y
movsx   ecx, B_Y
imul    edx, ecx       ; edx = A_Y * B_Y  (signed int)
add     eax, edx       ; add to the previous result in eax.

movsx   edx, A_Z
movsx   ecx, B_Z
imul    edx, ecx       ; edx = A_Z * B_Z  (signed int)
add     eax, edx       ; add to the previous result in eax
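For comparison, the same sequence can be sketched in C. This is only a reference model with a hypothetical function name, and it assumes the final sum fits in a signed 32-bit integer.

```c
#include <stdint.h>

/* Reference model of the movsx / imul / add sequence.
 * Assumes the sum of the three products fits in int32_t. */
int32_t dot3_i16(int16_t a_x, int16_t a_y, int16_t a_z,
                 int16_t b_x, int16_t b_y, int16_t b_z) {
    int32_t eax = (int32_t)a_x * (int32_t)b_x;  /* movsx, movsx, imul eax, ecx */
    eax += (int32_t)a_y * (int32_t)b_y;         /* imul edx, ecx ; add eax, edx */
    eax += (int32_t)a_z * (int32_t)b_z;         /* imul edx, ecx ; add eax, edx */
    return eax;
}
```

dot3_i16(1, 2, 3, 4, 5, 6) returns 32, and dot3_i16(300, 0, 0, 300, 0, 0) returns 90000, a value that would not fit in 16 bits.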

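I'm not sure how your "output" function/macro is supposed to work, but storing an integer into an array of bytes, BYTE 40 DUP (0), seems unlikely to be right. You could do it with mov dword ptr [answer], eax, but maybe you should just output eax. Or, if output answer converts EAX to a string stored in answer, then you don't need the mov first.

I'm assuming your numbers are signed 16-bit to start with. This means your dot product can overflow if all inputs are INT16_MIN (i.e. -32768 = 0x8000). 0x8000^2 = 0x40000000, which is more than half of INT32_MAX, so adding just two such products already overflows a signed 32-bit integer. The 32-bit ADDs aren't quite safe, then, but I assume you're OK with that and don't want to add-with-carry.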
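The overflow corner case can be checked with a small C sketch that does the sum in 64 bits. Both helper names are hypothetical.

```c
#include <stdint.h>

/* Exact dot product, computed in 64 bits so it cannot overflow. */
int64_t dot3_exact(int16_t a_x, int16_t a_y, int16_t a_z,
                   int16_t b_x, int16_t b_y, int16_t b_z) {
    return (int64_t)a_x * b_x + (int64_t)a_y * b_y + (int64_t)a_z * b_z;
}

/* Returns 1 if the exact result fits in a signed 32-bit register, else 0. */
int dot3_fits_i32(int16_t a_x, int16_t a_y, int16_t a_z,
                  int16_t b_x, int16_t b_y, int16_t b_z) {
    int64_t d = dot3_exact(a_x, a_y, a_z, b_x, b_y, b_z);
    return d >= INT32_MIN && d <= INT32_MAX;
}
```

With all six inputs at -32768 the exact result is 3 * 0x40000000 = 3221225472, which is outside the signed 32-bit range, so dot3_fits_i32 returns 0 for that input.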

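Another way: we could use 16-bit IMUL instructions, so we can use a memory operand instead of having to load with sign-extension separately. This is a lot less convenient if you do want the full 32-bit result, though, so I'll just illustrate using the low half only.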

mov    ax, A_X
imul   B_X         ; DX:AX  = ax * B_X
mov    cx, ax      ; save the low half of the result somewhere else so we can do another imul B_Y  and  add cx, ax

;or
mov    cx, A_X
imul   cx, B_X     ; result in cx
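Keeping only the low half means the result is the true dot product modulo 2^16. A hedged C model, with a hypothetical function name and unsigned arithmetic so the wraparound is well defined:

```c
#include <stdint.h>

/* Low 16 bits of one product, as left in the destination by `imul cx, B_X`. */
static uint32_t mul_wrap(int16_t a, int16_t b) {
    return (uint32_t)(int32_t)a * (uint32_t)(int32_t)b;
}

/* Dot product keeping only the low 16 bits of the running sum. */
uint16_t dot3_low16(int16_t a_x, int16_t a_y, int16_t a_z,
                    int16_t b_x, int16_t b_y, int16_t b_z) {
    uint32_t sum = mul_wrap(a_x, b_x) + mul_wrap(a_y, b_y) + mul_wrap(a_z, b_z);
    return (uint16_t)(sum & 0xffffu);
}
```

dot3_low16(1, 2, 3, 4, 5, 6) is still 32, but dot3_low16(300, 0, 0, 300, 0, 0) is 24464 rather than 90000, because the upper bits are dropped.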

Stop reading here; the rest of this is not useful for beginners.

The fun way: SSE4.1 has a SIMD horizontal dot-product instruction.

; Assuming A_X, A_Y, and A_Z are stored contiguously, and same for B_XYZ
pmovsxwd   xmm0, qword ptr [A_X]  ; also gets Y and Z, and a high element of garbage
pmovsxwd   xmm1, qword ptr [B_X]  ; sign-extend from 16-bit elements to 32
cvtdq2ps   xmm0, xmm0             ; convert in-place from signed int32 to float
cvtdq2ps   xmm1, xmm1

dpps       xmm0, xmm1,  0b01110001  ; top 4 bits: sum the first 3 elements, ignore the top one.  Low 4 bits: put the result only in the low element

cvtss2si   eax, xmm0              ; convert back to signed 32-bit integer
; eax = dot product = a_x*b_x + a_y*b_y + a_z*b_z.
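A scalar C model of this float round trip is sketched below. The function name is hypothetical, and the model ignores the exact order in which DPPS adds its lanes. One caveat it exposes, which the text above does not discuss: single-precision floats carry 24 significant bits, so products above 2^24 can be rounded.

```c
#include <stdint.h>

/* Scalar model of: sign-extend to int32, convert to float, multiply and
 * sum in single precision, convert back. Assumes the sum fits in int32_t. */
int32_t dot3_via_float(int16_t a_x, int16_t a_y, int16_t a_z,
                       int16_t b_x, int16_t b_y, int16_t b_z) {
    float fa_x = (float)a_x, fa_y = (float)a_y, fa_z = (float)a_z;
    float fb_x = (float)b_x, fb_y = (float)b_y, fb_z = (float)b_z;
    float sum = fa_x * fb_x + fa_y * fb_y + fa_z * fb_z;
    return (int32_t)sum;  /* the sum is integer-valued, so truncation is exact */
}
```

For small inputs the result matches the integer version, e.g. 32 for (1, 2, 3) and (4, 5, 6). For a_x = b_x = 32767 with the other components zero, the exact product is 1073676289 but the float path gives 1073676288.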

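This may actually be slower than the scalar imul code, especially on CPUs that can do two loads per clock and have fast integer multiply (e.g. on Intel SnB-family, imul r32, r32 has 3-cycle latency and a throughput of one per cycle). The scalar version has a lot of instruction-level parallelism: the loads and multiplies are independent, and only the adds that combine the results depend on each other.

DPPS is slow (4 uops and 13c latency on Skylake, but still one per 1.5c throughput).

Integer SIMD dot product (only requiring SSE2):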

;; SSE2
movq       xmm0, qword ptr [A_X]  ; also gets Y and Z, and a high element of garbage
pslldq     xmm0, 2                ; shift the unwanted garbage out into the next element.  [ 0 x y z   garbage 0 0 0 ]
movq       xmm1, qword ptr [B_X]  ; [ x y z garbage  0 0 0 0 ]
pslldq     xmm1, 2
;; The low 64 bits of xmm0 and xmm1 hold the xyz vectors, with a zero element

pmaddwd    xmm0, xmm1               ; vertical 16b*16b => 32b multiply,  and horizontal add of pairs.  [ 0*0+ax*bx   ay*by+az*bz   garbage  garbage ]

pshufd     xmm1, xmm0, 0b00010001   ; swap the low two 32-bit elements, so ay*by+az*bz is at the bottom of xmm1
paddd      xmm0, xmm1

movd       eax, xmm0
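What PMADDWD does to the low four 16-bit lanes can be modeled in C as follows. Both function names are hypothetical, and the model assumes neither pair sum nor the final sum overflows 32 bits.

```c
#include <stdint.h>

/* Model of PMADDWD on four 16-bit lanes: multiply lane by lane to 32 bits,
 * then add adjacent pairs, giving two 32-bit results. */
void pmaddwd_low4(const int16_t a[4], const int16_t b[4], int32_t out[2]) {
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}

/* Model of the whole SSE2 sequence for a 3-element vector. */
int32_t dot3_pmaddwd_model(int16_t a_x, int16_t a_y, int16_t a_z,
                           int16_t b_x, int16_t b_y, int16_t b_z) {
    const int16_t a[4] = {0, a_x, a_y, a_z};  /* layout after pslldq by 2 bytes */
    const int16_t b[4] = {0, b_x, b_y, b_z};
    int32_t pairs[2];
    pmaddwd_low4(a, b, pairs);                /* [0*0 + ax*bx, ay*by + az*bz] */
    return pairs[0] + pairs[1];               /* pshufd + paddd, then movd    */
}
```

dot3_pmaddwd_model(1, 2, 3, 4, 5, 6) returns 32, the same as the scalar version.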

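If you could guarantee that the 2 bytes after A_Z and after B_Z were zero, you could leave out the PSLLDQ byte-shift instructions.

If you don't have to shift a word of garbage out of the low 64 bits, you could usefully do this in MMX registers instead of needing a MOVQ load to get 64 bits zero-extended into a 128-bit register. Then you could use PMADDWD with a memory operand. But then you need EMMS. Also, MMX is obsolete, and Skylake has lower throughput for pmaddwd mm, mm than for pmaddwd xmm, xmm (or 256b ymm).

Everything here has one-cycle latency on recent Intel, except PMADDWD at 5 cycles. (MOVD is 2 cycles, but you could store directly to memory. The loads obviously have latency too, but they're from fixed addresses, so there's no input dependency.)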

Solution 1
2, accepted, 2016-09-19 21:14:35
