
“invalid instruction operands” on mov ah, word_variable, and using imul on 16-bit numbers

Here is what I am trying to achieve: a_x*b_x + a_y*b_y + a_z*b_z

I am trying to make a MACRO in assembly that does the above computation.

I am using WORDs for all of my numbers. Here is my code:

dotProduct   MACRO  A_X,A_Y,A_Z,B_X,B_Y,B_Z ;a.b (a dot b) = a_x*b_x + a_y*b_y + a_z*b_z
    mov ah, A_X
    mov al, B_X
    imul ax
    mov answer, ax
    mov ah, A_Y
    mov al, B_Y
    imul ax
    add answer, ax
    mov ah, A_Z
    mov al, B_Z
    imul ax
    mov answer, ax

    output answer

ENDM

answer BYTE 40 DUP (0)

But I am getting the following errors:

Assembling: plane_line.asm
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(1): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(2): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(4): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(5): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(6): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(8): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(9): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(10): Macro Called From
  plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
 crossProduct(12): Macro Called From
  plane_line.asm(101): Main Line Code

I believe it has to do with the way I am handling the registers.

How should I do this instead?

Both operands of MOV have to be the same size, and AL and AH are byte registers.

MASM-style assemblers infer the size of a memory operand from the directive (WORD / DW) you used after the symbol name. That size mismatch is why it complains (with a generic, unhelpful error message that also applies to a lot of other problems).

If you actually wanted to load the first byte of A_X into AL, you'd use an override: mov al, BYTE PTR A_X.
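A minimal sketch of the size rules (MASM syntax; the variable name is just for illustration):

A_X  WORD  7

mov  ax, A_X             ; OK: 16-bit register matches a WORD variable
mov  al, BYTE PTR A_X    ; OK: explicit override loads only the low byte
;mov al, A_X             ; error A2070: byte register vs. word operand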


But that's not what you want, since you do actually want to load 16-bit numbers. The product of two 16-bit numbers can be up to 32 bits (e.g. unsigned 0xffff^2 is 0xfffe0001), so it's probably a good idea to just do 32-bit math.

You're also using imul incorrectly: imul ax sets DX:AX = AX * AX (producing a 32-bit result in a pair of registers). To multiply AH by AL and get the result in AX, you should have used imul ah. See the insn ref manual entry for IMUL, and the other links to docs and guides in the tag wiki.
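For reference, here's a sketch of what the two one-operand IMUL forms actually compute:

imul ax        ; DX:AX = AX * AX   (16-bit source, 32-bit result split across DX:AX)
imul ah        ; AX    = AL * AH   (8-bit source, 16-bit result in AX)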

The two-operand form of IMUL is easier to use. It works exactly like ADD, with a destination and a source, producing one result. (It doesn't store the high half of the full-multiply result anywhere, but that's fine for this use-case).

To set up for a 32-bit IMUL, use MOVSX to sign-extend from DW 16-bit memory locations into 32-bit registers.

Anyway, here's what you should do:

movsx   eax, A_X       ; sign-extend A_X into a 32-bit register
movsx   ecx, B_X       ; load into a different register so we don't clobber eax
imul    eax, ecx       ; eax = A_X * B_X  (as a 32-bit signed integer)

movsx   edx, A_Y
movsx   ecx, B_Y
imul    edx, ecx       ; edx = A_Y * B_Y  (signed int)
add     eax, edx       ; add to the previous result in eax.

movsx   edx, A_Z
movsx   ecx, B_Z
imul    edx, ecx       ; edx = A_Z * B_Z  (signed int)
add     eax, edx       ; add to the previous result in eax

I'm not sure how your output function / macro is supposed to work, but storing the integer into an array of bytes (BYTE 40 DUP (0)) seems unlikely to be what you want. You could do it with mov dword ptr [answer], eax, but maybe you should just output eax. Or if output answer converts eax to a string stored in answer, then you don't need the mov first.
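Putting it together, the whole macro might look like this (a sketch, assuming answer is re-declared as a DWORD; how the result gets printed depends on your output macro, so that part is left out):

dotProduct MACRO A_X,A_Y,A_Z,B_X,B_Y,B_Z
    movsx eax, A_X
    movsx ecx, B_X
    imul  eax, ecx        ; eax = A_X * B_X
    movsx edx, A_Y
    movsx ecx, B_Y
    imul  edx, ecx
    add   eax, edx        ; + A_Y * B_Y
    movsx edx, A_Z
    movsx ecx, B_Z
    imul  edx, ecx
    add   eax, edx        ; + A_Z * B_Z
    mov   answer, eax
ENDM

answer DWORD 0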

I'm assuming your numbers are signed 16-bit to start with. That means your dot product can overflow if all the inputs are INT16_MIN (i.e. -32768 = 0x8000): 0x8000^2 = 0x40000000, which is more than half of INT32_MAX, so summing three such products wraps a signed 32-bit accumulator. 32-bit ADDs aren't quite safe, but I assume you're OK with that and don't want to add-with-carry.


Another way: we could use 16-bit IMUL instructions, which can take a memory operand directly instead of needing a separate load with sign extension. This is a lot less convenient if you do want the full 32-bit result, though, so I'll just illustrate using the low half only.

mov    ax, A_X
imul   B_X         ; DX:AX  = ax * B_X
mov    cx, ax      ; save the low half of the result somewhere else so we can do another imul B_Y  and  add cx, ax

;or
mov    cx, A_X
imul   cx, B_X     ; result in cx
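Extending that second idiom, a complete low-half-only dot product could look like this (a sketch; each product and the sum are truncated to 16 bits):

mov  cx, A_X
imul cx, B_X       ; cx = low 16 bits of A_X * B_X
mov  ax, A_Y
imul ax, B_Y
add  cx, ax        ; + A_Y * B_Y
mov  ax, A_Z
imul ax, B_Z
add  cx, ax        ; cx = truncated 16-bit dot product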

Stop reading here, the rest of this is not useful for beginners.

The fun way: SSE4.1 has a SIMD horizontal dot-product instruction.

; Assuming A_X, A_Y, and A_Z are stored contiguously, and same for B_XYZ
pmovsxwd   xmm0, qword ptr [A_X]  ; also gets Y and Z, and a high element of garbage
pmovsxwd   xmm1, qword ptr [B_X]  ; sign-extend from 16-bit elements to 32
cvtdq2ps   xmm0, xmm0             ; convert in-place from signed int32 to float
cvtdq2ps   xmm1, xmm1

dpps       xmm0, xmm1,  0b01110001  ; top 4 bits: sum the first 3 elements, ignore the top one.  Low 4 bits: put the result only in the low element

cvtss2si   eax, xmm0              ; convert back to signed 32-bit integer
; eax = dot product = a_x*b_x + a_y*b_y + a_z*b_z.

This may actually be slower than the scalar imul code, especially on CPUs that can do two loads per clock and have fast integer multiply (eg Intel SnB-family has imul r32, r32 latency of 3 cycles, with 1 per cycle throughput). The scalar version has lots of instruction-level parallelism: the loads and multiplies are independent, only the adds to combine the results are dependent on each other.

DPPS is slow (4 uops and 13c latency on Skylake, but still one per 1.5c throughput).


Integer SIMD dot product (only requiring SSE2):

;; SSE2
movq       xmm0, qword ptr [A_X]  ; also gets Y and Z, and a high element of garbage
pslldq     xmm0, 2                ; shift the unwanted garbage out into the next element.  [ 0 x y z   garbage 0 0 0 ]
movq       xmm1, qword ptr [B_X]  ; [ x y z garbage  0 0 0 0 ]
pslldq     xmm1, 2
;; The low 64 bits of xmm0 and xmm1 hold the xyz vectors, with a zero element

pmaddwd    xmm0, xmm1               ; vertical 16b*16b => 32b multiply,  and horizontal add of pairs.  [ 0*0+ax*bx   ay*by+az*bz   garbage  garbage ]

pshufd     xmm1, xmm0, 0b00010001   ; swap the low two 32-bit elements, so ay*by+az*bz is at the bottom of xmm1
paddd      xmm0, xmm1

movd       eax, xmm0

If you could guarantee that the 2 bytes after A_Z and after B_Z were zero, you could leave out the PSLLDQ byte-shift instructions.
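Without the shifts, the pair-sums land in different lanes, so the shuffle constant changes too. A sketch, assuming the word after each Z component really is zero:

movq     xmm0, qword ptr [A_X]   ; [ x y z 0   0 0 0 0 ]
movq     xmm1, qword ptr [B_X]
pmaddwd  xmm0, xmm1              ; [ ax*bx+ay*by   az*bz+0   0   0 ]
pshufd   xmm1, xmm0, 0b00000001  ; broadcast element 1 down to element 0
paddd    xmm0, xmm1
movd     eax, xmm0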

If you don't have to shift a word of garbage out of the low 64, you could usefully do it in an MMX register instead of needing a MOVQ load to get 64 bits zero-extended into a 128-bit register. Then you could PMADDWD with a memory operand. But then you need EMMS. Also, MMX is obsolete, and Skylake has lower throughput for pmaddwd mm, mm than for pmaddwd xmm,xmm (or 256b ymm).

Everything here is one-cycle latency on recent Intel, except 5 cycles for PMADDWD. (MOVD is 2 cycles, but you could store directly to memory. The loads obviously have latency too, but they're from fixed addresses so there's no input dependency.)
