Here is what I am trying to achieve: a_x*b_x + a_y*b_y + a_z*b_z
I am trying to make a MACRO in assembly that does the above computation.
I am using WORDs for all of my numbers. Here is my code:
dotProduct MACRO A_X,A_Y,A_Z,B_X,B_Y,B_Z ;a.b (a dot b) = a_x*b_x + a_y*b_y + a_z*b_z
mov ah, A_X
mov al, B_X
imul ax
mov answer, ax
mov ah, A_Y
mov al, B_Y
imul ax
add answer, ax
mov ah, A_Z
mov al, B_Z
imul ax
mov answer, ax
output answer
ENDM
answer BYTE 40 DUP (0)
But I am getting the following errors:
Assembling: plane_line.asm
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(1): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(2): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(4): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(5): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(6): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(8): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(9): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(10): Macro Called From
plane_line.asm(101): Main Line Code
plane_line.asm(101) : error A2070: invalid instruction operands
crossProduct(12): Macro Called From
plane_line.asm(101): Main Line Code
I believe it has to do with the way I am handling the registers.
How should I do this instead?
Both operands of MOV have to be the same size. AL and AH are byte registers.
MASM-style assemblers infer the size of a memory location from the directive (BYTE/WORD/DWORD, or DB/DW/DD) you used after the symbol name. That's why it complains about an operand-size mismatch (with a generic, unhelpful error message that also applies to a lot of other problems).
If you actually wanted to load the first byte of A_X into AL, you'd use an override: mov al, BYTE PTR A_X.
But that's not what you want, since you do actually want to load 16-bit numbers. The product of two 16-bit numbers can be up to 32 bits (e.g. 0xffff^2 is 0xfffe0001), so it's probably a good idea to just do 32-bit math.
You're also using imul incorrectly: imul ax sets DX:AX = AX * AX (producing a 32-bit result in a pair of registers). To multiply AH * AL and get the result in AX, you should have used imul ah. See the insn ref manual entry for IMUL, and the other links to docs and guides in the x86 tag wiki.
The two-operand form of IMUL is easier to use. It works exactly like ADD, with a destination and a source, producing one result. (It doesn't store the high half of the full-multiply result anywhere, but that's fine for this use-case).
To set up for a 32-bit IMUL, use MOVSX to sign-extend from DW 16-bit memory locations into 32-bit registers.
Anyway, here's what you should do:
movsx eax, A_X ; sign-extend A_X into a 32-bit register
movsx ecx, B_X ; use a different register so we don't overwrite EAX
imul eax, ecx ; eax = A_X * B_X (as a 32-bit signed integer)
movsx edx, A_Y
movsx ecx, B_Y
imul edx, ecx ; edx = A_Y * B_Y (signed int)
add eax, edx ; add to the previous result in eax.
movsx edx, A_Z
movsx ecx, B_Z
imul edx, ecx ; edx = A_Z * B_Z (signed int)
add eax, edx ; add to the previous result in eax
I'm not sure how your output function/macro is supposed to work, but storing the integer into an array of bytes (answer BYTE 40 DUP (0)) seems unlikely to be what you want. You could do it with mov dword ptr [answer], eax, but maybe you should just output eax. Or if output answer converts eax to a string stored in answer, then you don't need the mov first.
I'm assuming your numbers are signed 16-bit to start with. This means your dot product can overflow if all the inputs are INT16_MIN (i.e. -32768 = 0x8000): 0x8000^2 = 0x40000000, which is more than half of INT32_MAX. So 32-bit ADDs aren't quite safe, but I assume you're OK with that and don't want to add-with-carry.
Another way: we could use the 16-bit form of IMUL, which can take a memory operand directly instead of requiring a separate load with sign-extension. This is a lot less convenient if you do want the full 32-bit result, though, so I'll just illustrate using the low half only.
mov ax, A_X
imul B_X ; DX:AX = ax * B_X
mov cx, ax ; save the low half of the result so we can do another imul B_Y and add cx, ax

; or, with two-operand imul:
mov cx, A_X
imul cx, B_X ; result in cx
The fun way: SSE4.1 has a SIMD horizontal dot-product instruction.
; Assuming A_X, A_Y, and A_Z are stored contiguously, and same for B_XYZ
pmovsxwd xmm0, qword ptr [A_X] ; also gets Y and Z, and a high element of garbage
pmovsxwd xmm1, qword ptr [B_X] ; sign-extend from 16-bit elements to 32
cvtdq2ps xmm0, xmm0 ; convert in-place from signed int32 to float
cvtdq2ps xmm1, xmm1
dpps xmm0, xmm1, 0b01110001 ; top 4 bits: sum the first 3 elements, ignore the top one. Low 4 bits: put the result only in the low element
cvtss2si eax, xmm0 ; convert back to signed 32-bit integer
; eax = dot product = a_x*b_x + a_y*b_y + a_z*b_z.
This may actually be slower than the scalar imul code, especially on CPUs that can do two loads per clock and have fast integer multiply (e.g. Intel SnB-family has imul r32, r32 latency of 3 cycles, with 1 per cycle throughput). The scalar version has lots of instruction-level parallelism: the loads and multiplies are independent; only the adds that combine the results depend on each other.
DPPS is slow (4 uops and 13c latency on Skylake, but still one per 1.5c throughput).
Integer SIMD dot product (only requiring SSE2):
;; SSE2
movq xmm0, qword ptr [A_X] ; also gets Y and Z, and a high element of garbage
pslldq xmm0, 2 ; shift the unwanted garbage out into the next element. [ 0 x y z garbage 0 0 0 ]
movq xmm1, qword ptr [B_X] ; [ x y z garbage 0 0 0 0 ]
pslldq xmm1, 2
;; The low 64 bits of xmm0 and xmm1 hold the xyz vectors, with a zero element
pmaddwd xmm0, xmm1 ; vertical 16b*16b => 32b multiply, and horizontal add of pairs. [ 0*0+ax*bx ay*by+az*bz garbage garbage ]
pshufd xmm1, xmm0, 0b00010001 ; swap the low two 32-bit elements, so ay*by+az*bz is at the bottom of xmm1
paddd xmm0, xmm1
movd eax, xmm0
If you could guarantee that the 2 bytes after A_Z and after B_Z were zero, you could leave out the PSLLDQ byte-shift instructions.
If you don't have to shift a word of garbage out of the low 64 bits, you could usefully do this in an MMX register instead of needing a MOVQ load to get 64 bits zero-extended into a 128-bit register. Then you could PMADDWD with a memory operand. But then you'd need EMMS. Also, MMX is obsolete, and Skylake has lower throughput for pmaddwd mm, mm than for pmaddwd xmm, xmm (or 256b ymm).
Everything here is one-cycle latency on recent Intel, except 5 cycles for PMADDWD. (MOVD is 2 cycles, but you could store directly to memory. The loads obviously have latency too, but they're from fixed addresses so there's no input dependency.)