
Assembler function on 64-bit platform on Delphi

I have the following function and need to make it compatible with 64-bit platform:

procedure ExecuteAsm(Tab, Buf: Pointer; Len: DWORD);
asm
     mov   ebx, Tab
     mov   ecx, Len
     mov   edx, Buf
@1:  mov   al,  [edx]
     xlat
     mov   [edx], al
     inc   edx
     dec   ecx
     jnz @1
end;

Delphi XE5 raises error [dcc64 Error] E2107 Operand size mismatch on the lines with Tab and Len parameters.

Unfortunately, I don't know assembler well enough to fix the issue myself. What should I change so that the function compiles successfully?

That assembly code is essentially just doing the following, which would work in both 32bit and 64bit:

procedure ExecuteAsm(Tab, Buf: Pointer; Len: DWORD);
var
  pBuf: PByte;
begin
  pBuf := PByte(Buf);
  repeat
    pBuf^ := PByte(Tab)[pBuf^];
    Inc(pBuf);
    Dec(Len);
  until Len = 0;
end;

So why not just use plain Delphi code and let the compiler deal with the assembly?

Why are you using assembler?

There is no good reason to!

This is a direct translation of your asm code to Delphi Pascal:

procedure ExecuteAsm(Tab, Buf: PByte; Len: DWORD);
begin
 repeat
   Buf^ := Tab[Buf^];
   inc(Buf);
   dec(Len);
 until Len = 0;
end;

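But as you can see now, if Len is 0 the procedure will corrupt program memory: the loop body runs once before the test, Len wraps around to $FFFFFFFF, and the loop keeps going far past the end of the buffer.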

...

This code is better, because the while loop tests for 0 up front and never executes the body in that case.

procedure ExecuteAsm(Tab, Buf: PByte; Len: cardinal);
begin
  while Len > 0 do
  begin
    Buf^ := Tab[Buf^];
    inc(Buf);
    dec(Len);
  end;
end;
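
As a quick sanity check of the while-loop version above, the following minimal sketch runs it on a tiny buffer. The program name, the inverting table (b maps to 255 - b) and the sample data are assumptions made up for illustration; the routine is renamed Translate only to keep the example self-contained.

```pascal
program TranslateDemo; // hypothetical name, for illustration only

{$APPTYPE CONSOLE}
{$POINTERMATH ON} // makes sure Tab[Index] is allowed on a PByte

// Same logic as the while-loop version above.
procedure Translate(Tab, Buf: PByte; Len: Cardinal);
begin
  while Len > 0 do
  begin
    Buf^ := Tab[Buf^]; // replace each byte with its table entry
    Inc(Buf);
    Dec(Len);
  end;
end;

var
  Table: array [0 .. 255] of Byte;
  Data: array [0 .. 3] of Byte;
  i: Integer;
begin
  // Assumed example table: maps every byte b to 255 - b.
  for i := 0 to 255 do
    Table[i] := 255 - i;

  Data[0] := 0;
  Data[1] := 1;
  Data[2] := 254;
  Data[3] := 255;

  // Len = 0: the loop body never runs, so the buffer stays untouched.
  Translate(@Table[0], @Data[0], 0);
  Writeln(Data[0], ' ', Data[1], ' ', Data[2], ' ', Data[3]); // 0 1 254 255

  // Translate all four bytes through the table.
  Translate(@Table[0], @Data[0], Length(Data));
  Writeln(Data[0], ' ', Data[1], ' ', Data[2], ' ', Data[3]); // 255 254 1 0
end.
```

The first call is the interesting one: with the repeat..until version, the same call would not return after zero iterations.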

However, if you still prefer assembler, you must preserve the ebx/rbx register, like this:

procedure ExecuteAsm(Tab, Buf: Pointer; Len: DWORD);
asm
    push    ebx   //rbx

//... your code

    pop     ebx   //rbx
end;

EDIT: Added 32 bit and 64 bit tests

Because HeartWare didn't do the homework David Heffernan set, I did. The original test was written by David Heffernan (see the comments on HeartWare's answer). I made only a few small changes and added two more test cases. This directive is important: {$O+} //Turn on compiler optimisation... :)

{$APPTYPE CONSOLE}

uses
  Diagnostics;

 {$O+} //Turn on compiler optimisation... :)

procedure _asm_GJ(Tab, Buf : PByte; Len : Cardinal);
//    32-bit   eax edx           ecx
//    64-bit   rcx rdx           r8
asm
{$IFDEF CPUX64 }
        test    Len, Len
        jz      @exit
@loop:
        movzx   rax, [Buf]
        mov     al, byte ptr[Tab + rax]
        mov     [Buf],al
        inc     Buf
        dec     Len
        jnz     @loop
{$ELSE }
        test    Len, Len
        jz      @exit
        push    ebx
@loop:
        movzx   ebx, [Buf]
        mov     bl,byte ptr[Tab + ebx]
        mov     [Buf], bl
        inc     Buf
        dec     Len
        jnz     @loop
        pop     ebx
{$ENDIF }
@exit:
end;

procedure _asm_HeartWare(Tab, Buf : PByte; Len : Cardinal);
//  32-bit     EAX EDX           ECX
//  64-bit     RCX RDX           R8
asm
    {$IFDEF CPUX64 }
        XCHG    R8,RCX
        JECXZ   @OUT
        XOR     RAX,RAX
    @LOOP:
        MOV     AL,[RDX]
        MOV     AL,[R8+RAX]
        MOV     [RDX],AL
        INC     RDX
        DEC     ECX
        JNZ     @LOOP
        // LOOP @LOOP
    {$ELSE }
        JECXZ   @OUT
        PUSH    EBX
        XCHG    EAX,EBX
        XOR     EAX,EAX
    @LOOP:
        MOV     AL,[EDX+ECX-1]
        MOV     AL,[EBX+EAX]
        MOV     [EDX+ECX-1],AL
        DEC     ECX
        JNZ     @LOOP
        // LOOP @LOOP
        POP     EBX
    {$ENDIF }
    @OUT:
end;

procedure _pas_normal(Tab, Buf: PByte; Len: Cardinal);
begin
  while Len > 0 do begin
    Buf^ := Tab[Buf^];
    inc(Buf);
    dec(Len);
  end;
end;

procedure _pas_inline(Tab, Buf: PByte; Len: Cardinal); inline;
begin
  while Len > 0 do begin
    Buf^ := Tab[Buf^];
    inc(Buf);
    dec(Len);
  end;
end;

var
  Stopwatch: TStopwatch;
  i: Integer;
  x, y: array [0 .. 1023] of Byte;

procedure refresh;
var
  i: Integer; // a for-loop control variable must be local to the routine
begin
  for i := low(x) to high(x) do
  begin
    x[i] := i mod 256;
    y[i] := (i + 20) mod 256;
  end;
end;

begin
{$IFDEF CPUX64 }
  Writeln('64 bit mode');
{$ELSE }
  Writeln('32 bit mode');
{$ENDIF }
  refresh;
  Stopwatch := TStopwatch.StartNew;
  for i := 1 to 1000000 do
  begin
    _asm_HeartWare(PByte(@x), PByte(@y), SizeOf(x));
  end;
  Writeln('asm HeartWare : ', Stopwatch.ElapsedMilliseconds, 'ms');

  refresh;
  Stopwatch := TStopwatch.StartNew;
  for i := 1 to 1000000 do
  begin
    _asm_GJ(PByte(@x), PByte(@y), SizeOf(x));
  end;
  Writeln('asm GJ        : ', Stopwatch.ElapsedMilliseconds, 'ms');

  refresh;
  Stopwatch := TStopwatch.StartNew;
  for i := 1 to 1000000 do
  begin
    _pas_normal(PByte(@x), PByte(@y), SizeOf(x));
  end;
  Writeln('pas normal    : ', Stopwatch.ElapsedMilliseconds, 'ms');

  refresh;
  Stopwatch := TStopwatch.StartNew;
  for i := 1 to 1000000 do
  begin
    _pas_inline(PByte(@x), PByte(@y), SizeOf(x));
  end;
  Writeln('pas inline    : ', Stopwatch.ElapsedMilliseconds, 'ms');

  Readln;
end.

And the results...

[Image: screenshot of the benchmark results, not preserved]

Conclusion...

There is almost nothing to say! The numbers speak for themselves...

The Delphi compiler is good, hmm, very good!

I added another asm-optimised procedure to the test, because HeartWare's asm optimisation isn't a real optimisation.

NOTE: Read the accepted answer by GJ, as it contains a Pascal implementation that beats the crap out of my version. I seem to confuse the compiler by using ABSOLUTE to overcome the signature problem GJ's implementation has, which is one of the reasons why I didn't use it as the Pascal version. But even when recoded to match the signature and using explicit type casts within the routine, it was still much faster than my Pascal version, and on par with the optimized assembler version. So, as stated in my own reply and all the others, use a Pascal implementation when possible, unless it is a time-critical routine called a gazillion times and an actual benchmark shows that the ASM version is significantly faster (which, in my defense, my benchmark did show).

{$IFDEF MSWINDOWS }
PROCEDURE ExecuteAsm(Tab,Buf : POINTER ; Len : DWORD); ASSEMBLER; Register;
  //      32-bit     EAX EDX             ECX
  //      64-bit     RCX RDX             R8
  ASM
    {$IFDEF CPUX64 }
        XCHG    R8,RCX
        JECXZ   @OUT
        XOR     RAX,RAX
    @LOOP:
        MOV     AL,[RDX]
        MOV     AL,[R8+RAX]
        MOV     [RDX],AL
        INC     RDX
        DEC     ECX
        JNZ     @LOOP
        // LOOP @LOOP
    {$ELSE }
        JECXZ   @OUT
        PUSH    EBX
        XCHG    EAX,EBX
        XOR     EAX,EAX
    @LOOP:
        MOV     AL,[EDX+ECX-1]
        MOV     AL,[EBX+EAX]
        MOV     [EDX+ECX-1],AL
        DEC     ECX
        JNZ     @LOOP
        // LOOP @LOOP
        POP     EBX
    {$ENDIF }
    @OUT:
  END;
{$ELSE }
PROCEDURE ExecuteAsm(Tab,Buf : POINTER ; Len : DWORD);
  VAR
    TabP    : PByte ABSOLUTE Tab;
    BufP    : PByte ABSOLUTE Buf;
    I       : Cardinal;

  BEGIN
    FOR I:=1 TO Len DO BEGIN
      BufP^:=TabP[BufP^];
      INC(BufP)
    END
  END;
{$ENDIF }

This should be a valid substitution for all currently supported compilers and platforms. While I agree that it might be better to use the pure Pascal version, it does lead to some horrendous assembly code with lots of unnecessary reloading of registers (at least in 32-bit), so the pure assembly version is definitely faster.

However, unless you run it like a gazillion times, you probably won't notice it in actual use, and the pure Pascal routine will most likely perform adequately. However, only you can determine if the speed improvement is necessary.

Anyway, here are the timings for executing the PROCEDURE 100,000 times on a 256-byte array (using XE5):

32-bit ASM: 47 ms
64-bit ASM: 47 ms
32-bit PAS: 63 ms
64-bit PAS: 78 ms

and the timings for running it 10,000,000 times in RELEASE configuration:

32-bit ASM: 5281 ms
64-bit ASM: 5281 ms
32-bit PAS: 7765 ms
64-bit PAS: 10031 ms

Still, the ASM version beats the Pascal version in all cases...

And the hand-optimized assembly version performed even better:

32-bit ASM: 1906 ms
64-bit ASM: 1859 ms
32-bit PAS: 7781 ms
64-bit PAS: 10015 ms

And with 10,000 runs over 25,600 bytes instead:

32-bit ASM: 218 ms
64-bit ASM: 172 ms
32-bit PAS: 734 ms
64-bit PAS: 937 ms

In ALL cases, my ASM routine beats the crap out of the compiler's. I simply can't reproduce your timings... What code and compiler did you use?

The actual code that computes the time is as follows (for the 10,000 runs over 25,600 bytes):

T:=GetTickCount;
FOR I:=1 TO 10000 DO ExecuteAsm(TAB,BUF,25600);
T:=GetTickCount-T;
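
GetTickCount typically advances in steps of roughly 10 to 16 ms, so short runs such as the 47 ms figure are measured rather coarsely. A minimal sketch of the same measurement with TStopwatch (the type used by the benchmark earlier in this thread) might look like the following. The program name, the identity table and the zero-filled buffer are assumptions for illustration, and a pure Pascal stand-in replaces the asm routine so that the example compiles on its own.

```pascal
program TimingSketch; // hypothetical name, for illustration only

{$APPTYPE CONSOLE}
{$POINTERMATH ON} // makes sure Tab[Index] is allowed on a PByte

uses
  System.Diagnostics;

// Pure Pascal stand-in for the routine being timed (assumption:
// substitute the real ExecuteAsm when measuring the asm version).
procedure ExecuteAsm(Tab, Buf: PByte; Len: Cardinal);
begin
  while Len > 0 do
  begin
    Buf^ := Tab[Buf^];
    Inc(Buf);
    Dec(Len);
  end;
end;

var
  Tab: array [0 .. 255] of Byte;
  Buf: array [0 .. 25599] of Byte;
  SW: TStopwatch;
  I: Integer;
begin
  // Assumed test data: identity table and a zero-filled buffer.
  for I := 0 to 255 do
    Tab[I] := I;
  FillChar(Buf, SizeOf(Buf), 0);

  // Time 10,000 runs over 25,600 bytes, as in the measurement above.
  SW := TStopwatch.StartNew;
  for I := 1 to 10000 do
    ExecuteAsm(@Tab[0], @Buf[0], SizeOf(Buf));
  SW.Stop;

  Writeln('Elapsed: ', SW.ElapsedMilliseconds, ' ms');
end.
```

The absolute numbers depend on the machine and on the compiler options, so only comparisons made with the same setup are meaningful.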

I'm not at all sure that it works correctly, but it compiles successfully:

procedure ExecuteAsm(Tab, Buf: Pointer; Len: DWORD);
asm
     mov   rbx, Tab
     mov   ecx, Len
     mov   rdx, Buf
@1:  mov   al,  [rdx]
     xlat
     mov   [rdx], al
     inc   rdx
     dec   ecx
     jnz @1
end;

Is it the correct answer?
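
Compiling is only part of the question. One hedged way to gain some confidence is to compare the routine's output with a plain Pascal reference on the same input. In this sketch, TTranslateProc, ReferenceTranslate and Matches are hypothetical names, the table and data patterns are arbitrary, and the reference is passed in as its own candidate purely so that the program is self-contained; the asm routine would be substituted for it.

```pascal
program CheckTranslate; // hypothetical name, for illustration only

{$APPTYPE CONSOLE}
{$POINTERMATH ON} // makes sure T[Index] is allowed on a PByte

uses
  System.SysUtils; // CompareMem

type
  // Hypothetical procedural type matching the routine's signature.
  TTranslateProc = procedure(Tab, Buf: Pointer; Len: Cardinal);

// Plain Pascal reference implementation.
procedure ReferenceTranslate(Tab, Buf: Pointer; Len: Cardinal);
var
  T, P: PByte;
begin
  T := Tab;
  P := Buf;
  while Len > 0 do
  begin
    P^ := T[P^];
    Inc(P);
    Dec(Len);
  end;
end;

// Runs the reference and the candidate on identical data and
// reports whether both buffers end up byte-for-byte equal.
function Matches(Candidate: TTranslateProc): Boolean;
var
  Tab: array [0 .. 255] of Byte;
  A, B: array [0 .. 999] of Byte;
  I: Integer;
begin
  for I := 0 to 255 do
    Tab[I] := (I * 7 + 3) mod 256; // arbitrary table contents
  for I := 0 to High(A) do
  begin
    A[I] := (I * 13) mod 256;      // arbitrary input bytes
    B[I] := A[I];
  end;
  ReferenceTranslate(@Tab[0], @A[0], Length(A));
  Candidate(@Tab[0], @B[0], Length(B));
  Result := CompareMem(@A[0], @B[0], SizeOf(A));
end;

begin
  // Assumption: pass the asm ExecuteAsm here instead of the reference.
  if Matches(ReferenceTranslate) then
    Writeln('Output matches the reference')
  else
    Writeln('Output differs from the reference');
end.
```

A check like this only compares results. It cannot show that the routine above overwrites rbx without saving it, and another answer in this thread points out that ebx/rbx must be preserved.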
