Looping over an array in NASM

I want to learn programming in assembly to write fast and efficient code. However, I've stumbled over a problem I can't solve.

I want to loop over an array of double words and add up its elements, as shown below:

%include "asm_io.inc"  
%macro prologue 0
    push    rbp
    mov     rbp,rsp
    push    rbx
    push    r12
    push    r13
    push    r14
    push    r15
%endmacro
%macro epilogue 0
    pop     r15
    pop     r14
    pop     r13
    pop     r12
    pop     rbx
    leave
    ret
%endmacro

segment .data
string1 db  "result: ",0
array   dd  1, 2, 3, 4, 5

segment .bss


segment .text
global  sum

sum:
    prologue

    mov  rdi, string1
    call print_string

    mov  rbx, array
    mov  rdx, 0
    mov  ecx, 5

lp:
    mov  rax, [rbx]
    add  rdx, rax
    add  rbx, 4
    loop lp

    mov  rdi, rdx
    call print_int
    call print_nl

epilogue

sum is called by a simple C driver. The functions print_string, print_int and print_nl look like this:

section .rodata
int_format  db  "%i",0
string_format db "%s",0

section .text
global  print_string, print_nl, print_int, read_int
extern printf, scanf, putchar

print_string:
    prologue
    ; string address has to be passed in rdi
    mov     rsi,rdi
    mov     rdi,dword string_format
    xor     rax,rax
    call    printf
    epilogue

print_nl:
    prologue
    mov     rdi,0xA
    xor     rax,rax
    call    putchar
    epilogue

print_int:
    prologue
    ;integer arg is in rdi
    mov     rsi, rdi
    mov     rdi, dword int_format
    xor     rax,rax
    call    printf
    epilogue

When printing the result after summing all array elements, it says "result: 14" instead of 15. I tried several combinations of elements, and it seems that my loop always skips the first element of the array. Can somebody tell me why the loop skips the first element?

Edit

I forgot to mention that I'm using an x86_64 Linux system.

I'm not sure why your code is printing the wrong number. Probably an off-by-one somewhere that you should track down with a debugger. gdb with layout asm and layout reg should help. Actually, I think you're going one past the end of the array. There's probably a -1 there, and you're adding it to your accumulator.

If your ultimate goal is writing fast & efficient code, you should have a look at some of the links I added recently to https://stackoverflow.com/tags/x86/info . In particular, Agner Fog's optimization guides are great for helping you understand what runs efficiently on today's machines, and what doesn't. For example, leave is shorter, but takes 3 uops, compared to mov rsp, rbp / pop rbp taking 2. Or just omit the frame pointer. (gcc defaults to -fomit-frame-pointer for amd64 these days.) Messing around with rbp just wastes instructions and costs you a register, especially in functions that are worth writing in asm (i.e. usually everything lives in registers, and you don't call other functions).


The "normal" way to do this would be to write your function in asm, call it from C to get the results, and then print the output with C. If you want your code to be portable to Windows, you can use something like:

#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uint32_t */
#define SYSV_ABI __attribute__((sysv_abi))
int SYSV_ABI myfunc(void* dst, const void* src, size_t size, const uint32_t* LH);

Then even if you compile for Windows, you don't have to change your asm to look for its args in different registers. (The SysV calling convention is nicer than Win64: more args in registers, and all the vector registers are allowed to be used without saving them.) Make sure you have a new enough gcc that has the fix for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66275 , though.

An alternative is to use some assembler macros to %define some register names so you can assemble the same source for the Windows or SysV ABIs. Or have a Windows entry point before the regular one, which uses some MOV instructions to put args in the registers the rest of the function is expecting. But that is obviously less efficient.


It's useful to know what function calls look like in asm, but writing them yourself is usually a waste of time. Your finished routine will just return a result (in a register or memory), not print it. Your print_int etc. routines are hilariously inefficient. (push/pop every callee-saved register, even though you use none of them, and multiple calls to printf instead of using a single format string ending with a \n.) I know you didn't claim this code was efficient, and that you're just learning. You probably already had some idea that this wasn't very tight code. :P
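To make that split concrete, here is a rough C sketch in which the summing routine only returns its result and the caller builds the whole output line with one format string. The names sum_dwords and format_result are hypothetical, not from the question; sum_dwords is a plain C stand-in for the asm routine, and an actual asm version would have to follow the same prototype.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the asm routine: it returns the sum
 * and does no printing of its own. */
int32_t sum_dwords(const int32_t *v, size_t n)
{
    uint32_t total = 0;              /* unsigned, so wraparound is well defined */
    for (size_t i = 0; i < n; i++)
        total += (uint32_t)v[i];
    return (int32_t)total;
}

/* Hypothetical helper: formats label, value and newline with a single
 * format string instead of three separate print calls.
 * Returns what snprintf returns (the length of the formatted text). */
int format_result(char *buf, size_t size, int32_t sum)
{
    return snprintf(buf, size, "result: %d\n", (int)sum);
}
```

A driver would call sum_dwords on the array, pass the return value to format_result, and write the buffer out with fputs, or simply call printf with the same format string.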

My point is, compilers are REALLY good at their job, most of the time. Spend your time writing asm ONLY for the hot parts of your code: usually just a loop, sometimes including the setup / cleanup code around it.


So, on to your loop:

lp:
    mov  rax, [rbx]
    add  rdx, rax
    add  rbx, 4
    loop lp

Never use the loop instruction. It decodes to 7 uops, vs. 1 for a macro-fused compare-and-branch. loop has a max throughput of one per 5 cycles (Intel Sandybridge/Haswell and later). By comparison, dec ecx / jnz lp or cmp rbx, array_end / jb lp would let your loop run at one iteration per cycle.
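For readers who think in C, the cmp rbx, array_end / jb lp shape corresponds roughly to a do-while loop that compares the moving pointer against a one-past-the-end pointer. The sketch below only illustrates that loop structure; sum_ptr_end is a hypothetical name, the asm in the comments is the intended mapping rather than guaranteed compiler output, and the compiler is free to pick different instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration of a pointer-compare loop.
 * The comments show the asm each line is meant to mirror. */
int32_t sum_ptr_end(const int32_t *v, size_t n)
{
    uint32_t total = 0;              /* mov edx, 0 */
    const int32_t *end;

    if (n == 0)                      /* the do-while body assumes >= 1 element */
        return 0;

    end = v + n;                     /* one past the end, like array_end */
    do {
        total += (uint32_t)*v;       /* add edx, [rbx] (32-bit load) */
        v++;                         /* add rbx, 4 */
    } while (v < end);               /* cmp rbx, array_end / jb lp */

    return (int32_t)total;
}
```

Note that no separate counter register is needed: the pointer itself is the loop variable.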

Since you're using a single-register addressing mode, using add rdx, [rbx] would also be more efficient than a separate mov load. (It's a more complicated tradeoff with indexed addressing modes, since they can only micro-fuse in the decoders / uop cache, not in the rest of the pipeline, on Intel SnB-family. In this case, add rdx, [rbx+rsi] or something would stay micro-fused on Haswell and later.)

When writing asm by hand, if it's convenient, help yourself out by keeping source pointers in rsi and dest pointers in rdi. The movs instruction implicitly uses them that way, which is why they're named si and di. Never use extra mov instructions just because of register names, though. If you want more readability, use C with a good compiler.

;;; This loop probably has lots of off-by-one errors
;;; and doesn't handle array-length being odd
mov rsi, array
lea rdx, [rsi + array_length*4]  ; if len is really a compile-time constant, get your assembler to generate it for you.
mov eax, [rsi]   ; load first element
mov ebx, [rsi+4] ; load 2nd element
add rsi, 8       ; eliminate this insn by loading array+8 in the first place earlier
; TODO: handle length < 4

ALIGN 16
.loop:
    add eax, [    rsi]
    add ebx, [4 + rsi]
    add rsi, 8
    cmp rsi, rdx
    jb .loop         ;  loop while rsi is Below one-past-the-end
;  TODO: handle odd-length
add eax, ebx
ret

Don't use this code without debugging it. gdb (with layout asm and layout reg) is not bad, and available in every Linux distro.

If your arrays are always going to have very short compile-time-constant lengths, just fully unroll the loops. Otherwise, an approach like this, with two accumulators, lets two additions happen in parallel. (Intel and AMD CPUs have two load ports, so they can sustain two adds from memory per clock. Haswell has 4 execution ports that can handle scalar integer ops, so it can execute this loop at 1 iteration per cycle. Previous Intel CPUs can issue 4 uops per cycle, but the execution ports will fall behind. Unrolling to minimize loop overhead would help.)
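The two-accumulator idea, including the odd-length case that the asm above leaves as a TODO, can be sketched in C as follows. This is an illustrative version under my own naming (sum_two_acc is hypothetical), not a translation the original code was checked against, and an optimizing compiler may well restructure or vectorize it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: sum with two independent accumulators so the
 * two add chains do not depend on each other. */
int32_t sum_two_acc(const int32_t *v, size_t n)
{
    uint32_t acc0 = 0, acc1 = 0;     /* eax and ebx in the asm version */
    size_t i = 0;

    /* Main loop: two elements per iteration. */
    for (; i + 2 <= n; i += 2) {
        acc0 += (uint32_t)v[i];      /* add eax, [rsi]     */
        acc1 += (uint32_t)v[i + 1];  /* add ebx, [rsi + 4] */
    }

    /* Cleanup: at most one leftover element when n is odd. */
    if (i < n)
        acc0 += (uint32_t)v[i];

    return (int32_t)(acc0 + acc1);   /* add eax, ebx */
}
```

Starting both accumulators at zero, instead of preloading the first two elements, also removes the need for a special case when the array has fewer than two elements.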

All these techniques (especially multiple accumulators) are equally applicable to vector instructions.

segment .rodata         ; read-only data
ALIGN 16
array:  times 64    dd  1, 2, 3, 4, 5
array_bytes equ $-array
string1 db  "result: ",0

segment .text
; TODO: scalar loop until rsi is aligned
; TODO: handle length < 64 bytes
lea rsi, [array + 32]
lea rdx, [rsi - 32 + array_bytes]  ;  array_length could be a register (or 4*a register, if it's a count).
; lea rdx, [array + array_bytes] ; This way would be lower latency, but more insn bytes, when "array" is a symbol, not a register.  We don't need rdx until later.
movdqu xmm0, [rsi - 32]   ; load first vector (elements 0..3)
movdqu xmm1, [rsi - 16]   ; load second vector (elements 4..7)
; note the more-efficient loop setup that doesn't need an add rsi, 32.

ALIGN 16
.loop:
    paddd  xmm0, [     rsi]   ; add packed dwords
    paddd  xmm1, [16 + rsi]
    add rsi, 32
    cmp rsi, rdx
    jb .loop         ;  loop: 4 fused-domain uops
paddd   xmm0, xmm1
phaddd  xmm0, xmm0     ; horizontal add: SSSE3 phaddd is simple but not optimal.  Better to pshufd/paddd
phaddd  xmm0, xmm0
movd    eax, xmm0
;  TODO: scalar cleanup loop
ret

Again, this code probably has bugs, and doesn't handle the general case of alignment and length. It's unrolled so each iteration processes two vectors of four packed ints = 32 bytes of input data.

It should run at one iteration per cycle on Haswell, otherwise 1 iteration per 1.333 cycles on SnB/IvB. The frontend can issue all 4 uops in a cycle, but the execution units can't keep up without Haswell's 4th ALU port to handle the add and macro-fused cmp/jb. Unrolling to 4 paddd per iteration would do the trick for Sandybridge, and probably help on Haswell, too.

With AVX2 vpaddd ymm1, ymm1, [32+rsi], you get double the throughput (if the data is in the cache, otherwise you still bottleneck on memory). To do the horizontal sum for a 256b vector, start with a vextracti128 xmm1, ymm0, 1 / vpaddd xmm0, xmm0, xmm1, and then it's the same as the SSE case. See this answer for more details about efficient shuffles for horizontal ops.
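For comparison, here is a hedged C sketch of the same SSE2 approach using compiler intrinsics: two vector accumulators, unaligned loads, a pshufd/paddd horizontal sum in place of phaddd, and a scalar cleanup loop for leftover elements. The function name sum_dwords_sse2 is hypothetical. The vector path is compiled only when the compiler defines __SSE2__ (the default for x86-64 targets with gcc and clang); otherwise the scalar loop does all the work.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>               /* SSE2 intrinsics */
#endif

/* Hypothetical sketch: sum 32-bit integers with two SSE2 accumulators. */
int32_t sum_dwords_sse2(const int32_t *v, size_t n)
{
    uint32_t total = 0;
    size_t i = 0;

#if defined(__SSE2__)
    __m128i acc0 = _mm_setzero_si128();
    __m128i acc1 = _mm_setzero_si128();

    /* 8 dwords (32 bytes) per iteration; movdqu-style unaligned loads. */
    for (; i + 8 <= n; i += 8) {
        __m128i lo = _mm_loadu_si128((const __m128i *)(v + i));
        __m128i hi = _mm_loadu_si128((const __m128i *)(v + i + 4));
        acc0 = _mm_add_epi32(acc0, lo);          /* paddd xmm0, [rsi]      */
        acc1 = _mm_add_epi32(acc1, hi);          /* paddd xmm1, [rsi + 16] */
    }

    __m128i s = _mm_add_epi32(acc0, acc1);
    /* Horizontal sum: swap the 64-bit halves and add, then swap
     * adjacent dwords and add (pshufd + paddd, twice). */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    total = (uint32_t)_mm_cvtsi128_si32(s);      /* movd eax, xmm0 */
#endif

    /* Scalar cleanup for the remaining 0..7 elements
     * (or for everything when SSE2 is unavailable). */
    for (; i < n; i++)
        total += (uint32_t)v[i];

    return (int32_t)total;
}
```

Because the accumulators start at zero and the tail is handled by the scalar loop, this version works for any length, at the cost of one extra vector add compared with preloading the first 32 bytes.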

Notice: technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please credit this site's URL or the address of the original post.