How to write this code with assembly code?

I want to convert this code into assembly code that works on the Mac. How do I do this?

while (a --)
{
    *pDest ++ += *pSrc ++;
}

It's for an Intel Mac and for the iPhone. I am working on a program that runs this code in a thread. The thread does this kind of work constantly and sometimes gets stuck, so I am wondering whether the calculation is too heavy for the iPhone.

No, your problem has nothing to do with this code. Let the compiler do its job and optimize this. Your problem is elsewhere. It sounds like you have a race condition or deadlock between threads somehow. I can't psychically debug your problem without more information, but I can say for sure you're barking up the wrong tree.

The actual assembler instructions will differ, but here's pseudocode that can be translated into assembler pretty easily.

Note that the *4 is because I'm assuming you're transferring ints. It will vary depending on the size of the data being transferred.

incrementor = 0 ;really easy
top:
jump to bottom if a equals 0        ;jump if zero is the intel instruction here.
memoryDest[incrementor*4] += memorySrc[incrementor*4] ;add, not copy. this will be a bit messy, you'll probably need some temp variables
incrementor += 1  ;dead easy
a -= 1 ;count down so the loop terminates
jump to top ;goto. PLEASE DON'T CITE 'CONSIDERED HARMFUL', THIS IS ASM!!!!11ONEONE
bottom:

Assuming that the arrays in question are of reasonable length, and depending on the types of pDest and pSrc, you may be able to get a reasonable speedup by using the NEON instructions on ARMv7 (iPhone 3GS and the new Touch) and SSE on Intel.

The specific code, and how much of a speedup you can get, will depend on the type of data in the source and destination arrays, what alignment guarantees you have on the array addresses, and what the distribution of typical array lengths is like.

As always, none of this is worth doing unless you have a Shark trace showing that this loop is an appreciable portion of your execution time. If you're doing application-level performance tuning on the Mac or iPhone and you aren't using Shark or Instruments, you're doing it wrong.

If the arrays are floating-point, you can get well-tuned vector code on the Intel Mac by including Accelerate.framework and using the vDSP_vadd() function. No assembly coding necessary.
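As a rough sketch of what that call could look like for single-precision arrays: the helper name add_float_arrays is made up for illustration, and the code assumes contiguous arrays (stride 1) and that passing the destination as both an input and the output is acceptable for in-place accumulation. On a Mac it would need to be linked with -framework Accelerate; the plain loop in the #else branch is only there so the sketch builds elsewhere.

```c
#include <stddef.h>

#if defined(__APPLE__)
#include <Accelerate/Accelerate.h>
#endif

/* Hypothetical helper: adds src[i] into dest[i] for i in [0, count). */
void add_float_arrays(float *dest, const float *src, size_t count)
{
#if defined(__APPLE__)
    /* vDSP_vadd computes C[i] = A[i] + B[i]. A stride of 1 means the
       arrays are contiguous. dest is used as both the second input and
       the output, so the sum is accumulated in place (assumption: the
       in-place use is acceptable for these buffers). */
    vDSP_vadd(src, 1, dest, 1, dest, 1, (vDSP_Length)count);
#else
    /* Portable fallback so the sketch still compiles off Apple platforms. */
    for (size_t i = 0; i < count; ++i) {
        dest[i] += src[i];
    }
#endif
}
```

Calling add_float_arrays(pDest, pSrc, a) would then stand in for the original while loop when both pointers are float arrays.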

If you have access to the 2008 WWDC talks, Eric Postpischil gave a nice talk on basic vectorization techniques in which he walked through writing vector code to handle exactly this loop (in the case where pSrc and pDest are single-precision arrays) on Intel, though for simplicity he used C with vector intrinsics instead of ASM.
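For a feel of what C with vector intrinsics looks like for this loop, here is a minimal sketch. It is not the code from the talk; the function name add_floats_sse is invented for illustration, and it uses unaligned loads and stores because no alignment guarantee is assumed. Four floats are added per iteration, and a scalar loop handles the leftover elements (and everything, if SSE is not available at compile time).

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical helper: dest[i] += src[i] for i in [0, count). */
void add_floats_sse(float *dest, const float *src, size_t count)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Process four single-precision values per iteration. The "u"
       variants tolerate unaligned addresses. */
    for (; i + 4 <= count; i += 4) {
        __m128 d = _mm_loadu_ps(dest + i);
        __m128 s = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dest + i, _mm_add_ps(d, s));
    }
#endif
    /* Scalar tail: the remaining 0 to 3 elements, or the whole array
       when SSE is not enabled. */
    for (; i < count; ++i) {
        dest[i] += src[i];
    }
}
```

If the arrays are known to be 16-byte aligned, the aligned load and store intrinsics (_mm_load_ps, _mm_store_ps) could be used instead; whether that matters depends on the hardware and is worth measuring.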

So this is for ARM (iPhone)? What size are the elements these pointers refer to (bytes, halfwords, words, etc.)? Are you having alignment problems (copying words on a non-word boundary)? If these are bytes then yes, the generated code is likely painfully slow; the optimizer can't do too much with it. Where does that leave you? You get what you get.

Here is an example:

    mov ip, #0
.L3:
    ldrb    r3, [r0, ip]    @ zero_extendqisi2
    ldrb    r2, [r1, ip]    @ zero_extendqisi2
    add r3, r3, r2
    strb    r3, [r1, ip]
    add ip, ip, #1
    cmp ip, r4
    bne .L3

Because your code had the pointers counting up, the compiler added an instruction that it didn't need (the cmp). Counting down instead lets the subs set the flags for the branch:

    sub     ip, rx, #1
.L3:
    ldrb    r3, [r0, ip]    @ zero_extendqisi2
    ldrb    r2, [r1, ip]    @ zero_extendqisi2
    add r3, r3, r2
    strb    r3, [r1, ip]
    subs    ip, ip, #1
    bne .L3

Since the carry bit is not used, I wonder if there is a way to load a word and do word-based adds, one word at a time.

load 0xnnmmoopp
load 0xqqrrsstt

Mask one of them to guarantee no carry problems:

0xnnmmoopp -> 0xn0mmo0pp

add

0xgghhiikk = 0xn0mmo0pp + 0xqqrrsstt

Then store hh and kk as bytes.

You then have to go back to the original word, mask out the mm and pp bytes, redo the add, and store the gg and ii bytes.

The two word reads should be significantly faster than four byte reads. If you keep all of the above in registers and do one word store instead of four byte stores, that will save quite a bit more time.

You will have to save a lot of registers to the stack, which costs time, so you don't want to do this for small values of a (less than 10, let's say).

Anyway, something to think about. Just removing the one instruction in the asm above should be noticeable for long runs.
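The word-at-a-time idea can be sketched in C roughly as follows, assuming the elements are unsigned bytes that should wrap on overflow. The function names are invented for illustration. One deliberate difference from the description above: this sketch masks both operands rather than just one, since with only one side masked an incoming carry could still ripple through a 0xff byte. It handles the even and odd bytes in two separate adds and then recombines them into a single word store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Adds the four bytes of x to the four bytes of y, lane by lane,
   with each byte wrapping independently (no carry between bytes). */
static uint32_t add_bytes_in_word(uint32_t x, uint32_t y)
{
    const uint32_t lanes = 0x00FF00FFu;
    /* Bytes 0 and 2: the zeroed bytes in between absorb the carries,
       which are then masked away. */
    uint32_t even = ((x & lanes) + (y & lanes)) & lanes;
    /* Bytes 1 and 3: shift them down into the same positions first. */
    uint32_t odd = (((x >> 8) & lanes) + ((y >> 8) & lanes)) & lanes;
    return even | (odd << 8);
}

/* Hypothetical helper: dest[i] += src[i] for byte arrays. */
void add_bytes(uint8_t *dest, const uint8_t *src, size_t count)
{
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        uint32_t d, s;
        /* memcpy avoids alignment and aliasing problems; compilers
           typically turn it into a plain word load or store. */
        memcpy(&d, dest + i, sizeof d);
        memcpy(&s, src + i, sizeof s);
        d = add_bytes_in_word(d, s);
        memcpy(dest + i, &d, sizeof d);
    }
    /* Leftover 0 to 3 bytes, one at a time. */
    for (; i < count; ++i) {
        dest[i] = (uint8_t)(dest[i] + src[i]);
    }
}
```

Because each byte lane is processed independently, the result does not depend on the byte order of the machine. Whether this beats the compiler's byte loop on a given device is something only a profile can answer.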

EDIT:

Actually, the modification I made to the compiler output was broken. This is more like it:

    mov  ip, ra
.L3:
    subs ip, ip, #1
    ldrb r3, [r0, ip]   
    ldrb r2, [r1, ip]   
    add  r3, r3, r2
    strb r3, [r1, ip]
    bne  .L3

A few stackshots will show whether this is actually where you're spending time.

If it is, unrolling the loop could help, as in:

while (a >= 8){
    pDest[0] += pSrc[0];
    pDest[1] += pSrc[1];
    pDest[2] += pSrc[2];
    pDest[3] += pSrc[3];
    pDest[4] += pSrc[4];
    pDest[5] += pSrc[5];
    pDest[6] += pSrc[6];
    pDest[7] += pSrc[7];
    pDest += 8;
    pSrc += 8;
    a -= 8;
}
// followed by your loop
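Put together with the original loop as the tail, the unrolled version might look like the sketch below. The function name add_arrays_unrolled and the int element type are assumptions made for illustration; the question does not say what the pointers point to.

```c
#include <stddef.h>

/* Hypothetical helper: pDest[i] += pSrc[i] for i in [0, a). */
void add_arrays_unrolled(int *pDest, const int *pSrc, size_t a)
{
    /* Main body: eight elements per iteration, so the loop overhead
       (compare, branch, pointer updates) is paid once per eight adds. */
    while (a >= 8) {
        pDest[0] += pSrc[0];
        pDest[1] += pSrc[1];
        pDest[2] += pSrc[2];
        pDest[3] += pSrc[3];
        pDest[4] += pSrc[4];
        pDest[5] += pSrc[5];
        pDest[6] += pSrc[6];
        pDest[7] += pSrc[7];
        pDest += 8;
        pSrc += 8;
        a -= 8;
    }
    /* Tail: the original loop handles the remaining 0 to 7 elements. */
    while (a--) {
        *pDest++ += *pSrc++;
    }
}
```

An optimizing compiler may already unroll the simple loop on its own, so it is worth comparing both versions under a profiler before keeping the longer one.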

You could code it in assembler, but it probably would not be much better.

You say that you're developing for iPhone and are trying to improve speed. It looks like you're trying to copy a block of memory, for which you probably want to use memcpy(dest, src, size).
