Shifting a huge number - Assembly

Question

I have a huge number stored on the stack, and I access it through eax. It cannot fit in a register, so I'm using eax only to point at its address (the number is of a "natural" type, meaning the first 4 bytes contain the sign, the next 4 the length, and the rest the actual value).

I have to shift it edx times. I was thinking about starting from the LSB, shifting bits one by one (max 8 times per byte), and then copying those bits into the following byte. To do that, I would have to shift the next byte first, and so on, until the MSB position + 1 (worst case) or until all the shifts are done and no carry is left. PS: I was obviously talking about shl in this particular situation, but almost the same applies to shr.

Is there any simpler solution?

Answer 1

The classic 8-bit-era idea is to use RCL (rotate left through carry) interleaved with DEC counter + JNZ. Pause for a second and you can finally appreciate why the x86 DEC/INC instructions update the zero flag but leave the carry flag untouched (mystery solved).

So the code would go along these lines:

    mov   edi,address_of_last_byte
    mov   edx,count_of_bytes
    mov   cl,1
    clc   ; clear CF
loop_1_bit_left:
    rcl   byte [edi],cl    ; CF -> LSB, MSB -> CF
    dec   edi    ; preserves CF! Goes from last byte to first one
    dec   edx    ; preserves CF! Decrement counter
    jnz   loop_1_bit_left  ; till whole buffer is shifted
    ; CF has last bit, will be thrown away unless you do something about it
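For readers who want to check the carry chain outside an assembler, here is a minimal C model of the same loop. It is only a sketch: the function name shl1_bytes_be is made up for illustration, and it assumes the same layout as the assembly above (most significant byte at offset +0, processing starts at the last byte). The carry variable plays the role of CF.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift a big-endian byte buffer left by one bit, in place.
 * Returns the bit shifted out of the most significant byte,
 * i.e. what CF would hold after the assembly loop finishes.
 * (Hypothetical helper name, not part of the original answer.) */
unsigned shl1_bytes_be(uint8_t *buf, size_t len)
{
    unsigned carry = 0;                         /* clc */
    for (size_t i = len; i-- > 0; ) {           /* last byte first, like dec edi */
        unsigned next = buf[i] >> 7;            /* MSB of this byte -> new carry */
        buf[i] = (uint8_t)((buf[i] << 1) | carry); /* old carry -> LSB */
        carry = next;
    }
    return carry;
}
```

Calling it on the two bytes 0x81 0x80 leaves 0x03 0x00 in the buffer and returns 1, the bit that fell off the top.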

Now this leaves a lot to be desired...

How to save the MSB of the buffer? I would first calculate the buffer size required after the shift (new_length = arg_length + (shift+7)/8), copy the input into that buffer, and then shift new_length bytes rather than arg_length bytes. That resolves the problem of the MSB being truncated.
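As a quick check of that formula, here is a small C sketch (the name shifted_length is hypothetical). Every started group of 8 shift bits adds one byte, so the result is an upper bound: the top byte may stay zero if the input's leading bits were zero.

```c
#include <stddef.h>

/* Bytes needed to hold an arg_length-byte value after a left shift by
 * `shift` bits: new_length = arg_length + (shift + 7) / 8.
 * (Hypothetical helper name, used only for illustration.) */
size_t shifted_length(size_t arg_length, size_t shift)
{
    return arg_length + (shift + 7) / 8;
}
```

For a 4-byte value shifted by 315 bits this gives 4 + 322/8 = 44 bytes; a shift by 1 bit already costs one extra byte.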

But there is another problem: performance. Unfortunately, rcl is slow on modern x86 CPUs, so shifting by, for example, 315 bits this way is a very bad idea. But you don't have to. You can do the first 312 bits of the shift merely by copying the input number into the new_length buffer 39 bytes off (toward the beginning), and then do the remaining 3 bit shifts one by one with the loop above.
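The split into whole bytes plus leftover bits can be sketched in C as follows. This is a model under assumptions, not the assembly itself: the names shl1_be and shl_big_be are invented here, the value is stored big-endian (MSB at offset +0), and the caller provides an output buffer of len + (shift + 7) / 8 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One-bit left shift of a big-endian byte buffer (C model of an rcl loop). */
static void shl1_be(uint8_t *buf, size_t len)
{
    unsigned carry = 0;
    for (size_t i = len; i-- > 0; ) {
        unsigned next = buf[i] >> 7;
        buf[i] = (uint8_t)((buf[i] << 1) | carry);
        carry = next;
    }
}

/* Shift `in` (len bytes, big-endian) left by `shift` bits into `out`.
 * `out` must have room for len + (shift + 7) / 8 bytes; returns that size.
 * (Hypothetical helper, names are assumptions.) */
size_t shl_big_be(uint8_t *out, const uint8_t *in, size_t len, size_t shift)
{
    size_t new_len = len + (shift + 7) / 8;
    size_t whole_bytes = shift / 8;   /* e.g. 315 / 8 = 39 bytes by plain copy */
    size_t rest_bits = shift % 8;     /* e.g. 315 % 8 = 3 bits left to shift   */

    memset(out, 0, new_len);
    /* Place the input so that whole_bytes zero bytes follow it; the offset
     * is 1 when rest_bits > 0 (spare top byte) and 0 otherwise. */
    memcpy(out + (new_len - len - whole_bytes), in, len);

    for (size_t b = 0; b < rest_bits; b++)
        shl1_be(out, new_len);        /* at most 7 passes over the buffer */
    return new_len;
}
```

Shifting the single byte 0xFF by 11 bits this way copies it one byte off and then runs three one-bit passes, producing 0x07 0xF8 0x00.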

Plus, if you pad the output buffer enough, you can use the dword/qword rcl variants (32b/64b code) to process more bytes at once. (Actually, from your description it's not clear who is responsible for allocating the output buffer. If your code returns it somehow on the stack (?? I'm not sure in which ABI that is possible with a buffer that grows dynamically with the shift amount) or allocates it on the heap, throw in a few more bytes on top, so that you can modify a few bytes after the last regular byte of the value and work with dwords/qwords instead, at 4/8B-aligned (!) addresses.)

EDIT: the word/dword variants of rcl/rcr will work correctly only when the whole big number in the array follows the little-endian layout of x86 and the loop goes in the correct ++/-- direction (bits b0-b7 are at offset +0 in the byte array, bits b80-b87, for example, are at offset +10, and shifting right goes from the MSB (+10) b87 toward the LSB (+0) b0). My initial byte [edi] example expects the number to be stored big-endian, with the MSB starting at offset +0 and the LSB at the final offset, so the bits can be read in human order b87 .. b0. Little-endian has them visually "reversed" per byte group (b7 .. b0 b15 .. b8 ... ... ... b87 ... b80) ... at least I think so, now I'm starting to get confused. Simply write the code one way, create unit tests for simple corner cases, verify the results, and fix it until it produces what you expect. :D

Just make sure you don't update edi with sub edi,4 (sub rdi,8) in that case, as that would destroy the CF content; instead, exploit lea edi,[edi-4], a simple calculation done by the addressing mode that leaves the flags alone. And adjust the counter to the correct /4 or /8 value.
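A C model of the dword variant may help with getting the direction right. It is a sketch with an invented name (shl1_limbs_le) and it assumes the little-endian layout described in the EDIT: limbs[0] holds bits b0..b31, so for a left shift the carry travels toward higher addresses and the counter is the byte count divided by 4. C has no carry flag, so the pointer-update concern from the assembly does not arise here.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift a little-endian array of 32-bit limbs left by one bit, in place.
 * limbs[0] holds the least significant 32 bits. Returns the bit shifted
 * out of the top limb. (Hypothetical helper name.) */
unsigned shl1_limbs_le(uint32_t *limbs, size_t count)
{
    unsigned carry = 0;
    for (size_t i = 0; i < count; i++) {   /* count = byte length / 4 */
        unsigned next = limbs[i] >> 31;    /* top bit leaves this limb */
        limbs[i] = (limbs[i] << 1) | carry;
        carry = next;
    }
    return carry;
}
```

With limbs 0x80000001, 0x00000001 the result is 0x00000002, 0x00000003: the top bit of the low limb reappears as the bottom bit of the next one.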

For best performance it would probably still be worth shifting by 1-7 bits in one go. For a 1-bit left shift you may keep the rcl version; for a 2-7 bit shift, use some variant of masking/ORing values shifted by the target amount in a single pass, for example using 32b registers to handle 16b reads/writes of the buffer and keeping the shifted-out bits in the upper half. Or, if you go that far, maybe profile a 1-bit variant built from shl/and/or to see whether it is faster than the rcl one. As rcl is not used by compilers, a particular CPU may prefer several shl/and/or instructions over a single rcl.
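The single-pass idea can be modelled in C like this. It is a sketch, not a tuned routine: the name shl_small_be is made up, the buffer is assumed big-endian, and a wider temporary stands in for the wider register, with the shifted-out bits ending up above bit 7.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift a big-endian byte buffer left by k bits (1 <= k <= 7) in one pass.
 * Returns the k bits shifted out of the top byte.
 * (Hypothetical helper name, used only for illustration.) */
unsigned shl_small_be(uint8_t *buf, size_t len, unsigned k)
{
    unsigned carry = 0;                        /* bits arriving from the right */
    for (size_t i = len; i-- > 0; ) {
        unsigned wide = ((unsigned)buf[i] << k) | carry; /* at most 15 bits */
        buf[i] = (uint8_t)wide;                /* low 8 bits stay in this byte */
        carry = wide >> 8;                     /* upper bits go to the next one */
    }
    return carry;
}
```

For the bytes 0x12 0x34 and k = 4 this yields 0x23 0x40 with 1 returned, matching 0x1234 << 4 = 0x12340.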

Fun fact: the very first Z80 Assembly code I wrote completely on my own did exactly this, shifting one huge area of memory 1 bit left (and right). As that huge memory area was actually the video RAM of a ZX Spectrum computer, it effectively moved the image left/right by 1 pixel (the ZX used 1 bit per pixel).

And I didn't realize it was possible to carry CF over from one rotate to the next, so I did this by masking the bit separately, copying it into another register, then restoring it from there into the new byte, and so on.

So I wrote it, ran it (had to reset the ZX because of a bug), fixed the bug, ran it again, and watched the image move ... about 10 times slower (somewhere around 3 frames per second) than I expected from "almighty fast Assembly code". Then a friend of mine showed me how to just rotate it, which made the code run at somewhere around 20 FPS (which still made me realize that even "fast assembly" is not unlimited, and that I would have to work on my code a lot to get anything decent-looking on the ZX screen).

Answer 2

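I would rather ROL or ROR the value, cut off the bits that wrapped around, and apply them to the next byte (after applying exactly the same procedure to it).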
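One way to read this suggestion, modelled in C: rotate each byte, mask off the k bits that wrapped around to the bottom, and OR in the bits that wrapped out of the previous (less significant) byte. This is an interpretation rather than the answerer's code; the names rol8 and shl_rotate_be are hypothetical and the buffer is assumed big-endian.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotate an 8-bit value left by k bits (1 <= k <= 7), like ROL on a byte. */
static uint8_t rol8(uint8_t v, unsigned k)
{
    return (uint8_t)((v << k) | (v >> (8 - k)));
}

/* Shift a big-endian byte buffer left by k bits (1 <= k <= 7) using rotates.
 * Returns the bits that wrapped out of the top byte.
 * (Hypothetical helper name.) */
unsigned shl_rotate_be(uint8_t *buf, size_t len, unsigned k)
{
    uint8_t mask = (uint8_t)((1u << k) - 1);   /* selects the wrapped bits */
    uint8_t carry = 0;
    for (size_t i = len; i-- > 0; ) {
        uint8_t r = rol8(buf[i], k);
        uint8_t wrapped = r & mask;            /* bits that belong one byte up */
        buf[i] = (uint8_t)((r & (uint8_t)~mask) | carry);
        carry = wrapped;
    }
    return carry;
}
```

On 0x12 0x34 with k = 4 the rotates give 0x21 and 0x43; after masking and merging, the buffer holds 0x23 0x40 and the function returns 1.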

2 answers

Answer 1
1 (accepted) 2016-12-05 12:09:56

Answer 2
0 2016-12-05 10:34:11
