Shifting a huge number - assembly

I have a huge number stored on the stack, and I access it through eax. It cannot fit in a register; eax only holds its address (the number is of a natural type, meaning the first 4 bytes contain the sign, the next 4 the length, and the rest the actual value).

I have to shift it edx times. I was thinking about starting from the LSB, shifting bits one at a time (at most 8 times per byte), and then copying the shifted-out bits into the following byte. To do that, I would have to shift the next byte first, and so on up to the MSB position + 1 (worst case), or until all the shifts are done and no carry is left. PS: I am talking about shl here, but almost the same applies to shr.

Is there any simpler solution?

The classic 8-bit era idea is to use RCL (rotate left through carry) interleaved with DEC counter + JNZ. You can pause for a second and finally appreciate why the x86 DEC/INC instructions update the zero flag (along with the other arithmetic flags) but leave the carry flag alone (mystery solved).

So the code would go along these lines:

    mov   edi,address_of_last_byte
    mov   edx,count_of_bytes
    mov   cl,1
    clc   ; clear CF
loop_1_bit_left:
    rcl   byte [edi],cl    ; CF -> LSB, MSB -> CF
    dec   edi    ; preserves CF! Goes from last byte to first one
    dec   edx    ; preserves CF! Decrement counter
    jnz   loop_1_bit_left  ; till whole buffer is shifted
    ; CF has last bit, will be thrown away unless you do something about it
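
For readers who want to check the loop against a reference, here is a minimal C sketch of the same idea, assuming the big-endian byte order the assembly expects (MSB at offset 0). The function name `shl1_big_endian` is made up for illustration; the `carry` variable plays the role of CF.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift a big-endian byte array (MSB at buf[0]) left by one bit.
   Returns the bit shifted out of the most significant byte, which
   corresponds to CF after the assembly loop has finished. */
unsigned shl1_big_endian(uint8_t *buf, size_t len)
{
    unsigned carry = 0;                     /* clc */
    for (size_t i = len; i-- > 0; ) {       /* last byte first, like dec edi */
        unsigned wide = ((unsigned)buf[i] << 1) | carry;  /* rcl byte [edi],1 */
        buf[i] = (uint8_t)wide;             /* keep the low 8 bits */
        carry = (wide >> 8) & 1u;           /* old bit 7 becomes the new carry */
    }
    return carry;
}
```

Comparing the output of the assembly routine with a model like this on a few small inputs is a cheap way to catch direction or off-by-one mistakes.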

Now this leaves a lot to be desired...

How to save the MSB of the buffer? I would first calculate the required size of the buffer after the shift (new_length = arg_length + (shift+7)/8), copy the input into it, and then shift not arg_length bytes but new_length bytes. That resolves the problem of truncating the MSB.
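
The size calculation and the copy can be sketched in C as follows. This assumes a big-endian value (MSB first), so the new high-order bytes go in front of the copied input; `shifted_length` and `copy_zero_extended` are hypothetical helper names, not part of any library.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes needed to hold an arg_length-byte value shifted left by shift bits. */
size_t shifted_length(size_t arg_length, size_t shift)
{
    return arg_length + (shift + 7) / 8;
}

/* Copy src (big-endian, MSB first) into the low-order end of dst and
   zero the new high-order bytes in front of it.
   dst must have room for new_length bytes, and new_length >= arg_length. */
void copy_zero_extended(uint8_t *dst, size_t new_length,
                        const uint8_t *src, size_t arg_length)
{
    size_t pad = new_length - arg_length;   /* extra high-order bytes */
    memset(dst, 0, pad);
    memcpy(dst + pad, src, arg_length);
}
```

For example, a 4-byte value shifted by 9 bits needs 4 + (9+7)/8 = 6 bytes.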

But there's another problem: performance. rcl is unfortunately slow on modern x86 CPUs, so doing, for example, a shift by 315 bits this way is a very bad idea. But you don't have to. You can do the shift by 312 bits first, merely by copying the input number 39 bytes off (toward the beginning) into the new_length buffer, and then do the remaining 3 bit shifts one by one with the loop above.
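
A rough C model of that split, again assuming big-endian bytes (MSB first): the whole-byte part of the shift is handled by where the input is copied, and only the remaining 0-7 bits go through the slow one-bit loop. The names `shl1_bytes` and `shl_split` are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One-bit left shift over len big-endian bytes (MSB at p[0]). */
static void shl1_bytes(uint8_t *p, size_t len)
{
    unsigned carry = 0;
    for (size_t i = len; i-- > 0; ) {
        unsigned wide = ((unsigned)p[i] << 1) | carry;
        p[i] = (uint8_t)wide;
        carry = wide >> 8;                  /* 0 or 1 */
    }
}

/* Shift src left by shift bits into dst.
   dst must hold arg_length + (shift + 7) / 8 bytes; returns that length. */
size_t shl_split(uint8_t *dst, const uint8_t *src,
                 size_t arg_length, size_t shift)
{
    size_t whole = shift / 8;               /* e.g. 315 / 8 = 39 bytes */
    size_t rest  = shift % 8;               /* e.g. 315 % 8 = 3 bits  */
    size_t new_length = arg_length + (shift + 7) / 8;
    size_t lead = new_length - arg_length - whole;  /* 1 if rest != 0, else 0 */

    memset(dst, 0, new_length);
    memcpy(dst + lead, src, arg_length);    /* the byte-granular part of the shift */
    for (size_t k = 0; k < rest; ++k)       /* the trailing zero bytes need no work */
        shl1_bytes(dst, lead + arg_length);
    return new_length;
}
```

With src = {0x01, 0x80} and shift = 11, the copy accounts for 8 bits and the loop runs 3 times, giving {0x00, 0x0C, 0x00, 0x00}.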

Plus, if you pad the output buffer enough, you can use the dword/qword rcl variants (in 32b/64b code) to process more bytes at a time. (Actually, from your description it's not clear who is responsible for allocating the output buffer. If your code returns it somehow on the stack (?? I'm not sure which ABI makes that possible with a buffer that grows dynamically with the shift amount) or allocates it on the heap, throw in a few more bytes on top, so you can modify a few bytes past the last regular byte of the value. Then you can work with dwords/qwords instead, and over 4/8B aligned (!) addresses.)
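
As a sketch of the wider variant, the C function below shifts an array of 32-bit limbs left by one bit. It assumes the buffer has already been padded to a whole number of limbs and that limbs are stored least significant first (little-endian limb order); `shl1_limbs` is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift an array of 32-bit limbs left by one bit.
   limbs[0] is the least significant limb. Returns the bit shifted out
   of the most significant limb (what CF would hold at the end). */
unsigned shl1_limbs(uint32_t *limbs, size_t count)
{
    unsigned carry = 0;
    for (size_t i = 0; i < count; ++i) {    /* least significant limb first */
        unsigned out = limbs[i] >> 31;      /* top bit leaves this limb */
        limbs[i] = (limbs[i] << 1) | carry; /* previous top bit enters at bit 0 */
        carry = out;
    }
    return carry;
}
```

Each iteration handles 4 bytes, so the loop runs a quarter as many times as the byte version.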


EDIT: the word/dword variants of rcl/rcr will work correctly only when the whole big number in the array follows the little-endian layout of x86 and the loop runs in the matching ++/-- direction (bits b0-b7 are at offset +0 in the byte array, bits b80-b87, for example, are at offset +10, and shifting right goes from the MSB (+10) b87 toward the LSB (+0) b0). My initial byte [edi] example expects big-endian layout, with the MSB at offset +0 and the LSB at the last offset, so the bits can be read in human order b87 .. b0. Little-endian has them visually "reversed" per byte group (b7 .. b0 b15 .. b8 ... b87 .. b80) ... at least I think so, now I'm starting to get confused. Simply write the code one way, create unit tests for simple corner cases, verify the results, and fix it until it produces what you expect. :D
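
To make the direction concrete, here is a small C model of a one-bit right shift over a little-endian byte array, walking from the highest offset (MSB) down to offset 0 (LSB), as an rcr loop would. The name `shr1_little_endian` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift a little-endian byte array (LSB at buf[0]) right by one bit.
   Returns the bit shifted out of the least significant byte. */
unsigned shr1_little_endian(uint8_t *buf, size_t len)
{
    unsigned carry = 0;
    for (size_t i = len; i-- > 0; ) {       /* highest offset = MSB first */
        unsigned out = buf[i] & 1u;         /* bit 0 leaves this byte */
        buf[i] = (uint8_t)((buf[i] >> 1) | (carry << 7));
        carry = out;                        /* and enters the next lower byte at bit 7 */
    }
    return carry;
}
```

A two-byte check is already enough to expose a wrong direction: the array {0x01, 0x03} holds 0x0301, and shifting right by one must give 0x0180, i.e. {0x80, 0x01}, with a carry-out of 1.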


Just make sure you don't update edi with sub edi,4 (sub rdi,8) in that case, as that would destroy the CF content. Instead, exploit lea edi,[edi-4], a simple calculation done by the addressing mode that leaves the flags untouched. And adjust the counter to the correct /4 or /8 value.

For best performance it would probably still be worth shifting by 1-7 bits in one go: for 1 bit left you may keep the rcl version; for a 2-7 bit shift, use some variant of masking/oring values shifted by the target amount in a single pass, using for example 32b registers to handle 16b reads/writes of the buffer and keeping the shifted-out bits in the upper half. Or, if you go that far, the 1-bit variant with shl/and/or could be profiled to see whether it is faster than the rcl one. As rcl is not used by compilers, a particular CPU may prefer several shl/and/or instructions over a single rcl.
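
One possible shape of such a single-pass shift, sketched in C under the assumption of big-endian bytes (MSB first): each byte is widened, shifted by the full amount, and the bits that spill past bit 7 are carried into the next higher byte. `shl_bits_big_endian` is an illustrative name, and this is only a model of the idea, not a tuned implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift a big-endian byte array (MSB at buf[0]) left by 1..7 bits in one pass.
   Returns the bits shifted out of the most significant byte. */
unsigned shl_bits_big_endian(uint8_t *buf, size_t len, unsigned bits)
{
    assert(bits >= 1 && bits <= 7);
    unsigned carry = 0;                     /* bits arriving from the lower byte */
    for (size_t i = len; i-- > 0; ) {       /* least significant byte first */
        unsigned wide = ((unsigned)buf[i] << bits) | carry;
        buf[i] = (uint8_t)wide;             /* low 8 bits stay in place */
        carry = wide >> 8;                  /* spilled bits move to the next byte */
    }
    return carry;
}
```

The `wide` variable corresponds to keeping the shifted-out bits in the upper half of a wider register.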


Fun fact: the very first Z80 assembly code I wrote completely on my own did exactly this: it shifted one huge area of memory 1 bit left (and right). As that huge memory area was actually the video RAM of a ZX Spectrum computer, it effectively moved the image left/right by 1 pixel (the ZX used 1 bit per pixel).

And I didn't realize it's possible to pass CF from one rotate to the next, so I did this by masking the bit separately, copying it into another register, then restoring it from there into the new byte, etc.

So I wrote it, ran it (had to reset the ZX because of a bug), fixed the bug, ran it, and watched the image move ... about 10 times slower (somewhere around 3 frames per second) than I expected from "almighty fast Assembly code". Then a friend of mine showed me how to just rotate it, which made the code run somewhere around 20 FPS (which still made me realize that even "fast assembly" is not unlimited, and that I would have to work on my code a lot to get anything decent looking on the screen of the ZX).

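I would rather ROL or ROR the value, cut off the bits that wrapped around, and apply them to the next byte (after applying exactly the same procedure to it).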
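
The rotate-based idea from this comment can be modeled in C roughly as follows, for a left shift by 1..7 bits. The comment does not state a byte order, so big-endian (MSB first) is assumed here; `rol8` and `shl_by_rotate` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate an 8-bit value left by n bits, 1 <= n <= 7. */
static uint8_t rol8(uint8_t v, unsigned n)
{
    return (uint8_t)((v << n) | (v >> (8 - n)));
}

/* Shift a big-endian byte array (MSB at buf[0]) left by 1..7 bits:
   rotate each byte, cut off the bits that wrapped around into the low
   end, and OR them into the next higher byte after rotating that one too.
   Returns the bits that fell off the most significant byte. */
unsigned shl_by_rotate(uint8_t *buf, size_t len, unsigned bits)
{
    assert(bits >= 1 && bits <= 7);
    unsigned mask = (1u << bits) - 1u;      /* positions of the wrapped bits */
    unsigned carry = 0;
    for (size_t i = len; i-- > 0; ) {       /* least significant byte first */
        unsigned r = rol8(buf[i], bits);
        unsigned wrapped = r & mask;        /* bits that rotated around */
        buf[i] = (uint8_t)((r & ~mask) | carry);
        carry = wrapped;                    /* they belong to the next byte */
    }
    return carry;
}
```

This produces the same result as a plain shift with carry, just with the spilled bits extracted by rotate and mask instead of by a wider register.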

