
Want to understand load imputed by floating point instructions

At the outset, this may be a part-discussion, part-problem-solving kind of question. No offense is intended to anyone.

I have written, in 64-bit assembly, an MT Prime based random number generator that produces 64-bit values. The generator function has to be called about 8 billion times (2048 x 2048 x 2048 = 2^33, roughly 8.6 billion) to populate an array of size 2048x2048x2048, generating for each element a random number in 1..small_value (usually 32).

I then had two possible next steps:

(a) Keep generating numbers, compare each with the limits [1..32], and discard those that fall outside. The run time for this logic is 181,817 ms, measured by calling the clock() function.

(b) Take the 64-bit random number output in RAX, scale it using the FPU to lie in [0..1], and then scale it up to the desired range [1..32]. The code sequence for this is below:

 mov word ptr initialize_random_number_scaling,dx
 fnclex             ; clears status flag
 call generate_fp_random_number ; returns a random number in ST(0) between [0..1]
 fimul word ptr initialize_random_number_scaling ; Mults ST(0) & stores back in ST(0)
 mov word ptr initialize_random_number_base,ax ; Saves base to a memory
 fiadd word ptr initialize_random_number_base  ; adds the base to the scaled fp number
 frndint                            ; rounds off the ST(0)
 fist word ptr initialize_random_number_result ; and stores this number to result.
 ffree st(0)               ; releases ST(0)
 fincstp                       ; Logically pops the FPU
 mov ax, word ptr initialize_random_number_result       ; and saves it to AX

The instructions in generate_fp_random_number are as follows:

 shl rax,1  ; RAX gets the original 64 bit random number using MT prime algorithm
 shr ax,1   ; Clear top bit
 mov qword ptr random_number_generator_act_number,rax ; Save the number in memory as we cannot move to ST(0) a number from register
 fild   qword ptr random_number_generator_max_number    ; Load 0x7FFFFFFFFFFFFFFFH
 fild   qword ptr random_number_generator_act_number    ; Load our number
 fdiv   st(0),st(1) ; We return the value through ST(0) itself, divide our random number with max possible number
 fabs
 ffree st(1)    ; release the st(1)
 fld1           ; push to top of stack a 1.0
 fcomip st(0), st(1)    ; compares our number in ST(1) with ST(0) and sets CF.
 jc generate_fp_random_get_next_no ; if ST(0) (=1.0) < ST(1) (our no), we need a new no
 fldz               ; push to top of stack a 0.0
 fcomip st(0),st(1) ; if ST(0) (=0.0) >ST(1) (our no) clears CF
 jnc generate_fp_random_get_next_no ; so if the number is above zero the CF will be set
 fclex

The problem is that just adding these instructions makes the run time jump to a whopping 5,633,963 ms! I have also written the above using xmm registers as an alternative, and the difference is absolutely marginal (5,633,703 ms).
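For a sense of scale, the two run times can be converted to a cost per array element. The sketch below is only a rough calculation: it assumes exactly 2048^3 elements and a constant 4.4 GHz clock (the overclocked figure from the environment notes), and the helper names are made up for illustration.

```c
#include <assert.h>

/* Hypothetical helper: average wall time per array element, in nanoseconds. */
static double ns_per_element(double total_ms, double elements)
{
    assert(elements > 0.0);
    return total_ms * 1.0e6 / elements;   /* 1 ms = 1e6 ns */
}

/* Hypothetical helper: average clock cycles per element.
   Assumes the core runs at a constant clock_ghz for the whole run. */
static double cycles_per_element(double total_ms, double elements,
                                 double clock_ghz)
{
    /* ns * (cycles per ns) = cycles */
    return ns_per_element(total_ms, elements) * clock_ghz;
}
```

Under those assumptions, 181,817 ms works out to about 21 ns (roughly 93 cycles) per element, and 5,633,963 ms to about 656 ns (roughly 2,886 cycles) per element, a ratio of about 31. The first figure also includes whatever draws approach (a) discards, so it is a per-element cost rather than a per-call cost.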

Could anyone kindly explain how much load these additional instructions add to the total run time? Is the FPU really this slow, or am I missing a trick? As always, all ideas are welcome, and I am grateful for your time and effort.

Env: Windows 7 64-bit on an Intel 2700K CPU overclocked to 4.4 GHz, 16 GB RAM, debugged in the VS 2012 Express environment.

"mov word ptr initialize_random_number_base,ax ; Saves base to a memory"

If you want maximum speed, you must find out how to keep the instructions and the data they write in separate sections of memory.

Rewriting data in the same area of cache as the code creates a "self-modifying code" situation.

Your compiler may do this, or it may not. You need to know, because unoptimised assembly code runs 10 to 50 times slower.

"All modern processors cache code and data memory for efficiency. Performance of assembly-language code can be seriously impaired if data is written to the same block of memory as that in which the code is executing, because it may cause the CPU repeatedly to reload the instruction cache (this is to ensure self-modifying-code works correctly). To avoid this, you should ensure that code and (writable) data do not occupy the same 2 Kbyte block of memory."

http://www.bbcbasic.co.uk/bbcwin/manual/bbcwina.html#cache

There's a ton of stuff in your code that I can see no reason for. If there was a reason, feel free to correct me; otherwise, here are my alternatives:

For generate_fp_random_number:

shl rax, 1
shr rax, 1
mov qword ptr act_number, rax
fild qword ptr max_number
fild qword ptr act_number
fdivrp   ; divide actual by max and pop
; and that's it. It's already within bounds.
; It can't be outside [0, 1] by construction.
; It can't be < 0 because we just divided a non-negative number by a positive one,
; and it can't be > 1 because we divided by the max it could be
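The bounds argument in the comments above can be checked with a small C sketch. It assumes the same steps (clear the top bit, then divide by 0x7FFFFFFFFFFFFFFF). The name `to_unit_interval` is hypothetical, and `double` is used here for portability; it carries fewer mantissa bits than the 80-bit x87 format that `fild` loads into, so the low-order bits of the input are rounded away.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: maps a raw 64-bit value to the range [0, 1]. */
static double to_unit_interval(uint64_t raw)
{
    const uint64_t max_number = 0x7FFFFFFFFFFFFFFFULL;

    /* Same effect as "shl rax,1" followed by "shr rax,1": clear bit 63. */
    uint64_t act_number = (raw << 1) >> 1;

    /* Both conversions round to nearest, so the numerator can never
       exceed the denominator after conversion. */
    double unit = (double)act_number / (double)max_number;

    /* The claim from the answer: no range check is needed afterwards. */
    assert(unit >= 0.0 && unit <= 1.0);
    return unit;
}
```

For example, an input of 0 maps to 0.0, an input with only bit 63 set also maps to 0.0 because that bit is cleared, and an all-ones input maps to 1.0.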

For the other thing:

mov word ptr scaling, dx
mov word ptr base, ax
call generate_fp_random_number
fimul word ptr scaling
fiadd word ptr base
fistp word ptr result  ; just save that thing
mov ax, word ptr result
; the default rounding mode is round to nearest,
; so the slow frndint is unnecessary

Also note the complete lack of ffree's etc. By making the right instructions pop, it all just worked out. It usually does.
