iPhone ARMv6 VFP asm latency, throughput and hazards

In this document: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0301g/DDI0301G_arm1176jzfs_r0p7_trm.pdf

on page 21-25 (PDF page 875), the throughput and latency timings are given for the assembly instructions of the VFP unit.

Are those numbers independent of vector size?

1: Let's take FMULS, which has a throughput of 1 and a latency of 8. Does that mean I can start a new FMULS operation every cycle, as long as I don't use a register that is still being computed by a previous instruction? For example:

FMULS s8, s16, s20
FMULS s12, s21, s25

Will those execute right after each other?

2: What happens if I have two FMULS instructions in a row where one argument of the second depends on the result of the first?

FMULS s8, s16, s20
FMULS s12, s21, s8

Will the VFP wait for 8 cycles before starting to process the second instruction?

3: What if we are in vector mode with 4 elements and, at the second FMULS instruction, all input registers but one are available? What will happen?

4: sqrt and division: will a sqrt or division operation prevent any subsequent operation from being started for 19 cycles?

Thanks!

Your questions are all answered in the document that you linked. You should read it carefully.

Are those numbers independent of vector size?

No. See, for example, Table 21-15 in the document you linked. Note the latency of the short-vector FADDS.

Does it mean that I can start a new FMULS operation every cycle if it doesn't depend on an earlier result that isn't available yet?

Yes, that's the definition of throughput.

What happens if I have two FMULS instructions in a row where one argument of the second depends on the result of the first?

Execution will stall until the result of the first FMULS is available. See section 21.6 "Operation of the scoreboards" for more detail.
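
To make the difference between throughput and latency concrete, here is a minimal sketch of an in-order issue model in Python. It is a toy, not a cycle-accurate model of the VFP: it uses only the throughput (1) and latency (8) figures quoted in the question, treats "latency" as the number of cycles before a dependent instruction can issue, and ignores write-after-write hazards, short-vector iteration and any other scoreboard detail. The function name `issue_cycles` and the tuple layout are made up for illustration.

```python
# Toy in-order issue model; NOT cycle-accurate for the real VFP unit.
FMULS_LATENCY = 8      # cycles until a dependent instruction can use the result (from the question)
FMULS_THROUGHPUT = 1   # cycles between issues of independent FMULS (from the question)


def issue_cycles(program, latency=FMULS_LATENCY, throughput=FMULS_THROUGHPUT):
    """Return the cycle at which each (dest, src1, src2) instruction issues."""
    ready_at = {}   # register name -> cycle its pending result becomes available
    next_slot = 0   # earliest cycle the pipeline accepts another instruction
    issued = []
    for dest, src1, src2 in program:
        # Stall until the pipeline is free and both source registers are available.
        cycle = max(next_slot, ready_at.get(src1, 0), ready_at.get(src2, 0))
        issued.append(cycle)
        ready_at[dest] = cycle + latency
        next_slot = cycle + throughput
    return issued


# The two instruction pairs from the question.
independent = [("s8", "s16", "s20"), ("s12", "s21", "s25")]
dependent = [("s8", "s16", "s20"), ("s12", "s21", "s8")]

print(issue_cycles(independent))  # [0, 1]
print(issue_cycles(dependent))    # [0, 8]
```

In this model the independent pair issues back to back, while the dependent pair leaves seven idle issue slots that could be filled with unrelated work.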

What if we are in vector mode with 4 elements and, at the second FMULS instruction, all input registers but one are available? What will happen?

It will stall. Same reference.

sqrt and division: will a sqrt or division operation prevent any subsequent operation from being started for 19 cycles?

No. See section 21.10 "Parallel Execution". An example is given in Table 21-15, in which a non-dependent FADDS executes immediately following FDIVS.
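
The idea of a long divide overlapping with independent work can be sketched with a second toy model in Python, this time with two separate execution units. Only the 19-cycle divide latency comes from the question. The unit labels, the divide unit's busy time and the FADDS figures are assumptions chosen for illustration, so check them against the timing tables in the TRM before relying on them. The function name `schedule` is hypothetical as well.

```python
# Toy two-unit issue model; numbers marked ASSUMED are illustrative, not taken from the thread.
UNITS = {
    # mnemonic: (unit label, cycles the unit stays busy, result latency)
    "FDIVS": ("DS", 15, 19),   # latency 19 from the question; busy time ASSUMED
    "FADDS": ("FMAC", 1, 8),   # ASSUMED to match the FMULS figures in the question
}


def schedule(program):
    """Return (mnemonic, issue cycle) pairs for an in-order instruction stream."""
    unit_free = {}   # unit label -> first cycle it accepts a new instruction
    ready_at = {}    # register -> cycle its pending result becomes available
    next_issue = 0   # in-order issue, at most one instruction per cycle
    issued = []
    for op, dest, srcs in program:
        unit, busy, latency = UNITS[op]
        # Wait for the issue slot, the target unit, and every source register.
        waits = [next_issue, unit_free.get(unit, 0)]
        waits += [ready_at.get(s, 0) for s in srcs]
        cycle = max(waits)
        issued.append((op, cycle))
        unit_free[unit] = cycle + busy
        ready_at[dest] = cycle + latency
        next_issue = cycle + 1
    return issued


program = [
    ("FDIVS", "s0", ("s1", "s2")),   # long-running divide
    ("FADDS", "s3", ("s4", "s5")),   # independent: does not wait for the divide
    ("FADDS", "s6", ("s0", "s3")),   # depends on the divide result
]

print(schedule(program))  # [('FDIVS', 0), ('FADDS', 1), ('FADDS', 19)]
```

Under these assumptions only the instruction that reads the divide result waits for it; the independent FADDS issues on the very next cycle.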

Note that, for many types of computation, it can be a bit of a challenge (though not impossible) to write short-vector VFP code that performs substantially faster than scalar code. Even if you learn how to do it, it will be of questionable value, since the NEON unit seems to be the new model for vector computation on ARM. You may be better served in the long run by ignoring short-vector operation for now and focusing on learning NEON for the future.
