
gcc half precision floating point (binary16, alternative, __fp16) on Raspberry Pi uses library function

I run a machine learning based algorithm on a Raspberry Pi 3 with huge arrays of stored coefficients that do not need full float32 precision.

I tried to use half precision floating point for storing this data to reduce the program's memory (and maybe memory bandwidth) footprint.

The rest of the algorithm stays the same.
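For context, a minimal sketch of how the storage type could be switched, assuming the coefficient arrays are declared through the Coeff typedef used in the test function below (the USE_FP16 switch is illustrative, not taken from my actual build):

/* Illustrative only: switching this typedef halves the per-coefficient
   storage while the rest of the algorithm stays unchanged. */
#ifdef USE_FP16
typedef __fp16 Coeff;   /* 2 bytes per coefficient */
#else
typedef float  Coeff;   /* 4 bytes per coefficient */
#endif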

Comparing the float32 version with the float16 version, I see a significant performance loss (+33% runtime of my test program) when using __fp16, although the conversion should be supported by the CPU.

I took a look at the assembler output and also created a simple function that just reads a __fp16 value and returns it as float, and it seems that a library function call is used for the conversion (the same function that is called in the actual code).

The Raspberry Pi's CPU should have half precision hardware support, so I expected to see some instruction loading the data and no performance impact (or even an improvement due to reduced memory bandwidth requirements).

I am using the following compiler flags:

-O3 -mfp16-format=alternative -mfpu=neon-fp16 -mtune=cortex-a53 -mfpu=neon

Here is the small piece of code and the assembler output for the little test function:

const float test(const Coeff *i_data, int i) {
  return (float)(i_data[i]);
}

using float for Coeff:

    .align  2
    .global test
    .syntax unified
    .arm
    .fpu neon
    .type   test, %function
test:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    add r1, r0, r1, lsl #2  @ tmp118, i_data, i,
    vldr.32 s0, [r1]    @, *_5
    bx  lr  @

using __fp16 for Coeff (-mfp16-format=alternative):

    .align  2
    .global test
    .syntax unified
    .arm
    .fpu neon
    .type   test, %function
test:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    lsl r1, r1, #1  @ tmp118, i,
    push    {r4, lr}    @
    ldrh    r0, [r0, r1]    @ __fp16    @, *_5
    bl  __gnu_h2f_alternative   @
    vmov    s0, r0  @,
    pop {r4, pc}    @

using __fp16 for Coeff (-mfp16-format=ieee):

    .align  2
    .global test
    .syntax unified
    .arm
    .fpu neon
    .type   test, %function
test:
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    lsl r1, r1, #1  @ tmp118, i,
    push    {r4, lr}    @
    ldrh    r0, [r0, r1]    @ __fp16    @, *_5
    bl  __gnu_h2f_ieee  @
    vmov    s0, r0  @,
    pop {r4, pc}    @

Have I missed something?

The compiler flag -mfpu=neon overrides the earlier -mfpu=neon-fp16, since -mfpu= can only be specified once.

It was a mistake that it was set twice (it was added in a different place in the Makefile).

But since the Raspberry Pi 3 has a VFPv4 unit, which always has fp16 support, the best specification is -mfpu=neon-vfpv4.
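With the duplicate -mfpu= removed, the flag set becomes (reconstructed from the flags quoted above):

-O3 -mfp16-format=alternative -mfpu=neon-vfpv4 -mtune=cortex-a53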

In this case no library calls are generated by the compiler for the conversion.
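For comparison, here is a hand-written sketch of the kind of sequence one would expect for the little test function once the hardware converter is usable; this is illustrative only, not verified compiler output (VCVTB.F32.F16 is the VFP half-to-single conversion instruction):

test:
    add r1, r0, r1, lsl #1  @ address of i_data[i]
    ldrh    r0, [r1]    @ load the 16-bit half
    vmov    s0, r0  @ move it into a VFP register
    vcvtb.f32.f16   s0, s0  @ hardware half -> single conversion
    bx  lr  @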

edit: according to this gist, -mfpu=neon-fp-armv8 -mneon-for-64bits can be used for the Raspberry Pi 3.

On ARM's site: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0774d/chr1421838476257.html

Note: The __fp16 type is a storage format only. For purposes of arithmetic and other operations, __fp16 values in C or C++ expressions are automatically promoted to float.
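A minimal sketch of what that promotion means in practice, assuming GCC on ARM with one of the -mfp16-format options enabled; the function name and shape are illustrative:

/* __fp16 is storage-only: each element is promoted to float before the
   multiply/add, so arithmetic still happens at float32 precision. */
float dot(const __fp16 *a, const __fp16 *b, int n) {
    float acc = 0.0f;           /* accumulator stays in float */
    for (int i = 0; i < n; ++i)
        acc += a[i] * b[i];     /* implicit __fp16 -> float promotion */
    return acc;
}

So the savings from __fp16 are in storage and memory bandwidth only; the half-to-float conversions themselves have to be done in hardware for the change to pay off.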
