Why do these Google image processing sample Renderscripts run slower on the GPU on a Nexus 5?

I'd like to thank Stephen for the very quick reply to my previous post. This is a follow-up to that question: "Why very simple Renderscript runs 3 times slower in GPU than in CPU".

My development platform is as follows:

Development OS: Windows 7 32-bit
Phone: Nexus 5
Phone OS version: Android 4.4
SDK bundle: adt-bundle-windows-x86-20131030
Build-tool version: 19
SDK tool version: 22.3
Platform tool version: 19

To evaluate the performance of Renderscript GPU compute, and to learn the general techniques for making code faster with Renderscript, I ran the following test.

I checked out the code from Google's Android Open Source Project using the tag android-4.2.2_r1.2. I used this tag simply because the ImageProcessing test sample is not available in newer versions.

I then used the project under "base\tests\RenderScriptTests\ImageProcessing" for the test. I recorded the run time of each test on the GPU as well as on the CPU; the results are listed below.

Test                        GPU      CPU
Levels Vec3 Relaxed         7.45ms   14.89ms
Levels Vec4 Relaxed         6.04ms   12.85ms
Levels Vec3 Full            N/A      28.97ms
Levels Vec4 Full            N/A      35.65ms
Blur radius 25              203.2ms  245.60ms
Greyscale                   7.16ms   11.54ms
Grain                       33.33ms  21.73ms
Fisheye Full                N/A      51.55ms
Fisheye Relaxed             92.90ms  45.34ms
Fisheye Approx Full         N/A      51.65ms
Fisheye Approx Relaxed      93.09ms  39.11ms
Vignette Full               N/A      44.17ms
Vignette Relaxed            8.02ms   46.68ms
Vignette Approx Full        N/A      45.04ms
Vignette Approx Relaxed     8.20ms   43.69ms
Convolve 3x3                37.66ms  16.81ms
Convolve 3x3 Intrinsics     N/A      4.57ms
ColorMatrix                 5.87ms   8.26ms
ColorMatrix Intrinsics      N/A      2.70ms
ColorMatrix Intrinsics Grey N/A      2.52ms
Copy                        5.59ms   2.40ms
CrossProcess(using LUT)     N/A      5.74ms
Convolve 5x5                84.25ms  46.59ms
Convolve 5x5 Intrinsics     N/A      9.69ms
Mandelbrot                  N/A      50.2ms
Blend Intrinsics            N/A      21.80ms

An N/A in the table means the test did not run on the GPU, because neither full-precision scripts nor RS intrinsics run on the GPU. Of the 13 algorithms that do run on the GPU, 6 run slower there than on the CPU. Since this code was written by Google, I think the phenomenon is worth investigating. At the very least, the statement "I assume the code will run faster on the GPU", which I saw in the question "Renderscript and the GPU", does not hold here.
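The "6 of 13" count can be checked directly from the table. The following is a minimal sketch in plain C (host-side, not Renderscript) that copies the 13 rows with a GPU timing and computes the CPU-to-GPU ratio for each. The type and function names (`Timing`, `gpu_speedup`, `count_gpu_slower`) are made up for illustration and are not part of the test app.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the timing table: milliseconds per run on each processor. */
typedef struct {
    const char *name;
    double gpu_ms;
    double cpu_ms;
} Timing;

/* The 13 tests from the table that produced a GPU number. */
const Timing kTimings[] = {
    {"Levels Vec3 Relaxed",       7.45,  14.89},
    {"Levels Vec4 Relaxed",       6.04,  12.85},
    {"Blur radius 25",          203.20, 245.60},
    {"Greyscale",                 7.16,  11.54},
    {"Grain",                    33.33,  21.73},
    {"Fisheye Relaxed",          92.90,  45.34},
    {"Fisheye Approx Relaxed",   93.09,  39.11},
    {"Vignette Relaxed",          8.02,  46.68},
    {"Vignette Approx Relaxed",   8.20,  43.69},
    {"Convolve 3x3",             37.66,  16.81},
    {"ColorMatrix",               5.87,   8.26},
    {"Copy",                      5.59,   2.40},
    {"Convolve 5x5",             84.25,  46.59},
};
const size_t kTimingCount = sizeof kTimings / sizeof kTimings[0];

/* Speedup of the GPU over the CPU; values below 1.0 mean the GPU lost. */
double gpu_speedup(const Timing *t) {
    assert(t != NULL && t->gpu_ms > 0.0);
    return t->cpu_ms / t->gpu_ms;
}

/* Number of rows where the GPU run took longer than the CPU run. */
size_t count_gpu_slower(const Timing *rows, size_t n) {
    size_t slower = 0;
    for (size_t i = 0; i < n; ++i) {
        if (gpu_speedup(&rows[i]) < 1.0) {
            ++slower;
        }
    }
    return slower;
}
```

With these numbers, Vignette Relaxed comes out at roughly 5.8x in the GPU's favor, while Convolve 3x3 comes out at roughly 0.45x, i.e. the GPU takes more than twice as long.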

I investigated some of the algorithms in the list and would like to mention two.

In Vignette, the GPU performs much better. I found that this is due to calls to several functions in rs_cl.rsh. If I comment out those calls, the CPU runs faster (see my previous question, mentioned at the top, for an extreme case). So the question is why this happens. Most of the functions in rs_cl.rsh are math related, e.g. exp, log, cos. Why do such functions run so much faster on the GPU? Is it because their implementations are actually highly parallel, or just because the version that runs on the GPU is better implemented than the version that runs on the CPU?

Another example is conv3x3 and conv5x5. Although there are cleverer implementations than the one in Google's test app, I think Google's implementation is certainly not bad. It tries to minimize the number of additions and uses helper functions from rs_cl.rsh such as convert_float4(). So at first glance I assumed it would run faster on the GPU. However, it runs a lot slower (on both the Nexus 4 and the Nexus 5, which both use Qualcomm GPUs). I think this example is very representative, because the algorithm needs to access the pixels near the current pixel, and that operation is common in image processing algorithms. If something like 2D convolution can't be made faster on the GPU, I suspect many other algorithms will suffer in the same way. I would highly appreciate it if you could identify where the problem is and suggest ways to make such algorithms faster.
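To make the access pattern concrete, here is a minimal reference sketch of a 3x3 convolution in plain C. It is not the implementation from the test app: it works on a single-channel float image, clamps coordinates at the border, and the names `clamp_index` and `convolve3x3` are made up for illustration. The point is only that every output pixel reads nine input pixels, so the cost of each load matters a great deal.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp an index into [0, limit - 1] so border pixels reuse the edge value. */
int clamp_index(int v, int limit) {
    if (v < 0) return 0;
    if (v >= limit) return limit - 1;
    return v;
}

/*
 * 3x3 convolution of a single-channel float image stored row-major.
 * coeff holds the 9 kernel weights, also row-major.
 * in and out must not alias, because neighbors are read after writes.
 */
void convolve3x3(const float *in, float *out, int width, int height,
                 const float coeff[9]) {
    assert(in != NULL && out != NULL && in != out);
    assert(width > 0 && height > 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            /* Nine loads per output pixel: the 3x3 neighborhood. */
            for (int ky = -1; ky <= 1; ++ky) {
                int sy = clamp_index(y + ky, height);
                for (int kx = -1; kx <= 1; ++kx) {
                    int sx = clamp_index(x + kx, width);
                    sum += in[(size_t)sy * (size_t)width + (size_t)sx]
                         * coeff[(ky + 1) * 3 + (kx + 1)];
                }
            }
            out[(size_t)y * (size_t)width + (size_t)x] = sum;
        }
    }
}
```

A 5x5 kernel follows the same shape with 25 loads per pixel, which is consistent with Convolve 5x5 being the more expensive of the two tests in the table on both processors.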

The more general question, given the test results above, is what criteria people should follow to get higher performance and avoid performance degradation as much as possible. After all, performance is the second most important goal of Renderscript, and I think the portability of RS is already quite good.

Thank you!

There are really two answers to this question.

1: Don't believe the hype regarding GPUs. For some workloads they are faster. However, for many workloads the difference is small or negative. You have at least two different processor types; don't worry about which one gets used, only worry about whether the performance is what you want.

2: For performance tuning I would really focus on the algorithm and on avoiding slow operations. Examples:

  • Prefer float to double when float provides adequate precision.

  • Use RS_FP_RELAXED when you don't need IEEE-754 compliance.

  • Prefer multiplication to division.

  • Use native_* (e.g. native_powr) in place of the full-precision routines where the precision is adequate.

  • Use rsGetElementAt_* over rsSample or rsGetElementAt. The typed versions of get are faster than the general get, and much faster than rsSample in many cases.

  • Loads from script globals are typically faster than loads from an rs_allocation. Prefer globals for kernel constants.

3: There are some performance issues with global loads today on the Nexus (4, 5, 7v2) GPU path. These will be improved with updates.
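Two of the arithmetic tips under point 2 (float over double, multiplication over division) can be illustrated outside Renderscript. The following is a rough sketch in plain C, with made-up function names `normalize_slow` and `normalize_fast`; it says nothing about how much time is saved on any particular device. In a Renderscript kernel the hoisted reciprocal would presumably be a script global set once from the host side, in line with the tip about kernel constants.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Straightforward version: double arithmetic and one division per pixel. */
void normalize_slow(const uint8_t *in, float *out, size_t n, float white) {
    assert(in != NULL && out != NULL && white > 0.0f);
    for (size_t i = 0; i < n; ++i) {
        out[i] = (float)((double)in[i] / (double)white);
    }
}

/* Tuned version: float only, with the reciprocal computed once. */
void normalize_fast(const uint8_t *in, float *out, size_t n, float white) {
    assert(in != NULL && out != NULL && white > 0.0f);
    const float inv_white = 1.0f / white;   /* the only division */
    for (size_t i = 0; i < n; ++i) {
        out[i] = (float)in[i] * inv_white;  /* multiply instead of divide */
    }
}
```

Multiplying by a reciprocal is not guaranteed to be bit-identical to dividing, so the two versions can differ in the last bits. That is the same kind of trade that relaxed precision and the native_* routines make: slightly less exact results in exchange for cheaper operations.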
