Techniques for optimising GPU utilisation when processing discrete images

Question

I have a server which applies filters (implemented as OpenGL shaders) to images. They are mostly direct colour mappings, but occasionally also blurs and other convolutions.

The source images are PNGs and JPGs in a variety of sizes, from e.g. 100x100 pixels up to 16,384x16,384 (the texture size for my GPU).

The pipeline is:

Decode image to RGBA (CPU)
        |
        V
Load texture to GPU
        |
        V
   Apply shader (GPU)
        |
        V
Unload to CPU memory
        |
        V
  Encode to PNG (CPU)

The mean GPU timings are approximately 0.75 ms to load, 1.5 ms to unload and 1.5 ms to process a texture.

I have multiple CPU threads decoding PNGs and JPGs to provide a continuous stream of work to the GPU.

The challenge is that watch -n 0.1 nvidia-smi reports GPU utilisation of mostly 0% - 1%, spiking to 18% periodically.

I really want to be getting more value out of the GPU, i.e. I'd like to see its load at least around 50%. My questions:

Is nvidia-smi giving a reasonable representation of how busy the GPU is? Does it, for example, include the time to load and unload textures? If not, is there a better metric I could be using?
Assuming that it is, and the GPU is sitting back doing nothing, are there any well-understood architectures for increasing throughput? I've considered tiling multiple images into one large texture, but this feels like it would blow out CPU usage rather than GPU usage.
Is there some way I could be loading the next image into GPU texture memory while the GPU is processing the previous image?

Answer 1

Sampling nvidia-smi is a really poor way of figuring out utilization. Use Nvidia Visual Profiler (I find this easiest to work with) or Nvidia Nsight to get a true picture of what your performance and bottlenecks are.

It's hard to say how to improve performance without seeing your code and without a better understanding of where the bottleneck is.

You say you have multiple CPU threads going, but do you have multiple CUDA streams so you can hide the latency of data transfer? This allows you to load data into the GPU while it is processing.
Are you sure you have threads and not processes? Threads might reduce overhead.
Applying a single shader on the GPU will take almost no time, so your pipeline might ultimately be limited by your hard drive's speed or your bus speed. Have you looked up these specs, measured the size of your images, and worked out a theoretical value for your maximum processing capability? Your GPU is likely to spend a lot of time idle unless you're doing a lot of complicated math on it.
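
The two estimates suggested above (what hiding transfer latency could buy, and what the bus allows in theory) can be roughed out before profiling. The following is only a back-of-the-envelope sketch in Python, not a measurement: the stage times are the means quoted in the question, while the bus bandwidth (the theoretical peak of PCIe 3.0 x16) and all function names are assumptions made for illustration. Substitute the figures for your own hardware.

```python
# Back-of-the-envelope model, not a measurement. Stage times come from the
# question; the bus bandwidth is an ASSUMED figure for illustration only.

LOAD_MS = 0.75     # mean host -> GPU upload time (from the question)
PROCESS_MS = 1.5   # mean shader time (from the question)
UNLOAD_MS = 1.5    # mean GPU -> host readback time (from the question)

# Assumption: PCIe 3.0 x16 theoretical peak, one direction, in bytes/second.
ASSUMED_BUS_BYTES_PER_S = 15.75e9


def serial_ms_per_image(load, process, unload):
    """No overlap: each image runs all three stages before the next starts."""
    return load + process + unload


def overlapped_ms_per_image(load, process, unload):
    """Full overlap: upload, compute and readback of different images run
    concurrently, so steady-state throughput is set by the slowest stage."""
    return max(load, process, unload)


def transfer_ms(width, height, bytes_per_pixel=4, bus=ASSUMED_BUS_BYTES_PER_S):
    """Lower bound on the time to move one RGBA image across the bus once."""
    return width * height * bytes_per_pixel / bus * 1000.0


serial = serial_ms_per_image(LOAD_MS, PROCESS_MS, UNLOAD_MS)
overlapped = overlapped_ms_per_image(LOAD_MS, PROCESS_MS, UNLOAD_MS)
# Share of the serial cycle spent actually running the shader.
busy_fraction = PROCESS_MS / serial

print(f"serial:     {serial:.2f} ms/image, {1000 / serial:.0f} images/s")
print(f"overlapped: {overlapped:.2f} ms/image, {1000 / overlapped:.0f} images/s")
print(f"shader share of the serial cycle: {busy_fraction:.0%}")
print(f"16384x16384 RGBA, one way: {transfer_ms(16384, 16384):.1f} ms")
```

With the question's figures the model gives 3.75 ms per image when the stages run back to back and 1.5 ms per image with full overlap. It also suggests that if the GPU were fed continuously, the shader alone would fill roughly 40% of the serial cycle, so a reading near 0% - 1% would point to the GPU waiting on something upstream (decoding, disk or bus) rather than to the GPU stages themselves. A profiler is still needed to confirm where the time actually goes.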

Answer 1
2 Accepted 2019-11-06 23:48:13
