"Intrinsics" possible on GPU on OpenGL?

I had this idea for something "intrinsic-like" on OpenGL, but googling around brought no results.

So basically I have a Compute Shader for calculating the Mandelbrot set (each thread does one pixel). Part of my main function in GLSL looks like this:

float XR, XI, XR2, XI2, CR, CI;
uint i;
// Map this invocation's pixel coordinate to a point C in the complex plane.
CR = float(minX + gl_GlobalInvocationID.x * (maxX - minX) / ResX);
CI = float(minY + gl_GlobalInvocationID.y * (maxY - minY) / ResY);
XR = 0.0;
XI = 0.0;
// Iterate Z = Z^2 + C until the orbit escapes or MaxIter is reached.
for (i = 0; i < MaxIter; i++)
{
    XR2 = XR * XR;
    XI2 = XI * XI;
    XI = 2.0 * XR * XI + CI;   // imaginary part: 2*Re(Z)*Im(Z) + Im(C)
    XR = XR2 - XI2 + CR;       // real part: Re(Z)^2 - Im(Z)^2 + Re(C)
    if ((XR * XR + XI * XI) > 4.0)
    {
        break;                 // |Z|^2 > 4: the orbit has escaped
    }
}

So my thought was using vec4s instead of floats, thereby doing 4 calculations/pixels at once and hopefully getting a 4x speed boost (analogous to "real" CPU intrinsics). But my code seems to run MUCH slower than the float version. There are still some mistakes in there (if anyone would still like to see the code, please say so), but I don't think they are what slows the code down. Before I spend ages trying, can anybody tell me right away whether this endeavour is futile?
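For reference, a packed variant along those lines might look like the sketch below. This is not the asker's actual vec4 code (which isn't shown); the alive mask and iter counter are illustrative names, and it assumes four horizontally adjacent pixels per invocation. Note that break can no longer fire per pixel: the loop has to keep running until all four lanes have escaped, which is one reason such a packed version can end up slower.

vec4 XR = vec4(0.0), XI = vec4(0.0), XR2, XI2;
// Four horizontally adjacent pixels share one invocation.
vec4 CR = vec4(minX) + (4.0 * float(gl_GlobalInvocationID.x) + vec4(0.0, 1.0, 2.0, 3.0)) * (maxX - minX) / ResX;
vec4 CI = vec4(float(minY + gl_GlobalInvocationID.y * (maxY - minY) / ResY));
vec4 alive = vec4(1.0);   // 1.0 while a lane is still iterating, 0.0 once it escapes
uvec4 iter = uvec4(0u);   // per-lane iteration counts
for (uint i = 0u; i < MaxIter; i++)
{
    XR2 = XR * XR;
    XI2 = XI * XI;
    XI = 2.0 * XR * XI + CI;
    XR = XR2 - XI2 + CR;
    // lessThanEqual yields a per-lane bvec4; as a vec4 it is 1.0/0.0, so the
    // mask latches to 0.0 for a lane once it escapes (even if it later hits NaN).
    alive *= vec4(lessThanEqual(XR * XR + XI * XI, vec4(4.0)));
    iter += uvec4(alive);
    if (all(equal(alive, vec4(0.0))))
    {
        break;            // only when every lane has escaped
    }
}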

CPUs and GPUs work quite differently.

CPUs need explicit vectorization in the machine code, either coded explicitly by the programmer (through what you call "CPU intrinsics") or vectorized automatically by the compiler.

GPUs, on the other hand, vectorize by running multiple invocations of your shader (aka kernel) on their cores.
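To make that concrete: in the asker's setup the parallelism is chosen through the work-group size and the dispatch dimensions on the host, not inside the shader. The 16x16 group size below is an illustrative choice, not taken from the question:

// Each invocation computes one pixel; a host-side call such as
// glDispatchCompute(ResX / 16, ResY / 16, 1) then launches one
// invocation per pixel, and the driver packs neighbouring invocations
// into hardware SIMD waves on its own. That packing is the GPU's
// equivalent of CPU vectorization.
layout(local_size_x = 16, local_size_y = 16) in;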

AFAIK, on modern GPUs, additional vectorization within a thread is neither needed nor supported: instead of manufacturing a single core that can add, say, 4 floats at a time, it's more beneficial to have four times as many simpler cores, each operating on a single float at a time. The reason this is better is that for code working with vectors, you'd still get the same throughput either way. For code that operates on scalars, however, the circuitry of a wide vector unit would sit idle; split across multiple simpler cores, it can instead execute multiple instances of your shader. And most code, out of necessity, contains at least some scalar computations.

The bottom line is that your code is already likely to utilize the GPU's resources to the maximum.
