
Tiled rendering compute shader light culling and shading

I'm trying to implement tiled deferred rendering in OpenGL/GLSL and I'm stuck on light culling.

My GPU is fairly old (AMD Radeon 6490M) and, for reasons I can't explain, compute shaders hang in an infinite loop when they call atomic operations on shared variables, so I couldn't compute the minimum and maximum depth in a compute shader. That operation isn't very time-consuming anyway, so I do it in a fragment shader instead.

Then, for every visible point light (in view space), I compute a screen-space bounding quad. Now I want to use a single compute shader for both light culling and shading. The problem is that, as mentioned above, I can't use atomic operations on shared variables, so I can't build the per-tile light list or store the light count for the tile.

I can't find any other way to do this. Any idea how to cull lights and build the tile light lists without atomics?

Here is the pseudocode of my compute shader:

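#version 430

#define MAX_LIGHTS  1024
#define TILE_SIZE   32
#define RX  1280
#define RY  720

struct Light {
    vec4 position;
    vec4 quad;
    vec3 color;
    float radius;
};

uint getTilesXCount(){
    return uint((RX + TILE_SIZE - 1) / TILE_SIZE);
}

uint getTilesYCount(){
    return uint((RY + TILE_SIZE - 1) / TILE_SIZE);
}

layout (binding = 0, rgba16f) uniform readonly image2D minMaxTex;
layout (binding = 1, rgba16f) uniform readonly image2D diffTex;
layout (binding = 2, rgba16f) uniform readonly image2D specTex;
// output image: missing from the original listing, name and binding are assumptions
layout (binding = 4, rgba16f) uniform writeonly image2D outTex;

layout (std430, binding = 3) buffer pointLights {
    Light Lights[];
};

// tile light list & light count
shared uint lightIDs[MAX_LIGHTS];
shared uint lightCount;   // shared variables cannot have initializers

uniform uint totalLightCount;

layout (local_size_x = TILE_SIZE, local_size_y = TILE_SIZE) in;

// pseudocode helpers, defined elsewhere (not shown in the question)
bool testLightBounds(ivec2 tileOrigin, vec4 quad);
vec4 doLight();

void main(void){

    ivec2 pixel = ivec2(gl_GlobalInvocationID.xy);
    // top-left pixel of the tile handled by this work group
    ivec2 tileOrigin = ivec2(gl_WorkGroupID.xy * gl_WorkGroupSize.xy);

    // one invocation resets the counter before any other invocation uses it
    if (gl_LocalInvocationIndex == 0u) {
        lightCount = 0u;
    }
    memoryBarrierShared();
    barrier();

    // get minimum & maximum depth for tile
    // (assumes minMaxTex is full resolution; use gl_WorkGroupID.xy if it holds one texel per tile)
    vec2 minMax = imageLoad(minMaxTex, tileOrigin).xy;

    uint threadCount = uint(TILE_SIZE * TILE_SIZE);
    uint passCount = (totalLightCount + threadCount - 1u) / threadCount;

    for (uint i = 0u; i < passCount; i++) {

        uint lightIndex = i * threadCount + gl_LocalInvocationIndex;

        // prevent overrun by clamping to a last "null" light
        lightIndex = min(lightIndex, totalLightCount);

        Light l = Lights[lightIndex];

        // test the light's quad against the whole tile, not this invocation's pixel
        if (testLightBounds(tileOrigin, l.quad)) {

            if ((minMax.y < (l.position.z + l.radius))
                &&
                (minMax.x > (l.position.z - l.radius))) {

                // this is the atomic on a shared variable that hangs on my GPU
                uint index = atomicAdd(lightCount, 1u);
                lightIDs[index] = lightIndex;
            }
        }
    }

    // make the finished light list visible to every invocation in the tile
    memoryBarrierShared();
    barrier();

    // do lighting for actual tile
    vec4 color = doLight();

    // RY is not a multiple of TILE_SIZE, so skip pixels past the image edge
    if (pixel.x < RX && pixel.y < RY) {
        imageStore(outTex, pixel, color);
    }
}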

I haven't actually implemented tiled deferred rendering, but I think you can approach this in a way similar to building a particle neighbor list for a simulation.

  • Have your compute shader build a tuple containing the light id and the cell id, and store it in a buffer using the current thread as the index.
  • Sort that buffer by cell id using your favourite GPU algorithm (radix sort or bitonic sort).
  • Once the buffer is sorted, build a histogram and do a prefix sum scan to find where each cell starts within the buffer.

Ex.

(Cell, Light)
1st pass: Cell Buffer -> [ 23, 0 ] [ 7, 1 ] [ 9, 2 ] ....
2nd pass: Cell Buffer -> [ 7, 1 ] [ 9, 2 ] [ 23, 0 ] ....

(Start, End)
3rd pass: Index Buffer -> [0 0] [0 0] [0 0] [0 0] [0 0] [0 0] [0 1] [1 1] [1 2] ...
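
The three passes can be modeled on the CPU to check the bookkeeping. The following is a minimal Python sketch, not GPU code: `sorted` stands in for the GPU radix or bitonic sort, and the function name `build_cell_ranges` and the `(start, end)` convention (end exclusive, empty cells get `start == end`) are assumptions made for this illustration, so the ranges it prints need not match the index buffer in the example entry for entry.

```python
from itertools import accumulate

def build_cell_ranges(pairs, num_cells):
    """pairs: list of (cell_id, light_id) tuples, i.e. the output of the 1st pass."""
    # 2nd pass: sort by cell id (stand-in for a GPU radix/bitonic sort).
    sorted_pairs = sorted(pairs, key=lambda p: p[0])

    # 3rd pass, step 1: histogram of how many entries fall into each cell.
    counts = [0] * num_cells
    for cell, _ in sorted_pairs:
        counts[cell] += 1

    # 3rd pass, step 2: exclusive prefix sum gives each cell's start offset.
    starts = [0] + list(accumulate(counts))[:-1]
    ranges = [(s, s + c) for s, c in zip(starts, counts)]
    return sorted_pairs, ranges

pairs = [(23, 0), (7, 1), (9, 2)]
sorted_pairs, ranges = build_cell_ranges(pairs, num_cells=24)
print(sorted_pairs)                       # [(7, 1), (9, 2), (23, 0)]
print(ranges[7], ranges[9], ranges[23])   # (0, 1) (1, 2) (2, 3)
```

The histogram loop here is sequential. On the GPU, because the buffer is already sorted, each thread can instead compare its cell id with its neighbor's to detect where a cell's run begins and ends, which should avoid atomics for this step as well.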

For more details, the method is described in Simon Green's "Particle simulation using CUDA": http://idav.ucdavis.edu/~dfalcant/downloads/dissertation.pdf

The original method assumes that a particle can only be placed within a single cell, but you should be able to work around this easily by using a bigger workload, for example by emitting one tuple for every cell that a light overlaps.
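
One way to handle a light that touches several cells is sketched below as a Python model. It rests on assumptions not found in the answer: each light carries an integer pixel-space bounding quad `(x_min, y_min, x_max, y_max)` with exclusive maxima, tiles are 32 pixels wide on a 1280x720 target as in the question, and the names `tiles_covered` and `expand_lights` are made up for the example. A count pass followed by an exclusive prefix sum gives every light its own write offset, so the scatter needs no atomic counter.

```python
from itertools import accumulate

TILE_SIZE = 32
RX, RY = 1280, 720
TILES_X = (RX + TILE_SIZE - 1) // TILE_SIZE   # tiles per row, rounded up

def tiles_covered(quad):
    """Ids of the tiles overlapped by quad = (x_min, y_min, x_max, y_max), clipped to the screen."""
    x0 = max(quad[0], 0) // TILE_SIZE
    y0 = max(quad[1], 0) // TILE_SIZE
    x1 = (min(quad[2], RX) - 1) // TILE_SIZE
    y1 = (min(quad[3], RY) - 1) // TILE_SIZE
    return [ty * TILES_X + tx
            for ty in range(y0, y1 + 1)
            for tx in range(x0, x1 + 1)]

def expand_lights(quads):
    """Emit one (cell_id, light_id) tuple per tile each light overlaps."""
    # Count pass: how many tuples will each light write?
    per_light = [tiles_covered(q) for q in quads]
    counts = [len(t) for t in per_light]

    # Exclusive prefix sum: first output slot owned by each light.
    offsets = [0] + list(accumulate(counts))[:-1]

    # Scatter pass: every light writes only into its own slice.
    out = [None] * sum(counts)
    for light_id, (tiles, off) in enumerate(zip(per_light, offsets)):
        for k, cell in enumerate(tiles):
            out[off + k] = (cell, light_id)
    return out

quads = [(0, 0, 64, 32), (40, 40, 50, 50)]
print(expand_lights(quads))   # [(0, 0), (1, 0), (41, 1)]
```

The resulting buffer can then be sorted by cell id and scanned in the same way as in the single-cell case.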
