Optimizing a raytracing shader in GLSL

Question

I have written a voxelization-based raytracer that works as expected but is very slow.

Currently the raytracer code is as follows:

#version 430 
//normalized position from (-1, -1) to (1, 1)
in vec2 f_coord;

out vec4 fragment_color;

struct Voxel
{
    vec4 position;
    vec4 normal;
    vec4 color;
};

struct Node
{
    //children of the current node
    int children[8];
};

layout(std430, binding = 0) buffer voxel_buffer
{
    //last layer of the tree, the leafs
    Voxel voxels[];
};
layout(std430, binding = 1) buffer buffer_index
{
    uint index;
};
layout(std430, binding = 2) buffer tree_buffer
{
    //tree structure       
    Node tree[];
};
layout(std430, binding = 3) buffer tree_index
{
    uint t_index;
};

uniform vec3 camera_pos; //position of the camera
uniform float aspect_ratio; // aspect ratio of the window
uniform float cube_dim; //Dimensions of the voxelization cube
uniform int voxel_resolution; //Side length of the cube in voxels

#define EPSILON 0.01
// Detect whether a position is inside of the voxel with size size located at corner
bool inBoxBounds(vec3 corner, float size, vec3 position)
{
    bool inside = true;
    position-=corner;//coordinate of the position relative to the box coordinate system
    //Test that all coordinates are inside the box; if any is outside, the point is outside the box
    for(int i=0; i<3; i++)
    {
        inside = inside && (position[i] > -EPSILON);
        inside = inside && (position[i] < size+EPSILON);
    }

    return inside;
}

//Get the distance to a box or infinity if the box cannot be hit
float boxIntersection(vec3 origin, vec3 dir, vec3 corner0, float size)
{
    dir = normalize(dir);
    vec3 corner1 = corner0 + vec3(size,size,size);//Opposite corner of the box

    float coeffs[6];
    //Calculate the intersection coefficients with the 6 bounding planes
    coeffs[0] = (corner0.x - origin.x)/(dir.x);
    coeffs[1] = (corner0.y - origin.y)/(dir.y);
    coeffs[2] = (corner0.z - origin.z)/(dir.z);

    coeffs[3] = (corner1.x - origin.x)/(dir.x);
    coeffs[4] = (corner1.y - origin.y)/(dir.y);
    coeffs[5] = (corner1.z - origin.z)/(dir.z);
    //by default the distance to the box is infinity
    float t = 1.f/0.f;

    for(uint i=0; i<6; i++){
        //if the distance to a plane is negative, we set it to infinity as we cannot travel in the negative direction
        coeffs[i] = coeffs[i] < 0 ? 1.f/0.f : coeffs[i];
        //The distance is the minimum of the previously calculated distance and the current distance
        t = inBoxBounds(corner0,size,origin+dir*coeffs[i]) ? min(coeffs[i],t) : t;
    }

    return t;
}

#define MAX_TREE_HEIGHT 11
int nodes[MAX_TREE_HEIGHT];
int levels[MAX_TREE_HEIGHT];
vec3 positions[MAX_TREE_HEIGHT];
int sp=0;

void push(int node, int level, vec3 corner)
{
    nodes[sp] = node;
    levels[sp] = level;
    positions[sp] = corner;
    sp++;
}

void main()
{   
    int count = 0; //count the iterations of the algorithm
    vec3 r = vec3(f_coord.x, f_coord.y, 1.f/tan(radians(40))); //direction of the ray
    r.y/=aspect_ratio; //modify the direction based on the windows aspect ratio
    vec3 dir = r;
    r += vec3(0,0,-1.f/tan(radians(40))) + camera_pos; //put the ray at the camera position

    fragment_color = vec4(0);
    int max_level = int(log2(voxel_resolution));//height of the tree
    push(0,0,vec3(-cube_dim));//set the stack
    float tc = 1.f; //initial color value, to be decreased whenever a voxel is hit
    //tree variables
    int level=0;
    int node=0;
    vec3 corner;

    do
    {
        //pop from stack
        sp--;
        node = nodes[sp];
        level = levels[sp];
        corner = positions[sp];

        //set the size of the current voxel 
        float size = cube_dim / pow(2,level);
        //set the corners of the children
        vec3 corners[] =
            {corner,                        corner+vec3(0,0,size),
            corner+vec3(0, size,0),         corner+vec3(0,size,size),
            corner+vec3(size,0,0),          corner+vec3(size,0,size),
            corner+vec3(size,size,0),       corner+vec3(size,size,size)};

        float coeffs[8];
        for(int child=0; child<8; child++)
        {
            //Test non-zero children; zero children are empty and thus should be discarded
            coeffs[child] = tree[node].children[child]>0?
                //Get the distance to your child if it's not empty or infinity if it's empty
                boxIntersection(r, dir, corners[child], size) : 1.f/0.f;
        }
        int indices[8] = {0,1,2,3,4,5,6,7};
        //sort the children from closest to farthest
        for(uint i=0; i<8; i++)
        {
            for(uint j=i; j<8; j++)
            {
                if((coeffs[j] < coeffs[i]))
                {
                    float swap = coeffs[i];
                    coeffs[i] = coeffs[j];
                    coeffs[j] = swap;

                    int iSwap = indices[i];
                    indices[i] = indices[j];
                    indices[j] = iSwap;

                    vec3 vSwap = corners[i];
                    corners[i] = corners[j];
                    corners[j] = vSwap;
                }
            }
        }
        //push to stack
        //signed counter: with uint, i>=0 is always true and i-- wraps around past zero
        for(int i=7; i>=0; i--)
        {
            if(!isinf(coeffs[i]))
            {
                push(tree[node].children[indices[i]],
                    level+1, corners[i]);
            }
        }
        count++;
    }while(level < (max_level-1) && sp>0);
    //set color
    fragment_color = vec4(count)/100;
}

As it may not be fully clear what this does, let me explain.

We check ray-box intersections starting with one big cube. If the ray hits it, we test for intersection with the 8 cubes that compose it.

If we hit any of those, we check intersections with the 8 cubes that make up that cube.

In 2D this would look as follows:

In this case we have 4 layers: we check the big box first, then the ones colored in red, then the ones colored in green, and finally the ones colored in blue.

Printing out the number of times the raytracing step executed as a color (which is what the code snippet I have provided does) results in the following image:

As you can see, most of the time the shader doesn't do more than 100 iterations.

However, this shader takes 200,000 microseconds to execute on average on a GTX 1070.

Since the issue is not the number of iterations, my problem is likely related to thread execution.

Does anyone know how I could optimize this code? The biggest bottleneck seems to be the use of a stack.

If I run the same code without pushing to the stack (generating wrong output), there is a 10-fold improvement in runtime.

Answer 1

It seems you test the ray for intersection against most of the voxels in each level of the octree, and also sort them (by distance) at each level. I propose another approach.

If the ray intersects the bounding box (level 0 of the octree), it does so at two faces of the box, or at a corner or an edge; these are "corner" cases.

Finding the 3D ray-plane intersection can be done like here. Finding if the intersection is inside the face (quad) can be done by testing if the point is inside one of the two triangles of the face, like here.

Get the farthest intersection I0 from the camera. Also let r be a unit vector of the ray in the direction from I0 toward the camera.

Find the deepest voxel for the I0 coordinates. This is the farthest voxel from the camera.
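As a rough sketch of the "find the deepest voxel" step, assuming the tree layout from the question (child index 0 means empty, children ordered like the corners[] initializer, root corner at -cube_dim with children of side cube_dim). The helper name locateCell and its depth parameter are made up for illustration, and how the deepest indices map to voxels[] is not shown in the question, so that part is left to the caller:

```glsl
#version 430
// Declarations repeated from the question so the sketch stands alone.
struct Node { int children[8]; };
layout(std430, binding = 2) buffer tree_buffer { Node tree[]; };
uniform float cube_dim;

// Hypothetical helper: descend `depth` levels toward the cell containing p.
// Returns the child index stored for that cell, or 0 if the cell is empty.
// cellCorner and cellSize describe the cell that was reached either way.
int locateCell(vec3 p, int depth, out vec3 cellCorner, out float cellSize)
{
    int node = 0;
    cellCorner = vec3(-cube_dim);   // root corner, as pushed in the question
    cellSize = 2.0 * cube_dim;      // root side: two children of side cube_dim
    for (int level = 0; level < depth; level++)
    {
        cellSize *= 0.5;
        // 1 on each axis where p lies in the upper half of the current cell
        ivec3 upper = ivec3(greaterThanEqual(p, cellCorner + vec3(cellSize)));
        cellCorner += vec3(upper) * cellSize;
        // Child order matches the question's corners[]: x*4 + y*2 + z
        node = tree[node].children[upper.x * 4 + upper.y * 2 + upper.z];
        if (node <= 0)
            return 0;               // empty cell, nothing deeper to visit
    }
    return node;
}
```

When the function returns 0, cellCorner and cellSize still describe the empty cell, which is what the ray needs in order to skip over it in one step.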

Now we want the exit coordinates I0e for the ray in that voxel, through another face. While you could again do the calculations for all 6 faces, if your voxels are X,Y,Z aligned and you define the ray in the same coordinate system as the octree, then the calculations simplify a lot.

Apply a little displacement (e.g. 1/1000 of the smallest voxel size) to I0e along the ray's unit vector r: I1 = I0e + r/1000. Find the voxel for this I1. This is the next voxel in the sorted list of voxel-ray intersections.

Repeat finding I1e, then I2, then I2e, then I3, etc. until the bounding box is exited. The list of crossed voxels is sorted.

Working with the octree can be optimized depending on how you store its info: all possible nodes or just the used ones; nodes with data, or just "pointers" to another container with the data. This is a matter for another question.
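To illustrate how much the exit computation simplifies for axis-aligned voxels, here is a minimal sketch. The function names and the minVoxelSize parameter are hypothetical, dir is assumed to be normalized with no component exactly zero, and p is assumed to lie inside the voxel:

```glsl
// Distance along the ray from p (inside the voxel) to the voxel's exit face.
float exitDistance(vec3 p, vec3 dir, vec3 voxelMin, float voxelSize)
{
    // On each axis, pick the face the ray is moving toward:
    // the upper face where dir >= 0, the lower face otherwise
    vec3 farPlane = voxelMin + step(vec3(0.0), dir) * voxelSize;
    vec3 t = (farPlane - p) / dir;
    // The ray leaves through whichever face it reaches first
    return min(min(t.x, t.y), t.z);
}

// Point just past the exit face: the I1 = I0e + displacement step above,
// with the displacement set to 1/1000 of the smallest voxel size.
vec3 nextSamplePoint(vec3 p, vec3 dir, vec3 voxelMin, float voxelSize,
                     float minVoxelSize)
{
    float tExit = exitDistance(p, dir, voxelMin, voxelSize);
    return p + dir * (tExit + minVoxelSize * 0.001);
}
```

The same two functions work in either marching direction; only the sign of dir changes. Three divisions replace the six plane tests plus inside checks.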

Answer 2

The first thing that stands out is your box intersection function. Have a look at Inigo Quilez's procedural box function for a much faster version. Since your box size is uniform in all axes and you don't need outNormal, you can get an even lighter version. In essence, use maths instead of the brute-force approach that tests each box plane.
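For reference, a minimal sketch of the standard slab test, which is in the same spirit as that suggestion but is not Inigo Quilez's exact function. It keeps the question's corner-plus-size convention and its use of infinity for a miss; the name boxIntersectionSlab and the invDir parameter (assumed to be 1.0 / dir, computed once per ray from a normalized dir) are mine:

```glsl
// Hypothetical replacement for boxIntersection(); returns the distance to the
// box along the ray, or infinity if the box is missed.
float boxIntersectionSlab(vec3 origin, vec3 invDir, vec3 corner0, float size)
{
    // Parametric distances to the three "low" and three "high" planes
    vec3 t0 = (corner0 - origin) * invDir;
    vec3 t1 = (corner0 + vec3(size) - origin) * invDir;

    // Per axis: where the ray enters and where it leaves that slab
    vec3 tNear = min(t0, t1);
    vec3 tFar  = max(t0, t1);

    float tEnter = max(max(tNear.x, tNear.y), tNear.z);
    float tExit  = min(min(tFar.x,  tFar.y),  tFar.z);

    // Miss: the slabs do not overlap, or the box is entirely behind the origin
    if (tEnter > tExit || tExit < 0.0)
        return 1.0 / 0.0; // infinity, same convention as the question

    // If the origin is inside the box tEnter is negative, so use the exit distance
    return tEnter >= 0.0 ? tEnter : tExit;
}
```

One caveat: if a component of dir is exactly zero and the origin lies exactly on the matching plane, 0 * infinity produces NaN, so a robust version would need to handle that case separately.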

Also, try to avoid temporary storage where possible. For example, the corners array could be computed on demand for each octree box. Of course, with the above suggestion, these will be changed to box centers.
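A small sketch of computing a corner on demand instead of storing the corners array. The function name childCorner is hypothetical; the bit layout follows the order of the corners[] initializer in the question:

```glsl
// Corner of child `child` (0..7) of a node located at `corner`,
// where each child has side length `size`.
// Bit 2 selects +x, bit 1 selects +y, bit 0 selects +z.
vec3 childCorner(vec3 corner, float size, int child)
{
    vec3 offset = vec3((child >> 2) & 1, (child >> 1) & 1, child & 1);
    return corner + offset * size;
}
```

With this, the sort only has to move the child indices around, and the vec3 swaps in the inner loop go away.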

Since nodes, levels and positions are always accessed together, try co-locating them in a new single struct and access them as a single unit.

Will look more later...
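Something along these lines, as a sketch. The struct name StackEntry and the MAX_STACK capacity are placeholders; the capacity in particular is an assumption and has to be sized for the real tree, since each pop can push up to 8 children:

```glsl
#define MAX_STACK 80   // hypothetical capacity, size it for your tree depth

struct StackEntry
{
    vec3 corner; // corner of the node's box
    int  node;   // index into tree[]
    int  level;  // depth of the node in the tree
};

StackEntry stack[MAX_STACK];
int sp = 0;

void push(int node, int level, vec3 corner)
{
    stack[sp] = StackEntry(corner, node, level);
    sp++;
}

StackEntry pop()
{
    sp--;
    return stack[sp];
}
```

The loop body then starts with a single `StackEntry e = pop();` instead of three separate array reads.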

Answer 3

Thread execution on a GPU may be massively parallel, but that doesn't mean that all threads run independently of one another. Groups of threads execute exactly the same instructions; the only difference is the input data. That means branches, and therefore loops, can't make a single thread diverge in execution, so a thread also can't terminate early on its own: it waits until the longest-running thread in its group is done.

Your example shows the most extreme edge case of this: there is a high likelihood that, within a group of threads, all the work being done is relevant to only one thread.

To alleviate this, you should try to reduce the difference in execution length (iterations in your case) for threads in a group (or in total). This can be done by setting a limit on the number of iterations per shader pass and rescheduling only those threads/pixels that need more iterations.
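A very rough sketch of what a budgeted pass could look like. Everything here is an assumption for illustration: the ray_states buffer and its binding, the iteration_budget and screen_width uniforms, the MAX_STACK capacity, and traverseStep, which is only a placeholder for one pop/test/push step of the loop in the question. The rescheduling of unfinished pixels happens on the host side and is not shown:

```glsl
#version 430
#define MAX_STACK 80            // hypothetical capacity

struct RayState
{
    int sp;                     // saved stack pointer, 0 once the ray is finished
    int count;                  // iterations spent so far
    int nodes[MAX_STACK];       // saved traversal stack
};

layout(std430, binding = 4) buffer ray_states   // hypothetical binding
{
    RayState states[];
};

uniform int iteration_budget;   // hypothetical: max iterations per pass
uniform int screen_width;       // hypothetical: used to index states[]

out vec4 fragment_color;

// Placeholder for one traversal step; a real version would contain
// the body of the do/while loop from the question.
void traverseStep(inout RayState s)
{
    s.sp--;
    s.count++;
}

void main()
{
    int pixel = int(gl_FragCoord.y) * screen_width + int(gl_FragCoord.x);
    RayState s = states[pixel];

    // No thread in a group runs more than iteration_budget iterations per pass
    for (int i = 0; i < iteration_budget && s.sp > 0; i++)
        traverseStep(s);

    states[pixel] = s;          // persist so a later pass can resume
    fragment_color = vec4(s.count) / 100.0;
}
```

Saving and restoring the state costs memory bandwidth, so whether this wins depends on how uneven the iteration counts are within a group.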

3 answers

Answer 1
2 2018-06-12 18:09:02

Answer 2
2 2018-06-13 19:01:35

Answer 3
1 accepted 2018-06-18 23:14:29
