
OpenCL - reduction in speed as the number of work-items increases

I'm trying to calculate crc32 for multiple threads. I'm trying to use OpenCL. The GPU code is:

__kernel void crc32_Sarwate( __global int *lenghtIn,
                             __global unsigned char *In,
                             __global int *OutCrc32,
                             int size ) {
    int i, j, len;

    i = get_global_id( 0 );
    if( i >= size )
        return;
    len = j = 0;
    while( j != i )
        len += lenghtIn[ j++ ];
    OutCrc32[ i ] = crc32( In + len, lenghtIn[ i ] );
}

I received these results (times) with a thousand repetitions:
for 4 work-items: 29.82
for 8 work-items: 29.9
for 16 work-items: 35.16
for 32 work-items: 35.93
for 64 work-items: 38.69
for 128 work-items: 52.83
for 256 work-items: 152.08
for 512 work-items: 333.63

I have Intel HD Graphics at 350 MHz with 3 work-groups of 256 work-items each. I assumed that increasing the number of work-items from 128 to 256 would cause only a slight increase in time, but the time tripled. Why? (I'm sorry for my very bad English.)

The

while( j != i )
    len += lenghtIn[ j++ ];

part runs get_global_id( 0 ) times.

When it is 128, the last work item to complete does 128 loop iterations.

When it is 256, it does 256 iterations, which is a 100% increase from memory's point of view, but only for the last work item. When we add up all work items' total memory accesses:

 1 item:  reads 0 to 0                        --->  1 access
 2 items: read 0 to 0 and 0 to 1              --->  3 accesses
 4 items: read 0 to 0, 0 to 1, 0 to 2, 0 to 3 ---> 10 accesses
 8 items:   SUM(1 to 8)   =>    36 accesses
 16 items:  SUM(1 to 16)  =>   136 accesses (even more than +200%)
 32 items:                =>   528 (~ 400%)
 64 items:                =>  2080 (~ 400%)
 128 items:               =>  8256 (~ 400%) (the cache of your iGPU starts failing here)
 256 items:               => 32896 (~ 400%) (now caching is saturated and you start
                              seeing ~400% per doubling of work items)

 512 items: the second compute unit is used too, but ~400% as much work is done,
            so it does not cost only ~200% more time.
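
These counts are just triangular numbers, so every doubling of the work-item count roughly quadruples the total number of lenghtIn reads:

 total(N)  = 1 + 2 + ... + N = N * (N + 1) / 2
 total(2N) / total(N) = 2 * (2N + 1) / (N + 1)  ->  about 4 for large N
 e.g. total(256) / total(128) = 32896 / 8256 ≈ 3.98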

So each time you increase the number of work items by 100%, you increase the total memory accesses to about 400% of what they were. Caching helps up to some degree, but once you cross its capacity, memory accesses slow down badly. Also, the execution overhead (drivers, ...) becomes unimportant.

You are accessing memory in a serialized, non-parallel way. You would need to cache it first, but that may not be possible on this hardware, so you should distribute the job equally among work items and make memory accesses contiguous between cores (vectorize). This should give more performance.

For now, each vector unit does:

unit        :   v0 v1 v2 v3 v4 ... v7
read address:   0  0  0  0  0      0
                -  1  1  1  1      1
                -  -  2  2  2      2
                -  -  -  3  3      3
                -  -  -  -  4      4
                     ....  
                -  -  -  -  - ...  7

done in 8 steps on 8 streaming cores.

At the last step, only a single work item is actually computing something. It should instead be something like:

Some Optimization

unit        :   v0 v1 v2 v3  (no other work items needed)
read address:   0  0  0  0  \
                1  1  1  1   \
                2  2  2  2    \
                3  3  3  3    / this is the 5th work item's work
                4  4  4  4   /
                5  5  5  0   \
                6  6  0  1    \ this is 0 to 3 as the 4th item's work
                7  0  1  2    /
 first item<--  0  1  2  3   /

done in 8 steps on only 4 streaming cores, doing the same job as in the first half (probably faster).
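
In kernel form, that folding could look roughly like the sketch below (my interpretation of the diagram, untested): work item t handles entry t and entry size-1-t, reusing its running offset so the long prefixes get paired with the short ones. It reuses the asker's identifiers and crc32() helper, and assumes an even size and a global work size of size/2.

__kernel void crc32_folded( __global int *lenghtIn,
                            __global unsigned char *In,
                            __global int *OutCrc32,
                            int size ) {
    // launched with size/2 work items; assumes size is even and that the
    // crc32() device function from the original kernel is available
    int t = get_global_id( 0 );
    if( t >= size / 2 )
        return;

    int j, len = 0;

    // offset of entry t = sum of the first t lengths
    for( j = 0; j < t; j++ )
        len += lenghtIn[ j ];
    OutCrc32[ t ] = crc32( In + len, lenghtIn[ t ] );

    // keep accumulating up to entry size-2-t: that gives the offset of
    // entry size-1-t, so a short prefix is always paired with a long one
    for( ; j <= size - 2 - t; j++ )
        len += lenghtIn[ j ];
    OutCrc32[ size - 1 - t ] = crc32( In + len, lenghtIn[ size - 1 - t ] );
}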

Further Optimization Suggestion

I think it would be better to run a prefix-scan (prefix-sum) algorithm in another kernel before getting to the crc32() part (probably in just 3 steps for this example instead of 8, and also more efficient).

Using precomputed values of

while( j != i )
        len += lenghtIn[ j++ ];

should make the crc32 step immune to the current algorithm's O(n²) complexity.
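
As a sketch of that idea (assumed kernel names, not tested): one kernel computes the exclusive prefix sum of lenghtIn into an offsets buffer, and the crc32 kernel then only reads its own offset. The scan below is a naive single-work-group Hillis-Steele version, launched with exactly n work items in one work-group and given a __local buffer of n ints; a larger n would need a proper multi-group scan.

// naive exclusive prefix sum of lenghtIn into offsets; single work-group only
__kernel void exclusive_scan( __global const int *lenghtIn,
                              __global int *offsets,
                              __local int *tmp,
                              int n ) {
    int i = get_local_id( 0 );   // launch with global size == local size == n

    // shift right by one so the result is exclusive: offsets[0] = 0
    tmp[ i ] = ( i > 0 ) ? lenghtIn[ i - 1 ] : 0;
    barrier( CLK_LOCAL_MEM_FENCE );

    for( int step = 1; step < n; step <<= 1 ) {
        int val = ( i >= step ) ? tmp[ i - step ] : 0;
        barrier( CLK_LOCAL_MEM_FENCE );
        tmp[ i ] += val;
        barrier( CLK_LOCAL_MEM_FENCE );
    }
    offsets[ i ] = tmp[ i ];
}

// with offsets precomputed, each work item does O(1) index work
// (crc32() is the asker's device function, as in the original kernel)
__kernel void crc32_precomputed( __global int *lenghtIn,
                                 __global unsigned char *In,
                                 __global int *offsets,
                                 __global int *OutCrc32,
                                 int size ) {
    int i = get_global_id( 0 );
    if( i >= size )
        return;
    OutCrc32[ i ] = crc32( In + offsets[ i ], lenghtIn[ i ] );
}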
