OpenCL matrix multiplication speed

Question

I wrote a small OpenCL application which calculates the product of two matrices. Now I've noticed that if the size of the matrix exceeds 8192 x 8192 there is a significant performance drop (the calculation for a 16384 x 16384 matrix is ~80 times slower), and even the serial implementation is over 5 times faster. Here is the host code:

/*Make some includes and definitions here*/
#include "stdafx.h"
#include <CL/cl.hpp>

#include <vector>
#include <iostream>

#include "util.hpp" // utility library

#define __CL_ENABLE_EXCEPTIONS
#define ROWS (16384)    // rows of matrices X, Y, and S
#define COLUMNS (16384)

/*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*/
#include "metrics.h"

/*Start main()*/

int main(void)
{
    int A = 0; // not used by the kernel; initialized so the buffer write below copies a defined value

    // Fill vectors X and Y with random float values

    float* h_x = new float[ROWS*COLUMNS];
    for (int i = 0; i < ROWS; ++i){
        for (int j = 0; j < COLUMNS; ++j){
            h_x[j + i*COLUMNS] = rand() / (float)RAND_MAX;
        }
    }
    float* h_y = new float[ROWS*COLUMNS];
    for (int i = 0; i < ROWS; ++i){
        for (int j = 0; j < COLUMNS; ++j){
            h_y[j + i*COLUMNS] = rand() / (float)RAND_MAX;
        }
    }
    float* h_s = new float[ROWS*COLUMNS];
    for (int i = 0; i < ROWS; ++i){
        for (int j = 0; j < COLUMNS; ++j){
            h_s[j + i*COLUMNS] = 0.0;
        }
    }

    /*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*/

    // Get all platforms (drivers)

    std::vector<cl::Platform> all_platforms;
    cl::Platform::get(&all_platforms);


    if (all_platforms.size() == 0){ // Check for issues
        std::cout << " No platforms found. Check OpenCL installation!\n";
        exit(1);
    }

    cl::Platform default_platform = all_platforms[0];
    std::cout << "Using platform: " << default_platform.getInfo<CL_PLATFORM_NAME>() << "\n";

    // Get default device of the default platform

    std::vector<cl::Device> all_devices;
    default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);

    if (all_devices.size() == 0){ // Check for issues
        std::cout << " No devices found. Check OpenCL installation!\n";
        exit(1);
    }

    cl::Device default_device = all_devices[0];
    std::cout << "Using device: " << default_device.getInfo<CL_DEVICE_NAME>() << "\n";

    // Create an OpenCL context

    cl::Context context({ default_device });

    cl::Program program(context, util::loadProgram("saxy_kernel.cl"), true);

    if (program.build({ default_device }) != CL_SUCCESS){
        std::cout << " Error building: " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device) << "\n";
        getchar();
        exit(1);
    }

    // create buffers on the device
    cl::Buffer buffer_X(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
    cl::Buffer buffer_Y(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
    cl::Buffer buffer_S(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
    cl::Buffer buffer_A(context, CL_MEM_READ_WRITE, sizeof(int));

    //create queue to which we will push commands for the device.
    cl::CommandQueue queue(context, default_device);

    //write matrices X and Y (and the int A) to the device
    queue.enqueueWriteBuffer(buffer_X, CL_TRUE, 0, sizeof(float)* ROWS*COLUMNS, &h_x[0]);
    queue.enqueueWriteBuffer(buffer_Y, CL_TRUE, 0, sizeof(float)* ROWS*COLUMNS, &h_y[0]);
    queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(int), &A);

    StartCounter();
    //run the kernel
    cl::Kernel kernel_add = cl::Kernel(program, "simple_add");
    kernel_add.setArg(0, buffer_X);
    kernel_add.setArg(1, buffer_Y);
    kernel_add.setArg(2, buffer_S);
    kernel_add.setArg(3, buffer_A);

    cl::NDRange global(ROWS*COLUMNS);
    queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, global, cl::NullRange);
    queue.finish();

    std::cout << "Kernel execution time: " << GetCounter() << "ms \n";

    //read result S from the device into h_s
    queue.enqueueReadBuffer(buffer_S, CL_TRUE, 0, sizeof(float)*ROWS*COLUMNS, &h_s[0]);



    /*Print vectors
    std::cout << "\nMatrix #1: \n";
    for (int i = 0; i<ROWS*COLUMNS; i++){


            std::cout << "" << h_x[i] << "\t ";

    }

    std::cout << "\n\nMatrix #2: \n";
    for (int i = 0; i<ROWS*COLUMNS; i++){


            std::cout << "" << h_y[i] << "\t ";

    }

    std::cout << "\n\nResult: \n";
    for (int i = 0; i<ROWS*COLUMNS; i++){


            std::cout << "" << h_s[i] << "\t ";

    }*/
    getchar();
    return 0;
}

And here is the kernel:

__kernel void simple_add(
   __global float* X, 
   __global float* Y, 
   __global float* S, 
   __global int *A){

   S[get_global_id(0)] = X[get_global_id(0)] * Y[get_global_id(0)];

}
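For reference, the kernel body reads one element of X and one element of Y per work-item, so what it computes is an element-wise product rather than a row-by-column matrix product. A minimal serial sketch of the same computation is shown below; the function name `elementwise_product` is hypothetical and is not the serial implementation mentioned in the question, which was not posted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the original post): serial equivalent of the
// simple_add kernel. Each output element depends on exactly one element of x
// and one element of y.
std::vector<float> elementwise_product(const std::vector<float>& x,
                                       const std::vector<float>& y)
{
    assert(x.size() == y.size());
    std::vector<float> s(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        s[i] = x[i] * y[i]; // same expression as the kernel body
    }
    return s;
}
```

Comparing a few entries of h_s against such a reference is one way to check that the device result is what was intended.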

Could you please explain the reason to me? I know that I can achieve much better performance if I perform some algorithmic optimizations, but I'm trying to figure out whether this is the threshold of the "naive" implementation or whether I'm doing something wrong (for example, assigning the work to groups incorrectly).

EDIT: As requested in the comments: the GPU I'm running the kernel on is an AMD R9 270 with 2GB of RAM. The CPU is an i7-4771 and the system has 8GB of RAM.

Answer 1

Writing an answer about "how to do more calculations per thread" because code formatting is non-existent in comments, and also covering a little on memory usage...

So, most OpenCL implementations will need to run more than a couple of instructions per thread (and the right number of threads) for efficient performance. But like I said in the comments, this is HIGHLY dependent on the actual architecture of the processing unit (GPU, CPU, or OpenCL-capable magical unit woven from unicorn hair, whatever it may be) - each manufacturer of GPUs, CPUs and unicorn weavers has its own ideas of how to make a very efficient unit, and they all tend to change their minds as time goes by too... ;)

To do a little more work in one thread, you could simply do:

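#define NUM_PER_THREAD 16
__kernel void simple_add(
 __global float* X, 
 __global float* Y, 
 __global float* S, 
 __global int *A)
{
   // Each work-item handles NUM_PER_THREAD consecutive elements, so the host
   // must launch (total elements / NUM_PER_THREAD) work-items.
   for (int i = 0; i < NUM_PER_THREAD; i++)
   {
      size_t index = get_global_id(0)*NUM_PER_THREAD + i;
      S[index] = X[index] * Y[index];
   }
}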

[This will do 1 x 16 blocks. It gets a bit more fun to try to do 16 x 16 or something like that, but it can be done if you know the size (width) of the matrix.]
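One consequence for the host side: with 16 elements per work-item, the global range has to shrink by the same factor, otherwise the kernel indexes past the end of the buffers. In the question's host code that would mean passing ROWS*COLUMNS / NUM_PER_THREAD to cl::NDRange instead of ROWS*COLUMNS. The sketch below is plain C++ that emulates the index mapping on the CPU so it can be checked without a device; the names `work_items_for` and `emulate_blocked_kernel` are hypothetical, and it assumes the element count is an exact multiple of NUM_PER_THREAD (true for 16384 x 16384).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t NUM_PER_THREAD = 16; // same value as in the kernel

// Hypothetical helper: number of work-items to launch when each one handles
// NUM_PER_THREAD consecutive elements. Assumes an exact multiple.
std::size_t work_items_for(std::size_t total_elements)
{
    assert(total_elements % NUM_PER_THREAD == 0);
    return total_elements / NUM_PER_THREAD;
}

// CPU emulation of the kernel's index mapping; "gid" plays the role of
// get_global_id(0). Every element is written exactly once.
std::vector<float> emulate_blocked_kernel(const std::vector<float>& x,
                                          const std::vector<float>& y)
{
    assert(x.size() == y.size());
    std::vector<float> s(x.size(), 0.0f);
    const std::size_t items = work_items_for(x.size());
    for (std::size_t gid = 0; gid < items; ++gid) {
        for (std::size_t i = 0; i < NUM_PER_THREAD; ++i) {
            const std::size_t index = gid * NUM_PER_THREAD + i;
            s[index] = x[index] * y[index];
        }
    }
    return s;
}
```

For sizes that are not a multiple of NUM_PER_THREAD, the kernel would additionally need a bounds check on index; that case is not covered by this sketch.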

Regarding memory: GPUs that have dedicated local memory (in other words, most graphics cards) will work MUCH faster if all the data fits in the graphics memory. Accessing "main" memory involves one of two approaches:

1. Long access times for each cache line when the GPU reads over the PCI Express bus [or whatever infrastructure is used] - this can be 100 or 1000x slower than "local" memory. The GPU also (most likely) has to ask the CPU whether the memory content is in its cache and, if so, wait further for the CPU to copy the data out to main memory...
2. "Page in/out", where the GPU stops and sends an interrupt to the CPU; the CPU finds some suitable lump [lump in this context is the technical term for "some amount of memory, most likely around 4K or a multiple thereof"] of memory to "remove" from the GPU memory and copies that out to main memory, then copies the required other lump of memory into the GPU memory - similar to when the OS swaps memory to/from the hard disk. And if you are unlucky, the GPU also has to do some interesting cache or TLB flushing to ensure that the correct data is being used.

Note that I still (in the last hour or so) haven't gained any particular insight into how AMD/ATI GPUs work, or how their OpenCL driver works. The above is a mixture of guessing/knowing how GPUs work in general, understanding how OpenCL works in general, and calculating the memory needed to store the three different arrays of 16K x 16K using float.
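The memory calculation mentioned above can be written out as a small sketch. It assumes a 4-byte float, treats the card's 2GB as 2 GiB, and ignores anything else resident in device memory (including the tiny int buffer); the names `bytes_for_matrices` and `fits_on_device` are hypothetical.

```cpp
#include <cstdint>

static_assert(sizeof(float) == 4, "sketch assumes a 4-byte float");

constexpr std::uint64_t GiB = 1ull << 30;

// Assumption: the 2GB card from the question is treated as exactly 2 GiB,
// with nothing else occupying device memory.
constexpr std::uint64_t device_memory = 2 * GiB;

// Hypothetical helper: bytes needed for `count` square float matrices of side n.
constexpr std::uint64_t bytes_for_matrices(std::uint64_t n, std::uint64_t count)
{
    return n * n * sizeof(float) * count;
}

// True if the three matrices X, Y and S of side n fit in device memory together.
constexpr bool fits_on_device(std::uint64_t n)
{
    return bytes_for_matrices(n, 3) <= device_memory;
}
```

Under these assumptions, one 16384 x 16384 float matrix takes 2^14 * 2^14 * 4 = 2^30 bytes (1 GiB), so the three buffers need 3 GiB and cannot all sit in 2 GiB of graphics memory, whereas at 8192 x 8192 they need 768 MiB in total and fit comfortably. That lines up with the slowdown appearing only above 8192 x 8192.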


Answer 1 (2, accepted, 2015-07-15 21:40:27)
