OpenCL Matrix Multiplication Speed

Question

I wrote a small OpenCL application which calculates the product of two matrices. Now I've noticed that if the size of the matrix exceeds 8192 x 8192 there is a significant performance drop (calculation for a 16384 x 16384 is ~80 times slower) and even the serial implementation is over 5 times faster. Here is the host code:

/*Make some includes and definitions here*/
#include "stdafx.h"
#include <CL/cl.hpp>

#include <vector>
#include <iostream>

#include "util.hpp" // utility library

#define __CL_ENABLE_EXCEPTIONS
#define ROWS (16384)    // ROWS of vectors a, b, and c
#define COLUMNS (16384)

/*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*/
#include "metrics.h"

/*Start main()*/

int main(void)
{
    int A;

    // Fill vectors X and Y with random float values

    float* h_x = new float[ROWS*COLUMNS];
    for (int i = 0; i < ROWS; ++i){
        for (int j = 0; j < COLUMNS; ++j){
            h_x[j + i*COLUMNS] = rand() / (float)RAND_MAX;;
        }
    }
    float* h_y = new float[ROWS*COLUMNS];
    for (int i = 0; i < ROWS; ++i){
        for (int j = 0; j < COLUMNS; ++j){
            h_y[j + i*COLUMNS] = rand() / (float)RAND_MAX;;
        }
    }
    float* h_s = new float[ROWS*COLUMNS];
    for (int i = 0; i < ROWS; ++i){
        for (int j = 0; j < COLUMNS; ++j){
            h_s[j + i*COLUMNS] = 0.0;
        }
    }

    /*~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~*/

    // Get all platforms (drivers)

    std::vector<cl::Platform> all_platforms;
    cl::Platform::get(&all_platforms);


    if (all_platforms.size() == 0){ // Check for issues
        std::cout << " No platforms found. Check OpenCL installation!\n";
        exit(1);
    }

    cl::Platform default_platform = all_platforms[0];
    std::cout << "Using platform: " << default_platform.getInfo<CL_PLATFORM_NAME>() << "\n";

    // Get default device of the default platform

    std::vector<cl::Device> all_devices;
    default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);

    if (all_devices.size() == 0){ // Check for issues
        std::cout << " No devices found. Check OpenCL installation!\n";
        exit(1);
    }

    cl::Device default_device = all_devices[0];
    std::cout << "Using device: " << default_device.getInfo<CL_DEVICE_NAME>() << "\n";

    // Create an OpenCL context

    cl::Context context({ default_device });

    cl::Program program(context, util::loadProgram("saxy_kernel.cl"), true);

    if (program.build({ default_device }) != CL_SUCCESS){
        std::cout << " Error building: " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device) << "\n";
        getchar();
        exit(1);
    }

    // create buffers on the device
    cl::Buffer buffer_X(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
    cl::Buffer buffer_Y(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
    cl::Buffer buffer_S(context, CL_MEM_READ_WRITE, sizeof(float)* ROWS*COLUMNS);
    cl::Buffer buffer_A(context, CL_MEM_READ_WRITE, sizeof(int));

    //create queue to which we will push commands for the device.
    cl::CommandQueue queue(context, default_device);

    //write arrays A and B to the device
    queue.enqueueWriteBuffer(buffer_X, CL_TRUE, 0, sizeof(float)* ROWS*COLUMNS, &h_x[0]);
    queue.enqueueWriteBuffer(buffer_Y, CL_TRUE, 0, sizeof(float)* ROWS*COLUMNS, &h_y[0]);
    queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(int), &A);

    StartCounter();
    //run the kernel
    cl::Kernel kernel_add = cl::Kernel(program, "simple_add");
    kernel_add.setArg(0, buffer_X);
    kernel_add.setArg(1, buffer_Y);
    kernel_add.setArg(2, buffer_S);
    kernel_add.setArg(3, buffer_A);

    cl::NDRange global(ROWS*COLUMNS);
    queue.enqueueNDRangeKernel(kernel_add, cl::NullRange, global, cl::NullRange);
    queue.finish();

    std::cout << "Kernel execution time: " << GetCounter() << "ms \n";

    //read result C from the device to array C
    queue.enqueueReadBuffer(buffer_S, CL_TRUE, 0, sizeof(float)*ROWS*COLUMNS, &h_s[0]);



    /*Print vectors
    std::cout << "\nMatrix #1: \n";
    for (int i = 0; i<ROWS*COLUMNS; i++){


            std::cout << "" << h_x[i] << "\t ";

    }

    std::cout << "\n\nMatrix #2: \n";
    for (int i = 0; i<ROWS*COLUMNS; i++){


            std::cout << "" << h_y[i] << "\t ";

    }

    std::cout << "\n\nResult: \n";
    for (int i = 0; i<ROWS*COLUMNS; i++){


            std::cout << "" << h_s[i] << "\t ";

    }*/
    getchar();
    return 0;
}

and here is the kernel:

__kernel void kernel simple_add(
   __global float* X, 
   __global float* Y, 
   __global float* S, 
   __global int *A){

   S[get_global_id(0)] = X[get_global_id(0)] * Y[get_global_id(0)];

}

Could you please explain me the reason? I know that I can achieve much better performance if I perform some algorithm optimizations, but I'm trying to figure out if this is the threshold of the "naive" implementation, or I'm doing something wrong (incorrect assignment of the work to groups).

EDIT: Because I was asked for in comments, the GPU I'm running the kernel is an AMD R9 270/2GB RAM. The CPU is an i7-4771 and the system has 8GB RAM.

Answer 1

Writing an answer about "how to do more calculations per thread" because code-formatting is non-existent in comments, and also covering a little on memory usage...

So, most OpenCL implementatins will need to run more than a couple of instructions per thread (and the right number of threads) for efficient performance. But like I said in comments, this is HIGHLY dependent on the actual architecture of the processing unit (GPU, CPU, or OpenCL-capable magical unit weaved from unicorn hair, whatever it may be) - each manufacturer of GPUs, CPUs and unicorn weavers have their own ideas of how to make a very efficient unit, and they all tend to change their mind as time flows too... ;)

To do a little more work in one thread you could simply do:

#define NUM_PER_THREAD 16
__kernel void kernel simple_add(
 __global float* X, 
 __global float* Y, 
 __global float* S, 
 __global int *A)
{

   for(i = 0; i < NUM_PER_THREAD; i++)
   {
      size_t index = get_global_id(0)*NUM_PER_THREAD + i;
      S[index] = X[index] * Y[index];
   }
}

[This will do 1 x 16 blocks. It gets a bit more fun to try to do 16 x 16 or something like that, but can be done if you know the size (width) of the matrix]

Regarding memory: GPU's that have dedicated local memory (in other words most graphics cards) will work MUCH faster if all the data fits in the graphics memory. Accessing "main" memory involves one of two approaches:

long access times for each cache-line when the GPU is reading over the PCI-express bus [or whatever infrastructure is used] - this can be 100 or 1000x slower than "local" memory. And the GPU also (most likely) has to ask the CPU if the memory content is in cache, and if so, wait further for the CPU to copy the data out to main memory...
"page in/out" where the GPU stops, sends an interrupt to the CPU, the CPU finds some suitable lump [lump in this context is the technical term for "some amount of memory most likely around 4K or multiple thereof"] of memory to "remove" from the GPU memory, and copies that out to main memory, then copies in the required other lump of memory to the GPU memory - similar to when the OS is swapping memory to/from the hard-disk. And if you are unlucky, the GPU also has to do some interesting cache or TLB flushing to ensure that the correct data is being used.

Note that I still (in the last hour or so) haven't got any particular insight in how the AMD/ATI GPU's work, or how their OpenCL driver works. The above is a mixture of guessing/knowing how GPUs work in general, understanding of how OpenCL works in general, and calculating the memory needed to store the three different arrays of 16K x 16K using float .

OpenCL Matrix Multiplication Speed

Question

1 answers

solution1
2 ACCPTED 2015-07-15 21:40:27

OpenCL Matrix Multiplication Speed

Question

1 answers

solution1 2 ACCPTED 2015-07-15 21:40:27

solution1
2 ACCPTED 2015-07-15 21:40:27