OpenCL simple matrix multiplication not returning proper results

I am trying to multiply two square matrices (32x32) using OpenCL with a C++ host program. I am trying to reproduce results from a book (OpenCL Programming By Example - R Banger, K Bhattacharyya), starting from a base from here. However, the results are wrong. I checked every part of the code and concluded that the "enqueueNDRangeKernel" call seems to be set up incorrectly. Could anyone please help me get past this? I am using an NVIDIA MX250 graphics card, but I suspect this code is running on the integrated Intel GPU. Here is the code:

#pragma comment(lib, "OpenCL.lib")

#include <iostream>
#include <CL/cl.hpp>
#include <vector>
#include <string>
#include <cstdlib>
#include <chrono>

using namespace std;

int main()
{
    int i;
    int dim = 1024;
    float* A = (float*)malloc(sizeof(float) * dim * dim);
    float* B = (float*)malloc(sizeof(float) * dim * dim);
    float* C = (float*)malloc(sizeof(float) * dim * dim);
    for (i = 0; i < dim * dim; i++)
    {
        A[i] = (float)(rand() % 10);
        B[i] = (float)(rand() % 10);
        C[i] = 0;
    }

    //get all platforms (drivers)
    std::vector<cl::Platform> all_platforms;
    cl::Platform::get(&all_platforms);
    if (all_platforms.size() == 0) {
        std::cout << " No platforms found. Check OpenCL installation!\n";
        exit(1);
    }
    cl::Platform default_platform = all_platforms[0];
    std::cout << "Using platform: " << default_platform.getInfo<CL_PLATFORM_NAME>() << "\n";

    //get default device of the default platform
    std::vector<cl::Device> all_devices;
    default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);
    if (all_devices.size() == 0) {
        std::cout << " No devices found. Check OpenCL installation!\n";
        exit(1);
    }
    cl::Device default_device = all_devices[0];
    std::cout << "Using device: " << default_device.getInfo<CL_DEVICE_NAME>() << "\n";


    cl::Context context({ default_device });

    cl::Program::Sources sources;

    // kernel computes one element of C = A * B per work-item (row-major storage)
    std::string kernel_code =
        "   void kernel simple_add(global const float* A, global const float* B, global float* C, int dim){       "
        "       int iCol = get_global_id(0);                            "
        "       int iRow = get_global_id(1);                            "
        "       float result = 0.0;                                     "
        "       for (int i = 0; i < dim; ++i) "
        "       { "
        "           result += A[iRow * dim + i] * B[i * dim + iCol]; "
        "       }   "
        "           "
        "       C[iRow * dim + iCol] = result; "
        "   }                                                                               ";

    sources.push_back({ kernel_code.c_str(),kernel_code.length() });

    cl::Program program(context, sources);
    if (program.build({ default_device }) != CL_SUCCESS) {
        std::cout << " Error building: " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device) << "\n";
        exit(1);
    }


    // create buffers on the device
    cl::Buffer buffer_A(context, CL_MEM_READ_WRITE, sizeof(int) * dim);
    cl::Buffer buffer_B(context, CL_MEM_READ_WRITE, sizeof(int) * dim);
    cl::Buffer buffer_C(context, CL_MEM_READ_WRITE, sizeof(int) * dim);

    //create queue to which we will push commands for the device.
    cl::CommandQueue queue(context, default_device);

    //write arrays A and B to the device
    queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(float) * dim, A);
    queue.enqueueWriteBuffer(buffer_B, CL_TRUE, 0, sizeof(float) * dim, B);

    //run the kernel

    cl::Kernel simple_add(program, "simple_add");
    simple_add.setArg(0, buffer_A);
    simple_add.setArg(1, buffer_B);
    simple_add.setArg(2, buffer_C);
    simple_add.setArg(3, dim);

    cl::NDRange global(32, 32);

    queue.enqueueNDRangeKernel(simple_add, cl::NullRange, global, cl::NullRange);
    queue.finish();

    // read result C from the device into array C
    queue.enqueueReadBuffer(buffer_C, CL_TRUE, 0, sizeof(float) * dim, C);

    for (int i = 0; i < dim; i++) {
        std::cout << C[i] << " ";
        C[i] = 0.0;
    }
}

I finally found the bug. I was passing dim = 1024, which is the length of the linearized matrix (32 x 32 elements stored as a vector), not the matrix dimension. It should be 32. I am posting this in case it helps someone in the future. Thanks to everyone who viewed the question and tried to answer.
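
To illustrate why the value of dim matters, here is a minimal CPU-side sketch that mirrors the kernel's row-major indexing. The helper names matmul_reference and max_index_read are hypothetical and not part of the original code; the sketch only assumes that dim means the number of rows and columns, as the kernel's indexing implies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference that mirrors the kernel's indexing:
// one (iRow, iCol) pair per work-item, row-major storage,
// dim = number of rows/columns of the square matrices.
std::vector<float> matmul_reference(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    std::size_t dim)
{
    // Each matrix must hold dim * dim elements, not dim.
    assert(A.size() == dim * dim && B.size() == dim * dim);

    std::vector<float> C(dim * dim, 0.0f);
    for (std::size_t iRow = 0; iRow < dim; ++iRow) {
        for (std::size_t iCol = 0; iCol < dim; ++iCol) {
            float result = 0.0f;
            for (std::size_t i = 0; i < dim; ++i) {
                result += A[iRow * dim + i] * B[i * dim + iCol];
            }
            C[iRow * dim + iCol] = result;
        }
    }
    return C;
}

// Hypothetical helper: largest index the kernel reads from A when it is
// launched over an n x n NDRange with the given dim argument
// (reached at iRow = n - 1, i = dim - 1).
std::size_t max_index_read(std::size_t n, std::size_t dim)
{
    return (n - 1) * dim + (dim - 1);
}
```

With a 32 x 32 NDRange, max_index_read(32, 32) gives 1023, which fits a buffer of 32 * 32 floats, while max_index_read(32, 1024) gives 32767, far past the end of such a buffer. Note that in the listing above the buffers and transfers are sized as dim elements, which matched 32 * 32 only while dim was 1024 (assuming 4-byte int and float). After changing dim to 32, those sizes would presumably need to become sizeof(float) * dim * dim, and the output of the device could then be compared against matmul_reference.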
