OpenCL kernel slower than a normal Java loop

Question

I've been looking into OpenCL as a way to optimize code and run tasks in parallel, to achieve greater speed than pure Java. Now I'm having a bit of an issue.

I've put together a Java program using LWJGL which, as far as I can tell, should be able to do nearly identical tasks (in this case, adding elements from two arrays together and storing the result in another array) in two separate ways: one with pure Java, and the other with an OpenCL kernel. I'm using System.currentTimeMillis() to keep track of how long each one takes for arrays with a large number of elements (~10,000,000). For whatever reason, the pure Java loop seems to execute around 3 to 10 times faster than the CL program, depending on array size. My code is as follows (imports omitted):

public class TestCL {

    private static final int SIZE = 9999999; //Size of arrays to test, this value is changed sometimes in between tests

    private static CLContext context; //CL Context
    private static CLPlatform platform; //CL platform
    private static List<CLDevice> devices; //List of CL devices
    private static CLCommandQueue queue; //Command Queue for context
    private static float[] aData, bData, rData; //float arrays to store test data

    //---Kernel Code---
    //The actual kernel script is here:
    //-----------------
    private static String kernel = "kernel void sum(global const float* a, global const float* b, global float* result, int const size){\n" + 
            "const int itemId = get_global_id(0);\n" + 
            "if(itemId < size){\n" + 
            "result[itemId] = a[itemId] + b[itemId];\n" +
            "}\n" +
            "}";

    public static void main(String[] args){

        aData = new float[SIZE];
        bData = new float[SIZE];
        rData = new float[SIZE]; //Only used for CPU testing

        //arbitrary testing data
        for(int i=0; i<SIZE; i++){
            aData[i] = i;
            bData[i] = SIZE - i;
        }

        try {
            testCPU(); //How long does it take running in traditional Java code on the CPU?
            testGPU(); //How long does the GPU take to run it w/ CL?
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    /**
     * Test the CPU with pure Java code
     */
    private static void testCPU(){
        long time = System.currentTimeMillis();
        for(int i=0; i<SIZE; i++){
            rData[i] = aData[i] + bData[i];
        }
        //Print the time FROM THE START OF THE testCPU() FUNCTION UNTIL NOW
        System.out.println("CPU processing time for " + SIZE + " elements: " + (System.currentTimeMillis() - time));
    }

    /**
     * Test the GPU with OpenCL
     * @throws LWJGLException
     */
    private static void testGPU() throws LWJGLException {
        CLInit(); //Initialize CL and CL Objects

        //Create the CL Program
        CLProgram program = CL10.clCreateProgramWithSource(context, kernel, null);

        int error = CL10.clBuildProgram(program, devices.get(0), "", null);
        Util.checkCLError(error);

        //Create the Kernel
        CLKernel sum = CL10.clCreateKernel(program, "sum", null);

        //Error checker
        IntBuffer eBuf = BufferUtils.createIntBuffer(1);

        //FloatBuffer for the first array of floats
        FloatBuffer aBuf = BufferUtils.createFloatBuffer(SIZE);
        aBuf.put(aData);
        aBuf.rewind();
        //Input buffer: the kernel only reads it, so it is flagged CL_MEM_READ_ONLY
        CLMem aMem = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, aBuf, eBuf);
        Util.checkCLError(eBuf.get(0));

        //And the second (also read-only from the kernel's point of view)
        FloatBuffer bBuf = BufferUtils.createFloatBuffer(SIZE);
        bBuf.put(bData);
        bBuf.rewind();
        CLMem bMem = CL10.clCreateBuffer(context, CL10.CL_MEM_READ_ONLY | CL10.CL_MEM_COPY_HOST_PTR, bBuf, eBuf);
        Util.checkCLError(eBuf.get(0));

        //Memory object to store the result; the kernel only writes it, so it is flagged CL_MEM_WRITE_ONLY
        CLMem rMem = CL10.clCreateBuffer(context, CL10.CL_MEM_WRITE_ONLY, SIZE * 4, eBuf);
        Util.checkCLError(eBuf.get(0));

        //Get time before setting kernel arguments
        long time = System.currentTimeMillis();

        sum.setArg(0, aMem);
        sum.setArg(1, bMem);
        sum.setArg(2, rMem);
        sum.setArg(3, SIZE);

        final int dim = 1;
        PointerBuffer workSize = BufferUtils.createPointerBuffer(dim);
        workSize.put(0, SIZE);

        //Actually running the program
        CL10.clEnqueueNDRangeKernel(queue, sum, dim, null, workSize, null, null, null);
        CL10.clFinish(queue);

        //Write results to a FloatBuffer
        FloatBuffer res = BufferUtils.createFloatBuffer(SIZE);
        CL10.clEnqueueReadBuffer(queue, rMem, CL10.CL_TRUE, 0, res, null, null);

        //How long did it take?
        //Print the time FROM THE SETTING OF KERNEL ARGUMENTS UNTIL NOW
        System.out.println("GPU processing time for " + SIZE + " elements: " + (System.currentTimeMillis() - time));

        //Cleanup objects
        CL10.clReleaseKernel(sum);
        CL10.clReleaseProgram(program);
        CL10.clReleaseMemObject(aMem);
        CL10.clReleaseMemObject(bMem);
        CL10.clReleaseMemObject(rMem);

        CLCleanup();
    }

    /**
     * Initialize CL objects
     * @throws LWJGLException
     */
    private static void CLInit() throws LWJGLException {
        IntBuffer eBuf = BufferUtils.createIntBuffer(1);

        CL.create();

        platform = CLPlatform.getPlatforms().get(0);
        devices = platform.getDevices(CL10.CL_DEVICE_TYPE_GPU);
        context = CLContext.create(platform, devices, eBuf);
        queue = CL10.clCreateCommandQueue(context, devices.get(0), CL10.CL_QUEUE_PROFILING_ENABLE, eBuf);

        Util.checkCLError(eBuf.get(0));
    }

    /**
     * Cleanup after CL completion
     */
    private static void CLCleanup(){
        CL10.clReleaseCommandQueue(queue);
        CL10.clReleaseContext(context);
        CL.destroy();
    }

}

Here are a few example console results from various tests:

CPU processing time for 10000000 elements: 24
GPU processing time for 10000000 elements: 88

CPU processing time for 1000000 elements: 7
GPU processing time for 1000000 elements: 10

CPU processing time for 100000000 elements: 193
GPU processing time for 100000000 elements: 943

Is there something wrong with my code that's causing the CL version to take longer, or is that actually to be expected in cases such as this? If it's the latter, when is CL preferable?

Answer 1

I revised the test to do something that I believe is more computationally expensive than simple addition.

Regarding the CPU test, the line:

rData[i] = aData[i] + bData[i];

was changed to:

rData[i] = (float)(Math.sin(aData[i]) * Math.cos(bData[i]));

And in the CL kernel, the line:

result[itemId] = a[itemId] + b[itemId];

was changed to:

result[itemId] = sin(a[itemId]) * cos(b[itemId]);

I'm now getting console results such as:

CPU processing time for 1000000 elements: 154
GPU processing time for 1000000 elements: 11

CPU processing time for 10000000 elements: 8699
GPU processing time for 10000000 elements: 98

(The CPU is taking longer than I'd like to bother with for tests of 100000000 elements.)

To check accuracy, I added checks that compare an arbitrary element of rData and res to ensure they're the same. I've omitted the output here; suffice it to say that they were equal.
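
A check along those lines might look like the sketch below. It is not the code from the post: the class name ResultCheck, the method firstMismatch, and the tolerance value are all hypothetical. It compares every element rather than one arbitrary element, and it uses a small relative tolerance instead of exact equality, on the assumption that the device's sin/cos and Java's Math.sin/Math.cos (computed in double, then cast to float) can differ in the last bits. In the real program the device results would first need to be copied out of the FloatBuffer, for example with res.get(float[]).

```java
public class ResultCheck {

    /**
     * Returns the index of the first element where the two arrays differ by
     * more than the given tolerance, or -1 if every element matches.
     * The tolerance is relative for large values and absolute near zero.
     */
    public static int firstMismatch(float[] expected, float[] actual, float tol) {
        if (expected.length != actual.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        for (int i = 0; i < expected.length; i++) {
            float diff = Math.abs(expected[i] - actual[i]);
            float scale = Math.max(1.0f,
                    Math.max(Math.abs(expected[i]), Math.abs(actual[i])));
            // Written as a negated <= so that a NaN on either side counts as a mismatch.
            if (!(diff <= tol * scale)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Stand-in data; in the test program these would be rData and the
        // contents of the res buffer.
        float[] cpu = {0.0f, 0.5f, -0.25f};
        float[] gpu = {0.0f, 0.5000001f, -0.25f};
        int index = firstMismatch(cpu, gpu, 1e-5f);
        System.out.println(index < 0 ? "all elements match" : "first mismatch at index " + index);
    }
}
```

Returning the index rather than a boolean makes it easy to print the two differing values when a mismatch does show up.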

Now that the function is more complicated (two trigonometric functions multiplied together), the CL kernel appears to be much more efficient than the pure Java loop.
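
One way to see why the two workloads behave so differently is to measure the per-element cost of each loop body on the CPU alone. The sketch below is a CPU-only illustration, not the benchmark from the post: the class name LoopCost, its method names, the array size, and the number of warm-up passes are all assumptions. It uses System.nanoTime() and a few warm-up passes so that JIT compilation is less likely to distort the numbers. The addition does one floating-point operation per 12 bytes of memory traffic (two float reads and one float write), so there is very little computation for an accelerator to speed up relative to the cost of moving the data. The sin/cos version does far more arithmetic on the same amount of data.

```java
public class LoopCost {

    /** Runs r[i] = a[i] + b[i] over the arrays and returns the elapsed nanoseconds. */
    public static long timeAdd(float[] a, float[] b, float[] r) {
        long start = System.nanoTime();
        for (int i = 0; i < r.length; i++) {
            r[i] = a[i] + b[i];
        }
        return System.nanoTime() - start;
    }

    /** Runs r[i] = sin(a[i]) * cos(b[i]) over the arrays and returns the elapsed nanoseconds. */
    public static long timeSinCos(float[] a, float[] b, float[] r) {
        long start = System.nanoTime();
        for (int i = 0; i < r.length; i++) {
            r[i] = (float) (Math.sin(a[i]) * Math.cos(b[i]));
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final int n = 1_000_000; // assumed size, smaller than in the post to keep the run short
        float[] a = new float[n];
        float[] b = new float[n];
        float[] r = new float[n];
        // Same arbitrary test data pattern as in the question.
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = n - i;
        }

        // Warm-up passes give the JIT compiler a chance to optimize both loops.
        for (int pass = 0; pass < 5; pass++) {
            timeAdd(a, b, r);
            timeSinCos(a, b, r);
        }

        long addNanos = timeAdd(a, b, r);
        long trigNanos = timeSinCos(a, b, r);
        System.out.printf("add:     %.2f ns per element%n", (double) addNanos / n);
        System.out.printf("sin*cos: %.2f ns per element%n", (double) trigNanos / n);
        // Printing one result keeps the computed values observably in use.
        System.out.println("sample result: " + r[n / 2]);
    }
}
```

The absolute numbers depend on the machine and the JVM, so the ratio between the two lines is the part worth looking at.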

Answer 1: score 0, accepted, 2016-01-02 18:43:34
