
How to reduce the time cost of parallel_for in DPC++?

I've written the following code in DPC++ to measure time consumption.

// (code defining the sub-devices omitted)
cl::sycl::queue q[4] = {cl::sycl::queue{SubDevices1[0]}, cl::sycl::queue{SubDevices1[1]},
                        cl::sycl::queue{SubDevices2[0]}, cl::sycl::queue{SubDevices2[1]}};

void run(){
    for(int i = 0; i < 4; i++){
        q[i].submit([&](auto &h) {
            h.parallel_for(
                sycl::nd_range<2>(sycl::range<2>(1, 1), sycl::range<2>(1, 1)),
                [=](sycl::nd_item<2> it){
                    // just an empty kernel
                });
        });
    }
}

It takes about 0.6 s.

When testing one queue with a single parallel_for, it takes about 0.15 s.
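For reference, a minimal sketch of how such a measurement could be taken (an assumption on my part: wall-clock timing around a single submit plus a wait on one GPU queue; the selector and timing method are not from the original post):

#include <CL/sycl.hpp>
#include <chrono>
#include <iostream>

int main() {
    // Single queue on a GPU device (assumed selector for illustration).
    cl::sycl::queue q{cl::sycl::gpu_selector{}};

    auto t0 = std::chrono::steady_clock::now();
    q.submit([&](cl::sycl::handler &h) {
        h.parallel_for(
            cl::sycl::nd_range<2>(cl::sycl::range<2>(1, 1), cl::sycl::range<2>(1, 1)),
            [=](cl::sycl::nd_item<2>) { /* empty kernel, as in run() */ });
    });
    q.wait();  // wait so the measurement includes kernel completion
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "submit + wait: "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    return 0;
}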

An even stranger thing happens when testing

q[i].submit([&](auto &h) {h.memcpy(...);});

When the copied array is small, this command takes almost no time.

How can I optimize the code in run()? Thanks very much!

If you run on different devices, then all queues will execute in parallel.

If you want to run on a single device, you need to create a separate context for each queue; then the queues will execute in parallel. For example:

context c1{};
queue q1{c1, gpu_selector()};
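Applied to the four sub-device queues from the question, this could look roughly like the following sketch (assuming SubDevices1 and SubDevices2 are the sub-device vectors from the omitted setup code; the queue(context, device) constructor is standard SYCL):

// One context per queue so submissions are not serialized on a shared context
// (sketch; SubDevices1/SubDevices2 come from the question's omitted setup).
cl::sycl::context c[4] = {cl::sycl::context{SubDevices1[0]},
                          cl::sycl::context{SubDevices1[1]},
                          cl::sycl::context{SubDevices2[0]},
                          cl::sycl::context{SubDevices2[1]}};

cl::sycl::queue q[4] = {cl::sycl::queue{c[0], SubDevices1[0]},
                        cl::sycl::queue{c[1], SubDevices1[1]},
                        cl::sycl::queue{c[2], SubDevices2[0]},
                        cl::sycl::queue{c[3], SubDevices2[1]}};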
