
How to reduce the time cost of parallel_for in DPC++?

I've written the following code in DPC++ to measure time consumption.

// (code defining the sub-devices omitted)
cl::sycl::queue q[4] = {cl::sycl::queue{SubDevices1[0]}, cl::sycl::queue{SubDevices1[1]},
                        cl::sycl::queue{SubDevices2[0]}, cl::sycl::queue{SubDevices2[1]}};

void run(){
    for(int i = 0; i < 4; i++){
        q[i].submit([&](auto &h) {
            h.parallel_for(
                sycl::nd_range<2>(sycl::range<2>(1, 1), sycl::range<2>(1, 1)),
                [=](sycl::nd_item<2> it){
                    // just an empty kernel
                });
        });
    }
}

It takes about 0.6 s.

When testing one queue with a single parallel_for, it takes about 0.15 s.
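For reference, a minimal sketch of how such a measurement could be taken (an assumption on my part: wall-clock timing around a single submit plus a wait on one GPU queue; the selector and timing method are not from the original post):

#include <CL/sycl.hpp>
#include <chrono>
#include <iostream>

int main() {
    // Single queue on a GPU device (assumed selector for illustration).
    cl::sycl::queue q{cl::sycl::gpu_selector{}};

    auto t0 = std::chrono::steady_clock::now();
    q.submit([&](cl::sycl::handler &h) {
        h.parallel_for(
            cl::sycl::nd_range<2>(cl::sycl::range<2>(1, 1), cl::sycl::range<2>(1, 1)),
            [=](cl::sycl::nd_item<2>) { /* empty kernel, as in run() */ });
    });
    q.wait();  // wait so the measurement includes kernel completion
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "submit + wait: "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    return 0;
}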

An even stranger thing happens when testing

q[i].submit([&](auto &h) {h.memcpy(...);});

When the copied array is small, this command takes almost no time.

How can I optimize the code in run()? Thanks very much!

If you run on different devices, then all queues will execute in parallel.

If you want to run on a single device, you need to create a separate context for each queue; then the queues will execute in parallel. For example:

context c1{};
queue q1{c1, gpu_selector()};
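Applied to the four sub-device queues from the question, this could look roughly like the following sketch (assuming SubDevices1 and SubDevices2 are the sub-device vectors from the omitted setup code; the queue(context, device) constructor is standard SYCL):

// One context per queue so submissions are not serialized on a shared context
// (sketch; SubDevices1/SubDevices2 come from the question's omitted setup).
cl::sycl::context c[4] = {cl::sycl::context{SubDevices1[0]},
                          cl::sycl::context{SubDevices1[1]},
                          cl::sycl::context{SubDevices2[0]},
                          cl::sycl::context{SubDevices2[1]}};

cl::sycl::queue q[4] = {cl::sycl::queue{c[0], SubDevices1[0]},
                        cl::sycl::queue{c[1], SubDevices1[1]},
                        cl::sycl::queue{c[2], SubDevices2[0]},
                        cl::sycl::queue{c[3], SubDevices2[1]}};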
