buffers in CCL code samples along with the oneapi toolkit

Question

I Was going through the CCL code samples along with the oneapi toolkit. In the below DPC++(SYCL) code initially sendbuf a buffer is created in the cpu side and is not initialised and in the part where offloading to target device takes place the dev_acc_sbuf[id] variable, which is a variable in the kernel scope is modified. This variable(dev_acc_sbuf) is not hence used in the program neither is its value copied back to sendbuf.Then in the next line the sendbuf variable is used for allreduce. I am not able to understand how changing the dev_acc_sbuf makes change in the sendbuf.

          cl::sycl::queue q;
cl::sycl::buffer<int, 1> sendbuf(COUNT);
          /* open sendbuf and modify it on the target device side */
q.submit([&](cl::sycl::handler& cgh) {
   auto dev_acc_sbuf = sendbuf.get_access<mode::write>(cgh);
   cgh.parallel_for<class allreduce_test_sbuf_modify>(range<1>{COUNT}, [=](item<1> id) {
       dev_acc_sbuf[id] += 1;
   });
});
/* invoke ccl_allreduce on the CPU side */
ccl_allreduce(&sendbuf,
              &recvbuf,
              COUNT,
              ccl_dtype_int,
              ccl_reduction_sum,
              NULL,
              NULL,
              stream,
              &request);

Answer 1

In the line " auto dev_acc_sbuf = sendbuf.get_access<mode::write>(cgh); " the dev_acc_sbuf is a handle that accesses sendbuf and not a seperate buffer. The changes made in the dev_acc_sbuf handle gets reflected to the original buffer ie the sendbuffer. This is an advantage in SYCL as the changes made in the kernel scope is automatically copied back to the original variable

Answer 2

On most systems, the host and the device do not share physical memory, the CPU might use RAM and the GPU might use its own global memory. SYCL needs to know which data it will be sharing between the host and the devices.

For this purpose, SYCL uses its buffers, the buffer class is generic over the element type and the number of dimensions. When passed a raw pointer, the buffer(T* ptr, range size) constructor takes ownership of the memory it has been passed. This means that we absolutely cannot use that memory ourselves while the buffer exists, which is why we begin a C++ scope. At the end of their scope, the buffers will be destroyed and the memory returned to the user. A size argument is a range object, which has to have the same number of dimensions as the buffer and is initialized with the number of elements in each dimension. Here, we have one dimension with one element.

Buffers are not associated with a particular queue or context, so they are capable of handling data transparently between multiple devices.

Accessors are used to access request control over the device memory from the buffer objects. Their modes will take care of data movement between host and device. So we need not have to explicitly copy back the result from device to host.

Below is the example for more clarification:

#include <bits/stdc++.h>
#include <CL/sycl.hpp>

using namespace std;
class vector_addition;

int main(int, char**) {
   //creating host memory
   int *a=(int *)malloc(10*sizeof(int));
   int *b=(int *)malloc(10*sizeof(int));
   int *c=(int *)malloc(10*sizeof(int));

   for(int i=0;i<10;i++){
       a[i]=i;
       b[i]=10-i;
   }

   cl::sycl::default_selector device_selector;

   cl::sycl::queue queue(device_selector);
   std::cout << "Running on "<< queue.get_device().get_info<cl::sycl::info::device::name>()<< "\n";

 {
    //creating buffer from pointer of host memory
    cl::sycl::buffer<int, 1> a_sycl{a, cl::sycl::range<1>{10} };
    cl::sycl::buffer<int, 1> b_sycl{b, cl::sycl::range<1>{10} };
    cl::sycl::buffer<int, 1> c_sycl{c, cl::sycl::range<1>{10} };

    queue.submit([&] (cl::sycl::handler& cgh) {
       //creating accessor of buffer with proper mode
       auto a_acc = a_sycl.get_access<cl::sycl::access::mode::read>(cgh);
       auto b_acc = b_sycl.get_access<cl::sycl::access::mode::read>(cgh);
       auto c_acc = c_sycl.get_access<cl::sycl::access::mode::write>(cgh);//responsible for copying back to host memory 

       //kernel for execution
       cgh.parallel_for<class vector_addition>(cl::sycl::range<1>{ 10 }, [=](cl::sycl::id<1> idx) {
       c_acc[idx] = a_acc[idx] + b_acc[idx];
       });

    });
 }

 for(int i=0;i<10;i++){
     cout<<c[i]<<" ";    
 }
 cout<<"\n";
 return 0;
}

buffers in CCL code samples along with the oneapi toolkit

Question

2 answers

solution1
3 ACCPTED 2019-11-20 13:09:59

solution2
2 2019-11-26 07:41:23

buffers in CCL code samples along with the oneapi toolkit

Question

2 answers

solution1 3 ACCPTED 2019-11-20 13:09:59

solution2 2 2019-11-26 07:41:23

solution1
3 ACCPTED 2019-11-20 13:09:59

solution2
2 2019-11-26 07:41:23