I would like to make run an old N-body which uses OpenCL.
I have 2 cards NVIDIA A6000 with NVLink
, a component which binds from an hardware (and maybe software ?) point of view these 2 GPU cards.
But at the execution, I get the following result :
Here is the kernel code used (I have put pragma that I estimate useful for NVIDIA cards) :
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel
void
nbody_sim(
__global double4* pos ,
__global double4* vel,
int numBodies,
double deltaTime,
double epsSqr,
__local double4* localPos,
__global double4* newPosition,
__global double4* newVelocity)
{
unsigned int tid = get_local_id(0);
unsigned int gid = get_global_id(0);
unsigned int localSize = get_local_size(0);
// Gravitational constant
double G_constant = 227.17085e-74;
// Number of tiles we need to iterate
unsigned int numTiles = numBodies / localSize;
// position of this work-item
double4 myPos = pos[gid];
double4 acc = (double4) (0.0f, 0.0f, 0.0f, 0.0f);
for(int i = 0; i < numTiles; ++i)
{
// load one tile into local memory
int idx = i * localSize + tid;
localPos[tid] = pos[idx];
// Synchronize to make sure data is available for processing
barrier(CLK_LOCAL_MEM_FENCE);
// Calculate acceleration effect due to each body
// a[i->j] = m[j] * r[i->j] / (r^2 + epsSqr)^(3/2)
for(int j = 0; j < localSize; ++j)
{
// Calculate acceleration caused by particle j on particle i
double4 r = localPos[j] - myPos;
double distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
double invDist = 1.0f / sqrt(distSqr + epsSqr);
double invDistCube = invDist * invDist * invDist;
double s = G_constant * localPos[j].w * invDistCube;
// accumulate effect of all particles
acc += s * r;
}
// Synchronize so that next tile can be loaded
barrier(CLK_LOCAL_MEM_FENCE);
}
double4 oldVel = vel[gid];
// updated position and velocity
double4 newPos = myPos + oldVel * deltaTime + acc * 0.5f * deltaTime * deltaTime;
newPos.w = myPos.w;
double4 newVel = oldVel + acc * deltaTime;
// write to global memory
newPosition[gid] = newPos;
newVelocity[gid] = newVel;
}
The part of code which sets up the Kernel code is below :
int NBody::setupCL()
{
cl_int status = CL_SUCCESS;
cl_event writeEvt1, writeEvt2;
// The block is to move the declaration of prop closer to its use
cl_command_queue_properties prop = 0;
commandQueue = clCreateCommandQueue(
context,
devices[current_device],
prop,
&status);
CHECK_OPENCL_ERROR( status, "clCreateCommandQueue failed.");
...
// create a CL program using the kernel source
const char *kernelName = "NBody_Kernels.cl";
FILE *fp = fopen(kernelName, "r");
if (!fp) {
fprintf(stderr, "Failed to load kernel.\n");
exit(1);
}
char *source = (char*)malloc(10000);
int sourceSize = fread( source, 1, 10000, fp);
fclose(fp);
// Create a program from the kernel source
program = clCreateProgramWithSource(context, 1, (const char **)&source, (const size_t *)&sourceSize, &status);
// Build the program
status = clBuildProgram(program, 1, devices, NULL, NULL, NULL);
// get a kernel object handle for a kernel with the given name
kernel = clCreateKernel(
program,
"nbody_sim",
&status);
CHECK_OPENCL_ERROR(status, "clCreateKernel failed.");
status = waitForEventAndRelease(&writeEvt1);
CHECK_ERROR(status, NBODY_SUCCESS, "WaitForEventAndRelease(writeEvt1) Failed");
status = waitForEventAndRelease(&writeEvt2);
CHECK_ERROR(status, NBODY_SUCCESS, "WaitForEventAndRelease(writeEvt2) Failed");
return NBODY_SUCCESS;
}
So, the errors occurs at the creation of the Kernel code . Is there a way to consider the 2 GPU
as a unique GPU with NVLINK component
? I mean from a software point of view ?
How can I fix this error of creation of Kernel code ?
Your kernel code looks good and the cache tiling implementation is correct. Only make sure that the number of bodies is a multiple of local size, or alternatively limit the inner for loop to the global size additionally.
OpenCL allows usage of multiple devices in parallel. You need to make a thread with a queue for each device separately . You also need to take care of device-device communications and synchronization manually. Data transfer happens over PCIe (you also can do remote direct memory access); but you can't use NVLink with OpenCL. This should not be an issue in your case though as you need only little data transfer compared to the amount of arithmetic.
A few more remarks:
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.