I'm using the advice given here for choosing an optimal GPU for my algorithm: https://stackoverflow.com/a/33488953/5371117
I query the devices on my MacBook Pro using boost::compute::system::devices(), which returns the following list of devices:
Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
Intel(R) UHD Graphics 630
AMD Radeon Pro 560X Compute Engine
I want to use the AMD Radeon Pro 560X Compute Engine for my purpose, but when I iterate over the devices to find the one with the maximum rating = CL_DEVICE_MAX_CLOCK_FREQUENCY * CL_DEVICE_MAX_COMPUTE_UNITS, I get the following results:
Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz,
freq: 2600, compute units: 12, rating:31200
Intel(R) UHD Graphics 630,
freq: 1150, units: 24, rating:27600
AMD Radeon Pro 560X Compute Engine,
freq: 300, units: 16, rating:4800
The AMD GPU has the lowest rating. I also looked into the specs, and it seems to me that CL_DEVICE_MAX_CLOCK_FREQUENCY isn't returning the correct value.
According to the AMD chip specs (https://www.amd.com/en/products/graphics/radeon-rx-560x), my AMD GPU has a base frequency of 1175 MHz, not 300 MHz.
According to the Intel chip specs (https://en.wikichip.org/wiki/intel/uhd_graphics/630), my Intel GPU has a base frequency of 300 MHz, not 1150 MHz, but it does have a boost frequency of 1150 MHz.
std::vector<boost::compute::device> devices = boost::compute::system::devices();
std::pair<boost::compute::device, ai::int64> suitableDevice{};
for(auto& device : devices)
{
    auto rating = device.clock_frequency() * device.compute_units();
    std::cout << device.name() << ", freq: " << device.clock_frequency() << ", units: " << device.compute_units() << ", rating: " << rating << std::endl;
    if(suitableDevice.second < rating)
    {
        suitableDevice.first = device;
        suitableDevice.second = rating;
    }
}
Am I doing anything wrong?
Those properties are unfortunately only really directly comparable within an implementation (same HW manufacturer, same OS).
My recommendation would be to:

- Filter out anything that isn't of type CL_DEVICE_TYPE_GPU (unless there aren't any GPUs available, in which case you may want to fall back to the CPU).
- Check the CL_DEVICE_HOST_UNIFIED_MEMORY property. Devices that report unified host memory will be integrated GPUs, and these are usually slower than discrete ones, unless you are bound by data transfer speeds, in which case they might be faster. So you'll want to prefer one type over the other.

This code will return the device with the most floating-point performance:
select_device_with_most_flops(find_devices());
and this returns the device with the most memory:
select_device_with_most_memory(find_devices());
First, find_devices() returns a vector of all OpenCL devices in the system. select_device_with_most_memory() is straightforward and uses getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>().
Floating-point performance is given by this equation: FLOPs/s = cores/CU * CUs * IPC * clock frequency
select_device_with_most_flops() is more difficult, because OpenCL only provides the number of compute units (CUs) via getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(). For a CPU, this is the number of threads; for a GPU, it has to be multiplied by the number of stream processors / CUDA cores per CU, which differs between Nvidia, AMD and Intel as well as between their microarchitectures, and is usually between 4 and 128. Luckily, the vendor is included in getInfo<CL_DEVICE_VENDOR>(), so based on the vendor and the number of CUs one can figure out the number of cores per CU.
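As a standalone sketch, that vendor-based lookup might look like the following (the cores-per-CU values are the rough per-architecture estimates discussed here, and the function name is mine, not from any OpenCL API):

```cpp
#include <cassert>
#include <string>

// Rough cores-per-CU estimate from the vendor string and device type.
// The substring checks tolerate vendor strings like "Advanced Micro Devices, Inc.".
unsigned cores_per_cu(const std::string& vendor, unsigned compute_units, bool is_gpu) {
    if(vendor.find("NVIDIA") != std::string::npos || vendor.find("vidia") != std::string::npos)
        return compute_units <= 30u ? 128u : 64u; // 128 cores/CU, except Volta/Turing (>30 CUs) with 64
    if(vendor.find("AMD") != std::string::npos || vendor.find("dvanced") != std::string::npos)
        return is_gpu ? 64u : 1u; // GCN GPUs: 64 cores/CU; AMD CPUs: 1 core (thread)/CU
    if(vendor.find("INTEL") != std::string::npos || vendor.find("ntel") != std::string::npos)
        return is_gpu ? 8u : 1u;  // integrated GPUs: 8 cores/CU; Intel CPUs: 1 core (thread)/CU
    return 1u; // unknown vendor: conservative fallback
}
```

For the devices from the question this yields 1 core/CU for the i7-8850H, 8 for the UHD 630 and 64 for the Radeon Pro 560X.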
The next part is the FP32 IPC, or instructions per clock. For most GPUs this is 2, while for recent CPUs it is 32; see https://en.wikipedia.org/wiki/FLOPS?oldformat=true#FLOPs_per_cycle_for_various_processors . There is no way to figure out the IPC directly in OpenCL, so the 32 for CPUs is just a guess. One could use the device name and a lookup table to be more accurate. getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU evaluates to true if the device is a GPU.
The final part is the clock frequency: OpenCL provides the base clock frequency in MHz via getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(). The device can boost to higher frequencies, so this again is an approximation.
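As a quick sanity check, the estimate can be evaluated by hand for the devices from the question. This small helper is hypothetical (not part of the answer's code), and the cores-per-CU and IPC figures are the heuristics described above:

```cpp
#include <cassert>
#include <cmath>

// Estimated FP32 performance in TFLOPs/s:
// FLOPs/s = cores/CU * CUs * IPC * clock frequency (MHz)
float estimated_tflops(unsigned cores_per_cu, unsigned cus, unsigned ipc, unsigned mhz) {
    return 1E-6f * (float)(cores_per_cu * cus) * (float)ipc * (float)mhz;
}
```

With 64 cores/CU * 16 CUs * IPC 2 * 300 MHz, the Radeon Pro 560X comes out at about 0.61 TFLOPs/s even with the underreported clock (about 2.4 TFLOPs/s at its 1175 MHz boost clock), ahead of the UHD 630 at about 0.44 TFLOPs/s (8 cores/CU * 24 CUs * 2 * 1150 MHz), so this metric ranks the two GPUs the way the asker expects.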
All of it together gives an estimation for the floating-point performance. The full code is shown below:
// note: this listing assumes the OpenCL C++ bindings (Device, Platform) and std string/vector are in scope;
// print(), print_error() and print_device_list() are helper functions from the answer author's own code
typedef unsigned int uint;
string trim(const string s) { // removes whitespace characters from beginning and end of string s
    const int l = (int)s.length();
    int a=0, b=l-1;
    char c;
    while(a<l && ((c=s.at(a))==' '||c=='\t'||c=='\n'||c=='\v'||c=='\f'||c=='\r'||c=='\0')) a++;
    while(b>a && ((c=s.at(b))==' '||c=='\t'||c=='\n'||c=='\v'||c=='\f'||c=='\r'||c=='\0')) b--;
    return s.substr(a, 1+b-a);
}
bool contains(const string s, const string match) {
    return s.find(match)!=string::npos;
}
vector<Device> find_devices() {
    vector<Platform> platforms; // get all platforms (drivers)
    vector<Device> devices_available;
    vector<Device> devices; // get all devices of all platforms
    Platform::get(&platforms);
    if(platforms.size()==0) print_error("There are no OpenCL devices available. Make sure that the OpenCL 1.2 Runtime for your device is installed. For GPUs it comes by default with the graphics driver, for CPUs it has to be installed separately.");
    for(uint i=0; i<(uint)platforms.size(); i++) {
        devices_available.clear();
        platforms[i].getDevices(CL_DEVICE_TYPE_ALL, &devices_available); // CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU
        if(devices_available.size()==0) continue; // no devices found in platform i
        for(uint j=0; j<(uint)devices_available.size(); j++) devices.push_back(devices_available[j]);
    }
    print_device_list(devices);
    return devices;
}
Device select_device_with_most_flops(const vector<Device> devices) { // return device with best floating-point performance
    float best_value = 0.0f;
    uint best_i = 0; // index of fastest device
    for(uint i=0; i<(uint)devices.size(); i++) { // find device with highest (estimated) floating-point performance
        const Device d = devices[i];
        //const string device_name = trim(d.getInfo<CL_DEVICE_NAME>());
        const string device_vendor = trim(d.getInfo<CL_DEVICE_VENDOR>()); // is either Nvidia, AMD or Intel
        const uint device_compute_units = (uint)d.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(); // compute units (CUs) can contain multiple cores depending on the microarchitecture
        const bool device_is_gpu = d.getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU;
        const uint device_ipc = device_is_gpu?2u:32u; // IPC (instructions per cycle) is 2 for GPUs and 32 for most modern CPUs
        const uint nvidia = (uint)(contains(device_vendor, "NVIDIA")||contains(device_vendor, "vidia"))*(device_compute_units<=30u?128u:64u); // Nvidia GPUs usually have 128 cores/CU, except Volta/Turing (>30 CUs) which have 64 cores/CU
        const uint amd = (uint)(contains(device_vendor, "AMD")||contains(device_vendor, "ADVANCED")||contains(device_vendor, "dvanced"))*(device_is_gpu?64u:1u); // AMD GCN GPUs usually have 64 cores/CU, AMD CPUs have 1 core/CU
        const uint intel = (uint)(contains(device_vendor, "INTEL")||contains(device_vendor, "ntel"))*(device_is_gpu?8u:1u); // Intel integrated GPUs usually have 8 cores/CU, Intel CPUs have 1 core/CU
        const uint device_cores = device_compute_units*(nvidia+amd+intel);
        const uint device_clock_frequency = (uint)d.getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(); // in MHz
        const float device_tflops = 1E-6f*(float)device_cores*(float)device_ipc*(float)device_clock_frequency; // estimated device floating-point performance in TeraFLOPs/s
        if(device_tflops>best_value) {
            best_value = device_tflops;
            best_i = i; // index of fastest device
        }
    }
    return devices[best_i];
}
Device select_device_with_most_memory(const vector<Device> devices) { // return device with largest memory capacity
    float best_value = 0.0f;
    uint best_i = 0; // index of device with most memory
    for(uint i=0; i<(uint)devices.size(); i++) { // find device with largest memory capacity
        const Device d = devices[i];
        const float device_memory = 1E-3f*(float)(d.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>()/1048576ull); // in GB
        if(device_memory>best_value) {
            best_value = device_memory;
            best_i = i; // index of device with most memory
        }
    }
    return devices[best_i];
}
Device select_device_with_id(const vector<Device> devices, const int id) { // return device with the given id
    if(id>=0&&id<(int)devices.size()) {
        return devices[id];
    } else {
        print("Your selected device ID ("+to_string(id)+") is wrong.");
        return devices[0]; // is never executed, just to avoid compiler warnings
    }
}
UPDATE: I have now included an improved version of this in a lightweight OpenCL-Wrapper. This correctly calculates the FLOPs for all CPUs and GPUs from the last decade or so: https://github.com/ProjectPhysX/OpenCL-Wrapper