
Problem choosing the best available GPU programmatically using OpenCL

I'm using the advice given here for choosing an optimal GPU for my algorithm. https://stackoverflow.com/a/33488953/5371117

I query the devices on my MacBook Pro using boost::compute::system::devices(), which returns the following list of devices:

Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
Intel(R) UHD Graphics 630
AMD Radeon Pro 560X Compute Engine

I want to use the AMD Radeon Pro 560X Compute Engine for my purpose, but when I iterate over the devices to find the one with the maximum rating = CL_DEVICE_MAX_CLOCK_FREQUENCY * CL_DEVICE_MAX_COMPUTE_UNITS, I get the following results:

Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz, 
freq: 2600, compute units: 12, rating:31200

Intel(R) UHD Graphics 630, 
freq: 1150, units: 24, rating:27600

AMD Radeon Pro 560X Compute Engine, 
freq: 300, units: 16, rating:4800

The AMD GPU has the lowest rating. I also looked into the specs, and it seems to me that CL_DEVICE_MAX_CLOCK_FREQUENCY isn't returning the correct value.

According to the AMD chip specs https://www.amd.com/en/products/graphics/radeon-rx-560x , my AMD GPU has a base frequency of 1175 MHz, not 300 MHz.

According to the Intel chip specs https://en.wikichip.org/wiki/intel/uhd_graphics/630 , my Intel GPU has a base frequency of 300 MHz, not 1150 MHz, but it does have a boost frequency of 1150 MHz.

std::vector<boost::compute::device> devices = boost::compute::system::devices();

std::pair<boost::compute::device, ai::int64> suitableDevice{};

for(auto& device : devices)
{
    auto rating = device.clock_frequency() * device.compute_units();
    std::cout << device.name() << ", freq: " << device.clock_frequency()
              << ", units: " << device.compute_units() << ", rating: " << rating << std::endl;
    if(suitableDevice.second < rating)
    {
        suitableDevice.first = device;
        suitableDevice.second = rating;
    }
}

Am I doing anything wrong?

Those properties are unfortunately only really directly comparable within an implementation (same HW manufacturer, same OS).

My recommendation would be to:

  • First filter out anything with a device type other than CL_DEVICE_TYPE_GPU (unless there aren't any GPUs available, in which case you may want to fall back to CPU).
  • Check any other important device properties. For example, if you need support for a particular OpenCL version or extension, or if you need especially large work groups or local memory, check all remaining devices and filter out any that can't run your code.
  • Test whether any of the remaining devices return true for the CL_DEVICE_HOST_UNIFIED_MEMORY property. These will be integrated GPUs, and these are usually slower than discrete ones, unless you are bound by data transfer speeds, in which case they might be faster. So you'll want to prefer one type over the other.
  • If you're still left with more than one device after that, you can apply your existing heuristic.

This code will return the device with the most floating-point performance

select_device_with_most_flops(find_devices());

and this the device with the most memory

select_device_with_most_memory(find_devices());

First, find_devices() returns a vector of all OpenCL devices in the system. select_device_with_most_memory() is straightforward: it picks the device with the largest getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>().

Floating-point performance is given by this equation: FLOPs/s = cores/CU * CUs * IPC * clock frequency

select_device_with_most_flops() is more difficult, because OpenCL only provides the number of compute units (CUs) via getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(). For a CPU this is the number of threads; for a GPU it has to be multiplied by the number of stream processors / CUDA cores per CU, which differs between Nvidia, AMD and Intel as well as between their microarchitectures, and is usually between 4 and 128. Luckily, the vendor is included in getInfo<CL_DEVICE_VENDOR>(), so based on the vendor and the number of CUs one can figure out the number of cores per CU.

The next part is the FP32 IPC, or instructions per cycle. For most GPUs this is 2, while for recent CPUs it is 32; see https://en.wikipedia.org/wiki/FLOPS?oldformat=true#FLOPs_per_cycle_for_various_processors . There is no way to query the IPC in OpenCL directly, so the 32 for CPUs is just a guess; one could use the device name and a lookup table to be more accurate. getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU is true if the device is a GPU.

The final part is the clock frequency. OpenCL provides the base clock frequency in MHz by getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>() . The device can boost higher frequencies, so this again is an approximation.

All of it together gives an estimation for the floating-point performance. The full code is shown below:

typedef unsigned int uint;
string trim(const string s) { // removes whitespace characters from beginning and end of string s
    const int l = (int)s.length();
    int a=0, b=l-1;
    char c;
    while(a<l && ((c=s.at(a))==' '||c=='\t'||c=='\n'||c=='\v'||c=='\f'||c=='\r'||c=='\0')) a++;
    while(b>a && ((c=s.at(b))==' '||c=='\t'||c=='\n'||c=='\v'||c=='\f'||c=='\r'||c=='\0')) b--;
    return s.substr(a, 1+b-a);
}
bool contains(const string s, const string match) {
    return s.find(match)!=string::npos;
}
vector<Device> find_devices() {
    vector<Platform> platforms; // get all platforms (drivers)
    vector<Device> devices_available;
    vector<Device> devices; // get all devices of all platforms
    Platform::get(&platforms);
    if(platforms.size()==0) print_error("There are no OpenCL devices available. Make sure that the OpenCL 1.2 Runtime for your device is installed. For GPUs it comes by default with the graphics driver, for CPUs it has to be installed separately.");
    for(uint i=0; i<(uint)platforms.size(); i++) {
        devices_available.clear();
        platforms[i].getDevices(CL_DEVICE_TYPE_ALL, &devices_available); // CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU
        if(devices_available.size()==0) continue; // no device of type device_type found in platform i
        for(uint j=0; j<(uint)devices_available.size(); j++) devices.push_back(devices_available[j]);
    }
    print_device_list(devices);
    return devices;
}
Device select_device_with_most_flops(const vector<Device> devices) { // return device with best floating-point performance
    float best_value = 0.0f;
    uint best_i = 0; // index of fastest device
    for(uint i=0; i<(uint)devices.size(); i++) { // find device with highest (estimated) floating point performance
        const Device d = devices[i];
        //const string device_name = trim(d.getInfo<CL_DEVICE_NAME>());
        const string device_vendor = trim(d.getInfo<CL_DEVICE_VENDOR>()); // is either Nvidia, AMD or Intel
        const uint device_compute_units = (uint)d.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(); // compute units (CUs) can contain multiple cores depending on the microarchitecture
        const bool device_is_gpu = d.getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU;
        const uint device_ipc = device_is_gpu?2u:32u; // IPC (instructions per cycle) is 2 for GPUs and 32 for most modern CPUs
        const uint nvidia = (uint)(contains(device_vendor, "NVIDIA")||contains(device_vendor, "vidia"))*(device_compute_units<=30u?128u:64u); // Nvidia GPUs usually have 128 cores/CU, except Volta/Turing (>30 CUs) which have 64 cores/CU
        const uint amd = (uint)(contains(device_vendor, "AMD")||contains(device_vendor, "ADVANCED")||contains(device_vendor, "dvanced"))*(device_is_gpu?64u:1u); // AMD GCN GPUs usually have 64 cores/CU, AMD CPUs have 1 core/CU
        const uint intel = (uint)(contains(device_vendor, "INTEL")||contains(device_vendor, "ntel"))*(device_is_gpu?8u:1u); // Intel integrated GPUs usually have 8 cores/CU, Intel CPUs have 1 core/CU
        const uint device_cores = device_compute_units*(nvidia+amd+intel);
        const uint device_clock_frequency = (uint)d.getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(); // in MHz
        const float device_tflops = 1E-6f*(float)device_cores*(float)device_ipc*(float)device_clock_frequency; // estimated device floating point performance in TeraFLOPs/s
        if(device_tflops>best_value) { // device_memory>best_value
            best_value = device_tflops; // best_value = device_memory;
            best_i = i; // find index of fastest device
        }
    }
    return devices[best_i];
}
Device select_device_with_most_memory(const vector<Device> devices) { // return device with largest memory capacity
    float best_value = 0.0f;
    uint best_i = 0; // index of device with most memory
    for(uint i=0; i<(uint)devices.size(); i++) { // find device with largest memory capacity
        const Device d = devices[i];
        const float device_memory = 1E-3f*(float)(d.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>()/1048576ull); // in GB
        if(device_memory>best_value) {
            best_value = device_memory;
            best_i = i; // remember index of device with most memory
        }
    }
    return devices[best_i];
}
Device select_device_with_id(const vector<Device> devices, const int id) { // return device
    if(id>=0&&id<(int)devices.size()) {
        return devices[id];
    } else {
        print("Your selected device ID ("+to_string(id)+") is out of range.");
        return devices[0]; // fall back to first device
    }
}

UPDATE: I have now included an improved version of this in a lightweight OpenCL-Wrapper. This correctly calculates the FLOPs for all CPUs and GPUs from the last decade or so: https://github.com/ProjectPhysX/OpenCL-Wrapper
