Problem selecting the best available GPU programmatically with OpenCL

Question

I am using the advice given here to choose the best GPU for my algorithm: https://stackoverflow.com/a/33488953/5371117

I query the devices on my MacBook Pro with boost::compute::system::devices();, which returns the following list of devices.

Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
Intel(R) UHD Graphics 630
AMD Radeon Pro 560X Compute Engine

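I want to use the AMD Radeon Pro 560X Compute Engine for my purpose, but when I iterate over the devices to find the one with the maximum rating = CL_DEVICE_MAX_CLOCK_FREQUENCY * CL_DEVICE_MAX_COMPUTE_UNITS, I get the following results: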

Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz, 
freq: 2600, compute units: 12, rating:31200

Intel(R) UHD Graphics 630, 
freq: 1150, units: 24, rating:27600

AMD Radeon Pro 560X Compute Engine, 
freq: 300, units: 16, rating:4800

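The AMD GPU gets the lowest rating. I also looked into the specs, and it seems to me that CL_DEVICE_MAX_CLOCK_FREQUENCY is not returning the correct value.

According to the AMD chip specs at https://www.amd.com/en/products/graphics/radeon-rx-560x my AMD GPU has a base frequency of 1175 MHz, not 300 MHz.

According to the Intel chip specs at https://en.wikichip.org/wiki/intel/uhd_graphics/630 my Intel GPU has a base frequency of 300 MHz, not 1150 MHz, but its boost frequency is 1150 MHz.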

std::vector<boost::compute::device> devices = boost::compute::system::devices();

std::pair<boost::compute::device, ai::int64> suitableDevice{};

for(auto& device: devices)
{
    // rating = max clock frequency (MHz) * number of compute units
    auto rating = device.clock_frequency() * device.compute_units();
    std::cout << device.name() << ", freq: " << device.clock_frequency() << ", units: " << device.compute_units() << ", rating:" << rating << std::endl;
    if(suitableDevice.second < rating)
    {
        suitableDevice.first = device;
        suitableDevice.second = rating;
    }
}

Am I doing anything wrong?

Answer 1

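Unfortunately, these properties are only really directly comparable within one implementation (same hardware manufacturer, same OS).

My suggestion would be: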

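First, filter out anything whose device type is not CL_DEVICE_TYPE_GPU (unless there are no GPUs available, in which case you may want to fall back to the CPU).
Check any other device properties that matter to you. For example, if you need support for a specific OpenCL version or extension, or if you need particularly large work-groups or local memory, check all remaining devices and filter out any that cannot run your code.
Test whether any of the remaining devices return true for the CL_DEVICE_HOST_UNIFIED_MEMORY property. These will be integrated GPUs, which are usually slower than discrete GPUs, unless you are limited by data transfer speed, in which case they may be faster. So you will want to prefer one type over the other.
If you still have more than one device after that, you can apply your existing heuristic.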
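
The filtering steps above can be sketched roughly as follows. This is only an illustration under assumptions: DeviceInfo, keep_if_any and pick_device are hypothetical names, and DeviceInfo is a plain-data stand-in for values you would query through OpenCL or Boost.Compute. Step 2 (capability checks) depends on your application and is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for properties queried from an OpenCL device.
struct DeviceInfo {
    std::string name;
    bool is_gpu;            // CL_DEVICE_TYPE == CL_DEVICE_TYPE_GPU
    bool unified_memory;    // CL_DEVICE_HOST_UNIFIED_MEMORY
    unsigned clock_mhz;     // CL_DEVICE_MAX_CLOCK_FREQUENCY
    unsigned compute_units; // CL_DEVICE_MAX_COMPUTE_UNITS
};

// The heuristic from the question: clock frequency * compute units.
unsigned long long rating(const DeviceInfo& d) {
    return static_cast<unsigned long long>(d.clock_mhz) * d.compute_units;
}

// Keeps the devices matching pred; if none match, returns the input unchanged
// so that the next step still has candidates to choose from.
template <typename Pred>
std::vector<DeviceInfo> keep_if_any(const std::vector<DeviceInfo>& in, Pred pred) {
    std::vector<DeviceInfo> out;
    std::copy_if(in.begin(), in.end(), std::back_inserter(out), pred);
    if (out.empty()) return in;
    return out;
}

DeviceInfo pick_device(const std::vector<DeviceInfo>& devices, bool prefer_discrete = true) {
    assert(!devices.empty());
    // Step 1: GPUs only, falling back to everything (i.e. the CPU) if there is no GPU.
    std::vector<DeviceInfo> candidates =
        keep_if_any(devices, [](const DeviceInfo& d) { return d.is_gpu; });
    // Step 3: prefer discrete (no unified memory) or integrated (unified memory) devices.
    candidates = keep_if_any(candidates, [prefer_discrete](const DeviceInfo& d) {
        return d.unified_memory != prefer_discrete;
    });
    // Step 4: apply the existing heuristic to whatever is left.
    return *std::max_element(candidates.begin(), candidates.end(),
        [](const DeviceInfo& a, const DeviceInfo& b) { return rating(a) < rating(b); });
}
```

With the three devices from the question, and assuming the Intel GPU reports unified memory while the AMD GPU does not, pick_device(devices) would return the AMD GPU even though its rating is the lowest. The rating is then only compared between devices of the same kind.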

Answer 2

This code returns the device with the highest floating-point performance:

select_device_with_most_flops(find_devices());

And this returns the device with the most memory:

select_device_with_most_memory(find_devices());

First, find_devices() returns a vector of all OpenCL devices in the system. select_device_with_most_memory() is straightforward and uses getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>().

Floating-point performance is given by this equation: FLOPs/s = cores/CU * CUs * IPC * clock frequency
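
As a quick worked example of this equation, here is a minimal sketch. The function name estimate_tflops is made up for illustration; the clock is taken in MHz, as OpenCL reports it.

```cpp
#include <cassert>

// Estimated peak FP32 throughput from
//   FLOPs/s = cores/CU * CUs * IPC * clock frequency.
// With the clock in MHz the product is in MFLOPs/s; the factor 1e-6 converts it to TFLOPs/s.
double estimate_tflops(unsigned cores_per_cu, unsigned compute_units,
                       unsigned ipc, unsigned clock_mhz) {
    assert(clock_mhz > 0); // a reported clock of 0 would make the estimate meaningless
    return 1e-6 * cores_per_cu * compute_units * ipc * clock_mhz;
}
```

Plugging in the values reported in the question, together with the per-vendor cores/CU and IPC numbers used in the full code further down, gives roughly 0.61 TFLOPs/s for the AMD GPU (64 * 16 * 2 * 300 MHz), 0.44 TFLOPs/s for the Intel GPU (8 * 24 * 2 * 1150 MHz) and 1.0 TFLOPs/s for the CPU (1 * 12 * 32 * 2600 MHz). So the estimate is only as good as the reported clock: with 1175 MHz instead of 300 MHz, the AMD GPU would come out at about 2.4 TFLOPs/s.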

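select_device_with_most_flops() is more difficult, because OpenCL only provides the number of compute units (CUs) via getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(). For a CPU this is the number of threads. For a GPU it has to be multiplied by the number of stream processors / CUDA cores per CU, which differs between Nvidia, AMD and Intel and between their microarchitectures, and is usually between 4 and 128. Luckily, the vendor is included in getInfo<CL_DEVICE_VENDOR>(). So based on the vendor and the number of CUs, the number of cores per CU can be figured out.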

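The next part is the FP32 IPC, or instructions per clock. For most GPUs this is 2, while for recent CPUs it is 32. See https://en.wikipedia.org/wiki/FLOPS?oldformat=true#FLOPs_per_cycle_for_various_processors for a list. There is no way to find out the IPC directly in OpenCL, so the 32 for CPUs is just a guess. One could use the device name and a lookup table to be more accurate. getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU evaluates to true if the device is a GPU.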

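The final part is the clock frequency. OpenCL provides a clock frequency in MHz via getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(), which the specification defines as the maximum configured clock frequency. In practice this is often the base clock and the device can boost to higher frequencies, so this again is an approximation.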

All of this together gives an estimate of the floating-point performance. The full code is shown below:

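#include <CL/cl.hpp> // OpenCL C++ bindings (assumed here; newer SDKs ship CL/opencl.hpp instead)
#include <string>
#include <vector>
using namespace std;
using namespace cl;
// print(), print_error() and print_device_list() are helper functions that are not shown here
typedef unsigned int uint;
string trim(const string s) { // removes whitespace characters from beginning and end of string s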
    const int l = (int)s.length();
    int a=0, b=l-1;
    char c;
    while(a<l && ((c=s.at(a))==' '||c=='\t'||c=='\n'||c=='\v'||c=='\f'||c=='\r'||c=='\0')) a++;
    while(b>a && ((c=s.at(b))==' '||c=='\t'||c=='\n'||c=='\v'||c=='\f'||c=='\r'||c=='\0')) b--;
    return s.substr(a, 1+b-a);
}
bool contains(const string s, const string match) {
    return s.find(match)!=string::npos;
}
vector<Device> find_devices() {
    vector<Platform> platforms; // get all platforms (drivers)
    vector<Device> devices_available;
    vector<Device> devices; // get all devices of all platforms
    Platform::get(&platforms);
    if(platforms.size()==0) print_error("There are no OpenCL devices available. Make sure that the OpenCL 1.2 Runtime for your device is installed. For GPUs it comes by default with the graphics driver, for CPUs it has to be installed separately.");
    for(uint i=0; i<(uint)platforms.size(); i++) {
        devices_available.clear();
        platforms[i].getDevices(CL_DEVICE_TYPE_ALL, &devices_available); // CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU
        if(devices_available.size()==0) continue; // no device of type device_type found in platform i
        for(uint j=0; j<(uint)devices_available.size(); j++) devices.push_back(devices_available[j]);
    }
    print_device_list(devices);
    return devices;
}
Device select_device_with_most_flops(const vector<Device> devices) { // return device with best floating-point performance
    float best_value = 0.0f;
    uint best_i = 0; // index of fastest device
    for(uint i=0; i<(uint)devices.size(); i++) { // find device with highest (estimated) floating point performance
        const Device d = devices[i];
        //const string device_name = trim(d.getInfo<CL_DEVICE_NAME>());
        const string device_vendor = trim(d.getInfo<CL_DEVICE_VENDOR>()); // is either Nvidia, AMD or Intel
        const uint device_compute_units = (uint)d.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(); // compute units (CUs) can contain multiple cores depending on the microarchitecture
        const bool device_is_gpu = d.getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU;
        const uint device_ipc = device_is_gpu?2u:32u; // IPC (instructions per cycle) is 2 for GPUs and 32 for most modern CPUs
        const uint nvidia = (uint)(contains(device_vendor, "NVIDIA")||contains(device_vendor, "vidia"))*(device_compute_units<=30u?128u:64u); // Nvidia GPUs usually have 128 cores/CU, except Volta/Turing (>30 CUs) which have 64 cores/CU
        const uint amd = (uint)(contains(device_vendor, "AMD")||contains(device_vendor, "ADVANCED")||contains(device_vendor, "dvanced"))*(device_is_gpu?64u:1u); // AMD GCN GPUs usually have 64 cores/CU, AMD CPUs have 1 core/CU
        const uint intel = (uint)(contains(device_vendor, "INTEL")||contains(device_vendor, "ntel"))*(device_is_gpu?8u:1u); // Intel integrated GPUs usually have 8 cores/CU, Intel CPUs have 1 core/CU
        const uint device_cores = device_compute_units*(nvidia+amd+intel);
        const uint device_clock_frequency = (uint)d.getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(); // in MHz
        const float device_tflops = 1E-6f*(float)device_cores*(float)device_ipc*(float)device_clock_frequency; // estimated device floating point performance in TeraFLOPs/s
        if(device_tflops>best_value) { // device_memory>best_value
            best_value = device_tflops; // best_value = device_memory;
            best_i = i; // find index of fastest device
        }
    }
    return devices[best_i];
}
Device select_device_with_most_memory(const vector<Device> devices) { // return device with largest memory capacity
    float best_value = 0.0f;
    uint best_i = 0; // index of device with most memory
    for(uint i=0; i<(uint)devices.size(); i++) { // find device with largest memory capacity
        const Device d = devices[i];
        const float device_memory = 1E-3f*(float)(d.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>()/1048576ull); // in GB
        if(device_memory>best_value) {
            best_value = device_memory;
            best_i = i; // remember index of device with most memory
        }
    }
    return devices[best_i];
}
Device select_device_with_id(const vector<Device> devices, const int id) { // return device
    if(id>=0&&id<(int)devices.size()) {
        return devices[id];
    } else {
        print("Your selected device ID ("+to_string(id)+") is wrong.");
        return devices[0]; // is never executed, just to avoid compiler warnings
    }
}

Update: I have now included an improved version of this in a lightweight OpenCL-Wrapper. It correctly calculates the FLOPs for all CPUs and GPUs of the last decade or so: https://github.com/ProjectPhysX/OpenCL-Wrapper
