Problem in choosing best available GPU using OpenCL programmatically
I'm using the advice given here for choosing an optimal GPU for my algorithm: https://stackoverflow.com/a/33488953/5371117
I query the devices on my MacBook Pro using boost::compute::system::devices(), which returns the following list of devices:
Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
Intel(R) UHD Graphics 630
AMD Radeon Pro 560X Compute Engine
I want to use the AMD Radeon Pro 560X Compute Engine for my purpose, but when I iterate to find the device with the maximum rating = CL_DEVICE_MAX_CLOCK_FREQUENCY * CL_DEVICE_MAX_COMPUTE_UNITS, I get the following results:
Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz,
freq: 2600, compute units: 12, rating:31200
Intel(R) UHD Graphics 630,
freq: 1150, units: 24, rating:27600
AMD Radeon Pro 560X Compute Engine,
freq: 300, units: 16, rating:4800
The AMD GPU has the lowest rating. I also looked into the specs, and it seems to me that CL_DEVICE_MAX_CLOCK_FREQUENCY isn't returning the correct value.
According to the AMD chip specs https://www.amd.com/en/products/graphics/radeon-rx-560x , my AMD GPU has a base frequency of 1175 MHz, not 300 MHz.
According to the Intel chip specs https://en.wikichip.org/wiki/intel/uhd_graphics/630 , my Intel GPU has a base frequency of 300 MHz, not 1150 MHz, but it does have a boost frequency of 1150 MHz.
std::vector<boost::compute::device> devices = boost::compute::system::devices();
std::pair<boost::compute::device, ai::int64> suitableDevice{};
for(auto& device: devices)
{
    // rating = reported clock frequency (MHz) * number of compute units
    auto rating = device.clock_frequency() * device.compute_units();
    std::cout << device.name() << ", freq: " << device.clock_frequency() << ", units: " << device.compute_units() << ", rating:" << rating << std::endl;
    if(suitableDevice.second < rating) // keep the device with the highest rating so far
    {
        suitableDevice.first = device;
        suitableDevice.second = rating;
    }
}
Am I doing anything wrong?
Unfortunately, those properties are only really directly comparable within an implementation (same hardware manufacturer, same OS).
My recommendation would be to:
1. Filter out anything that isn't CL_DEVICE_TYPE_GPU (unless there aren't any GPUs available, in which case you may want to fall back to CPU).
2. Check whether any of the remaining devices return true for the CL_DEVICE_HOST_UNIFIED_MEMORY property. These will be integrated GPUs, and these are usually slower than discrete ones, unless you are bound by data transfer speeds, in which case they might be faster.
This code will return the device with the most floating-point performance:
select_device_with_most_flops(find_devices());
and this returns the device with the most memory:
select_device_with_most_memory(find_devices());
First, find_devices() returns a vector of all OpenCL devices in the system. select_device_with_most_memory() is straightforward and uses getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>().
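A rough sketch of that selection logic, written against a plain struct so that it runs without an OpenCL runtime. DeviceInfo and index_of_most_memory are hypothetical names used only for illustration; in real code the global_mem_bytes field would be filled from getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>(). It returns -1 for an empty list, a case the caller would otherwise have to rule out before indexing.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the properties queried from a cl::Device.
struct DeviceInfo {
    std::string name;
    std::uint64_t global_mem_bytes; // would come from CL_DEVICE_GLOBAL_MEM_SIZE
};

// Returns the index of the device with the most global memory,
// or -1 if the list is empty. On ties the first such device wins.
int index_of_most_memory(const std::vector<DeviceInfo>& devices) {
    if (devices.empty()) return -1;
    const auto it = std::max_element(devices.begin(), devices.end(),
        [](const DeviceInfo& a, const DeviceInfo& b) {
            return a.global_mem_bytes < b.global_mem_bytes;
        });
    return static_cast<int>(it - devices.begin());
}
```

Returning an index rather than a copy of the device keeps the helper independent of the OpenCL types, so the caller can use it to pick from the original vector<Device>.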
Floating-point performance is given by this equation: FLOPs/s = cores/CU * CUs * IPC * clock frequency
select_device_with_most_flops() is more difficult, because OpenCL only provides the number of compute units (CUs) via getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(). For a CPU this is the number of threads; for a GPU it has to be multiplied by the number of stream processors / CUDA cores per CU, which differs between Nvidia, AMD and Intel as well as between their microarchitectures, and is usually between 4 and 128. Luckily, the vendor is included in getInfo<CL_DEVICE_VENDOR>(). So based on the vendor and the number of CUs, one can figure out the number of cores per CU.
The next part is the FP32 IPC, or instructions per clock. For most GPUs this is 2, while for recent CPUs it is 32; see https://en.wikipedia.org/wiki/FLOPS?oldformat=true#FLOPs_per_cycle_for_various_processors . There is no way to figure out the IPC in OpenCL directly, so the 32 for CPUs is just a guess. One could use the device name and a lookup table to be more accurate. getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU evaluates to true if the device is a GPU.
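The lookup-table idea mentioned above could look roughly like the following sketch. fp32_ipc_for is a hypothetical helper name, and the table is passed in by the caller because this answer does not supply verified per-architecture values. Only the fallback defaults (2 for GPUs, 32 for CPUs) come from the text.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: pick an FP32 IPC value based on the device name.
// `table` maps a substring of the device name to an IPC value; the first match wins.
// If nothing matches, fall back to the defaults used in this answer
// (2 for GPUs, 32 for CPUs).
unsigned fp32_ipc_for(const std::string& device_name, bool is_gpu,
                      const std::vector<std::pair<std::string, unsigned>>& table) {
    for (const auto& entry : table) {
        if (device_name.find(entry.first) != std::string::npos) return entry.second;
    }
    return is_gpu ? 2u : 32u;
}
```

With a table such as {{"Core(TM)2", 8u}} (an illustrative entry, not a verified value), a device named "Intel(R) Core(TM)2 Duo CPU" would get 8, while the i7-8850H from the question matches nothing and falls through to the CPU default of 32.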
The final part is the clock frequency. OpenCL provides a clock frequency in MHz via getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>() (specified as the maximum configured clock frequency). Drivers do not all report the same kind of clock here, and the device may boost to higher frequencies, so this again is an approximation.
All of it together gives an estimate of the floating-point performance. The full code is shown below:
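#include <string>
#include <vector>
#include <CL/cl.hpp> // OpenCL C++ bindings (assumed; newer SDKs ship <CL/opencl.hpp> instead)
using namespace std;
using namespace cl;
// print(), print_error() and print_device_list() are helper functions that are not shown here
typedef unsigned int uint;
string trim(const string& s) { // removes whitespace characters from beginning and end of string s
    const int l = (int)s.length();
    int a=0, b=l-1;
    char c;
    while(a<l && ((c=s.at(a))==' '||c=='\t'||c=='\n'||c=='\v'||c=='\f'||c=='\r'||c=='\0')) a++;
    while(b>a && ((c=s.at(b))==' '||c=='\t'||c=='\n'||c=='\v'||c=='\f'||c=='\r'||c=='\0')) b--;
    return s.substr(a, 1+b-a);
}
bool contains(const string& s, const string& match) { // true if match occurs anywhere in s
    return s.find(match)!=string::npos;
}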
vector<Device> find_devices() {
    vector<Platform> platforms; // get all platforms (drivers)
    vector<Device> devices_available;
    vector<Device> devices; // get all devices of all platforms
    Platform::get(&platforms);
    if(platforms.size()==0) print_error("There are no OpenCL devices available. Make sure that the OpenCL 1.2 Runtime for your device is installed. For GPUs it comes by default with the graphics driver, for CPUs it has to be installed separately.");
    for(uint i=0; i<(uint)platforms.size(); i++) {
        devices_available.clear();
        platforms[i].getDevices(CL_DEVICE_TYPE_ALL, &devices_available); // alternatively CL_DEVICE_TYPE_CPU or CL_DEVICE_TYPE_GPU
        if(devices_available.size()==0) continue; // no device of the requested type found in platform i
        for(uint j=0; j<(uint)devices_available.size(); j++) devices.push_back(devices_available[j]);
    }
    print_device_list(devices);
    return devices;
}
Device select_device_with_most_flops(const vector<Device>& devices) { // return device with best floating-point performance
    float best_value = 0.0f;
    uint best_i = 0; // index of fastest device
    for(uint i=0; i<(uint)devices.size(); i++) { // find device with highest (estimated) floating point performance
        const Device d = devices[i];
        //const string device_name = trim(d.getInfo<CL_DEVICE_NAME>());
        const string device_vendor = trim(d.getInfo<CL_DEVICE_VENDOR>()); // is either Nvidia, AMD or Intel
        const uint device_compute_units = (uint)d.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>(); // compute units (CUs) can contain multiple cores depending on the microarchitecture
        const bool device_is_gpu = d.getInfo<CL_DEVICE_TYPE>()==CL_DEVICE_TYPE_GPU;
        const uint device_ipc = device_is_gpu?2u:32u; // IPC (instructions per cycle) is 2 for GPUs and 32 for most modern CPUs
        const uint nvidia = (uint)(contains(device_vendor, "NVIDIA")||contains(device_vendor, "vidia"))*(device_compute_units<=30u?128u:64u); // Nvidia GPUs usually have 128 cores/CU, except Volta/Turing (>30 CUs) which have 64 cores/CU
        const uint amd = (uint)(contains(device_vendor, "AMD")||contains(device_vendor, "ADVANCED")||contains(device_vendor, "dvanced"))*(device_is_gpu?64u:1u); // AMD GCN GPUs usually have 64 cores/CU, AMD CPUs have 1 core/CU
        const uint intel = (uint)(contains(device_vendor, "INTEL")||contains(device_vendor, "ntel"))*(device_is_gpu?8u:1u); // Intel integrated GPUs usually have 8 cores/CU, Intel CPUs have 1 core/CU
        const uint device_cores = device_compute_units*(nvidia+amd+intel);
        const uint device_clock_frequency = (uint)d.getInfo<CL_DEVICE_MAX_CLOCK_FREQUENCY>(); // in MHz
        const float device_tflops = 1E-6f*(float)device_cores*(float)device_ipc*(float)device_clock_frequency; // estimated device floating point performance in TeraFLOPs/s
        if(device_tflops>best_value) {
            best_value = device_tflops;
            best_i = i; // find index of fastest device
        }
    }
    return devices[best_i];
}
Device select_device_with_most_memory(const vector<Device>& devices) { // return device with largest memory capacity
    float best_value = 0.0f;
    uint best_i = 0; // index of device with most memory
    for(uint i=0; i<(uint)devices.size(); i++) { // find device with largest memory capacity
        const Device d = devices[i];
        const float device_memory = 1E-3f*(float)(d.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>()/1048576ull); // in GB
        if(device_memory>best_value) {
            best_value = device_memory;
            best_i = i; // remember index of device with most memory
        }
    }
    return devices[best_i];
}
Device select_device_with_id(const vector<Device>& devices, const int id) { // return device with the given index
    if(id>=0&&id<(int)devices.size()) {
        return devices[id];
    } else {
        print("Your selected device ID ("+to_string(id)+") is wrong.");
        return devices[0]; // is never executed, just to avoid compiler warnings
    }
}
UPDATE: I have now included an improved version of this in a lightweight OpenCL-Wrapper. It correctly calculates the FLOPs for all CPUs and GPUs from the last decade or so: https://github.com/ProjectPhysX/OpenCL-Wrapper