Boost.Compute slower than plain CPU?

Question

I've just started playing with Boost.Compute to see how much speed it can bring us, and I wrote a simple program:

#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>   // std::accumulate
#include <cmath>     // std::sqrt
#include <cstdlib>   // rand
#include <boost/foreach.hpp>
#include <boost/compute/core.hpp>
#include <boost/compute/platform.hpp>
#include <boost/compute/algorithm.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/functional/math.hpp>
#include <boost/compute/types/builtin.hpp>
#include <boost/compute/function.hpp>
#include <boost/chrono/include.hpp>

namespace compute = boost::compute;

int main()
{
    // generate random data on the host
    std::vector<float> host_vector(16000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    BOOST_FOREACH (auto const& platform, compute::system::platforms())
    {
        std::cout << "====================" << platform.name() << "====================\n";
        BOOST_FOREACH (auto const& device, platform.devices())
        {
            std::cout << "device: " << device.name() << std::endl;
            compute::context context(device);
            compute::command_queue queue(context, device);
            compute::vector<float> device_vector(host_vector.size(), context);

            // copy data from the host to the device
            compute::copy(
                host_vector.begin(), host_vector.end(), device_vector.begin(), queue
            );

            auto start = boost::chrono::high_resolution_clock::now();
            compute::transform(device_vector.begin(),
                       device_vector.end(),
                       device_vector.begin(),
                       compute::sqrt<float>(), queue);

            auto ans = compute::accumulate(device_vector.begin(), device_vector.end(), 0, queue);
            auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
            std::cout << "ans: " << ans << std::endl;
            std::cout << "time: " << duration.count() << " ms" << std::endl;
            std::cout << "-------------------\n";
        }
    }
    std::cout << "====================plain====================\n";
    auto start = boost::chrono::high_resolution_clock::now();
    std::transform(host_vector.begin(),
                host_vector.end(),
                host_vector.begin(),
                [](float v){ return std::sqrt(v); });

    auto ans = std::accumulate(host_vector.begin(), host_vector.end(), 0);
    auto duration = boost::chrono::duration_cast<boost::chrono::milliseconds>(boost::chrono::high_resolution_clock::now() - start);
    std::cout << "ans: " << ans << std::endl;
    std::cout << "time: " << duration.count() << " ms" << std::endl;

    return 0;
}

And here is the sample output on my machine (Windows 7, 64-bit):

====================Intel(R) OpenCL====================
device: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
ans: 1931421
time: 64 ms
-------------------
device: Intel(R) HD Graphics 4600
ans: 1931421
time: 64 ms
-------------------
====================NVIDIA CUDA====================
device: Quadro K600
ans: 1931421
time: 4 ms
-------------------
====================plain====================
ans: 1931421
time: 0 ms

My question is: why is the plain (non-OpenCL) version faster?

Answer 1

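As others have said, there is most likely not enough computation in your kernel to make it worthwhile to run on the GPU for a single set of data (you are limited by kernel compilation time and by the transfer time to the GPU).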

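To get better performance numbers, you should run the algorithm multiple times (and most likely throw out the first run, since it will be far larger because it includes the time to compile and store the kernels).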
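The advice above can be sketched as a small timing helper. This is only an illustration using the standard library: the names time_runs and median are made up for this example and are not part of Boost.Compute. It runs the workload once as an untimed warm-up (which is where kernel compilation would land), then times the remaining runs in microseconds, since a millisecond clock reports 0 ms for work as small as the plain version in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `work` once as an untimed warm-up, then `runs`
// more times, and returns the per-run wall-clock durations in microseconds,
// sorted in ascending order.
template <typename Work>
std::vector<double> time_runs(Work work, std::size_t runs)
{
    assert(runs >= 1);
    using clock = std::chrono::steady_clock;

    work();  // warm-up run, discarded (would absorb one-time setup costs)

    std::vector<double> samples;
    samples.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        samples.push_back(
            std::chrono::duration<double, std::micro>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples;
}

// Median of an ascending-sorted, non-empty sample vector.
inline double median(const std::vector<double>& sorted)
{
    assert(!sorted.empty());
    const std::size_t n = sorted.size();
    return (n % 2 != 0) ? sorted[n / 2]
                        : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
}
```

To use it, wrap the transform/accumulate pair in a lambda and call something like median(time_runs(work, 20)). When timing device code this way, the lambda should only return once the device has finished (for example by reading the result back or calling queue.finish()); otherwise only the enqueue cost is measured.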

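Also, instead of running transform() and accumulate() as separate operations, you should use the fused transform_reduce() algorithm, which performs both the transform and the reduction with a single kernel. The code would look like this: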

float ans = 0;
compute::transform_reduce(
    device_vector.begin(),
    device_vector.end(),
    &ans,
    compute::sqrt<float>(),
    compute::plus<float>(),
    queue
);
std::cout << "ans: " << ans << std::endl;

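You can also compile code that uses Boost.Compute with -DBOOST_COMPUTE_USE_OFFLINE_CACHE, which enables the offline kernel cache (this requires linking with boost_filesystem). The kernels you use will then be stored in your file system and compiled only the very first time you run your application (NVIDIA on Linux already does this by default).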

Answer 2

I can see one possible reason for the big difference. Compare the CPU and the GPU data flow:

CPU              GPU

                 copy data to GPU

                 set up compute code

calculate sqrt   calculate sqrt

sum              sum

                 copy data from GPU

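Given this, it appears that the Intel chip is just a bit rubbish at general compute, and the NVIDIA card is probably suffering from the extra data copying and from setting up the GPU to do the calculation.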

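You should try the same program with a much more complex operation: sqrt and sum are too simple to overcome the extra overhead of using the GPU. You could try calculating Mandelbrot points, for instance.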
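One rough way to reason about that overhead is a linear cost model. This is an assumption made for illustration, not a measurement: the symbols T_setup (fixed setup and kernel compilation cost), c_copy (per-element transfer cost), c_gpu and c_cpu (per-element compute cost on each side) are introduced here and do not come from the thread.

```latex
\begin{align*}
T_{\mathrm{CPU}}(n) &= c_{\mathrm{cpu}}\, n \\
T_{\mathrm{GPU}}(n) &= T_{\mathrm{setup}} + (c_{\mathrm{copy}} + c_{\mathrm{gpu}})\, n \\
T_{\mathrm{GPU}}(n) < T_{\mathrm{CPU}}(n) &\iff
  n > \frac{T_{\mathrm{setup}}}{c_{\mathrm{cpu}} - c_{\mathrm{copy}} - c_{\mathrm{gpu}}}
  \quad \text{(only if } c_{\mathrm{cpu}} > c_{\mathrm{copy}} + c_{\mathrm{gpu}}\text{)}
\end{align*}
```

Under this model, an operation as cheap as sqrt plus a sum may have a per-element CPU cost below the per-element copy cost, in which case no input size makes the GPU win. A heavier per-element operation raises c_cpu and makes a break-even point possible.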

In your example, moving the lambda into the accumulate would be faster (one pass over memory instead of two).
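A minimal sketch of that single-pass version for the plain CPU path, using only the standard library. The function name sum_of_sqrt is made up for this example.

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Sums sqrt(x) over all elements in a single pass over memory,
// without modifying the input (illustrative helper, not from the question).
inline float sum_of_sqrt(const std::vector<float>& values)
{
    // The initial value is 0.0f, not 0: std::accumulate takes its accumulator
    // type from the initial value, so an int 0 would truncate every partial
    // sum to an integer.
    return std::accumulate(values.begin(), values.end(), 0.0f,
                           [](float acc, float x) { return acc + std::sqrt(x); });
}
```

Calling sum_of_sqrt(host_vector) would replace the std::transform and std::accumulate pair in the question. The result will differ slightly from the question's output because the question's code accumulates into an int.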

Answer 3

You are getting bad results because you are measuring time incorrectly.

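An OpenCL device has its own time counters, which are not related to the host counters. Every OpenCL task has 4 states whose timers can be queried (from the Khronos web site):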

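CL_PROFILING_COMMAND_QUEUED, when the command identified by the event is enqueued in a command-queue by the host.
CL_PROFILING_COMMAND_SUBMIT, when the command identified by the event that has been enqueued is submitted by the host to the device associated with the command-queue.
CL_PROFILING_COMMAND_START, when the command identified by the event starts execution on the device.
CL_PROFILING_COMMAND_END, when the command identified by the event has finished execution on the device.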

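Keep in mind that these timers are device-side. So, to measure kernel and command queue performance, you can query these timers. In your case, the last 2 timers are the ones needed.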
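As a sketch of how the four counters relate to each other: in plain OpenCL each one is read with clGetEventProfilingInfo as a cl_ulong in nanoseconds, on a command queue created with profiling enabled. The code below does not call OpenCL. It assumes the four values have already been read, and the struct and function names are hypothetical, introduced only to show the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical holder for the four device-side timestamps, in nanoseconds,
// as they would be returned for one event.
struct ProfilingTimestamps {
    std::uint64_t queued;  // CL_PROFILING_COMMAND_QUEUED
    std::uint64_t submit;  // CL_PROFILING_COMMAND_SUBMIT
    std::uint64_t start;   // CL_PROFILING_COMMAND_START
    std::uint64_t end;     // CL_PROFILING_COMMAND_END
};

// Intervals derived from the timestamps, in nanoseconds.
struct ProfilingIntervals {
    std::uint64_t queue_wait_ns;   // QUEUED -> SUBMIT: time spent in the host queue
    std::uint64_t launch_wait_ns;  // SUBMIT -> START: time waiting on the device
    std::uint64_t execution_ns;    // START  -> END: actual execution on the device
};

inline ProfilingIntervals to_intervals(const ProfilingTimestamps& t)
{
    // The four states are reached in this order, so the timestamps
    // must be non-decreasing.
    assert(t.queued <= t.submit && t.submit <= t.start && t.start <= t.end);
    return ProfilingIntervals{t.submit - t.queued,
                              t.start - t.submit,
                              t.end - t.start};
}
```

The execution_ns field (END minus START) is the number that reflects kernel performance. The other two intervals are the queueing overhead that a host-side clock lumps together with it.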

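In your sample code, you are measuring host time, which includes the data transfer time (as Skizz said) plus all the time spent on command queue maintenance.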

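So, to learn the actual kernel performance, you need either to pass a cl_event to your kernel (no idea how to do that in boost::compute) and query that event for the performance counters, or to make your kernel really huge and complicated so that it hides all the overhead.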

Answer 1
8 2014-06-19 00:43:53

Answer 2
2 2014-06-18 08:47:51

Answer 3
1 2014-06-18 15:21:52