
float1 vs float in CUDA

I have noticed that there is a float1 struct type in CUDA. Is there any performance benefit over a simple float, for example, when using a float1 array instead of a float array?

struct __device_builtin__ float1
{
    float x;
};

With float4 there can be a performance benefit, depending on the use case, since its alignment is 4 x 4 bytes = 16 bytes. Is float1 just meant for special usage in __device__ functions that take float1 parameters?

Thanks in advance.

Following @talonmies' comment to the post CUDA Thrust reduction with double2 arrays, I have compared the calculation of the norm of a vector using CUDA Thrust, switching between float and float1. I considered an array of N=1000000 elements on a GT210 card (cc 1.2). The calculation of the norm takes exactly the same time in both cases, namely about 3.4s, so there is no performance improvement. As the code below shows, float is perhaps slightly more convenient to use than float1.

Finally, notice that the advantage of float4 stems from the alignment attribute __builtin__align__, rather than from __device_builtin__.

#include <cstdio>
#include <cmath>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>

struct square
{
    __host__ __device__ float operator()(float x) const
    {
        return x * x;
    }
};

struct square1
{
    __host__ __device__ float operator()(float1 x) const
    {
        return x.x * x.x;
    }
};

int main() {

    const int N = 1000000;

    float time;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    thrust::device_vector<float> d_vec(N,3.f);

    cudaEventRecord(start, 0);
    float reduction = sqrt(thrust::transform_reduce(d_vec.begin(), d_vec.end(), square(), 0.0f, thrust::plus<float>()));
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("Elapsed time reduction:  %3.1f ms \n", time);

    printf("Result of reduction = %f\n",reduction);

    thrust::host_vector<float1>   h_vec1(N);
    for (int i=0; i<N; i++) h_vec1[i].x = 3.f;
    thrust::device_vector<float1> d_vec1=h_vec1;

    cudaEventRecord(start, 0);
    float reduction1 = sqrt(thrust::transform_reduce(d_vec1.begin(), d_vec1.end(), square1(), 0.0f, thrust::plus<float>()));
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("Elapsed time reduction1:  %3.1f ms \n", time);

    printf("Result of reduction1 = %f\n",reduction1);

    getchar();

    return 0;
}
