float1 vs float in CUDA
I have noticed that there is a float1 struct type in CUDA. Is there any performance benefit over a simple float, for example, when using a float array vs a float1 array?
struct __device_builtin__ float1
{
    float x;
};
With float4 there is a performance benefit, depending on the occasion, since its alignment is 4 x 4 bytes = 16 bytes. Is float1 just intended for special usage in __device__ functions that take float1 parameters?
Thanks in advance.
Following @talonmies' comment on the post CUDA Thrust reduction with double2 arrays, I have compared the calculation of the norm of a vector using CUDA Thrust, switching between float and float1. I considered an array of N=1000000 elements on a GT210 card (cc 1.2). The calculation of the norm takes exactly the same time in both cases, namely about 3.4s, so there is no performance improvement. As the code below suggests, float is perhaps slightly more comfortable to use than float1.
Finally, notice that the advantage of float4 stems from the alignment attribute __builtin_align__, rather than from __device_builtin__.
#include <cstdio>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>

struct square
{
    __host__ __device__ float operator()(float x) const
    {
        return x * x;
    }
};

struct square1
{
    __host__ __device__ float operator()(float1 x) const
    {
        return x.x * x.x;
    }
};

int main() {
    const int N = 1000000;
    float time;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // --- Norm via transform_reduce over plain float ---
    thrust::device_vector<float> d_vec(N, 3.f);
    cudaEventRecord(start, 0);
    float reduction = sqrt(thrust::transform_reduce(d_vec.begin(), d_vec.end(), square(), 0.0f, thrust::plus<float>()));
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("Elapsed time reduction: %3.1f ms \n", time);
    printf("Result of reduction = %f\n", reduction);

    // --- Same norm over float1 ---
    thrust::host_vector<float1> h_vec1(N);
    for (int i = 0; i < N; i++) h_vec1[i].x = 3.f;
    thrust::device_vector<float1> d_vec1 = h_vec1;
    cudaEventRecord(start, 0);
    float reduction1 = sqrt(thrust::transform_reduce(d_vec1.begin(), d_vec1.end(), square1(), 0.0f, thrust::plus<float>()));
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("Elapsed time reduction1: %3.1f ms \n", time);
    printf("Result of reduction1 = %f\n", reduction1);

    getchar();
    return 0;
}