Newest questions tagged [cub]

CUB device scan with custom scan op fails

I am using CUB::InclusiveScan with a custom, non-commutative binary operator. When defining my Otherwise, my code is nearly identical to th ...

Getting total execution time of all kernels on a CUDA stream

I know how to time the execution of one CUDA kernel using CUDA events, which is great for simple cases. But in the real world, an algorithm is often m ...

cub::DeviceRadixSort fails when specifying end bit

I am using the GPU radix sort algorithm of the CUB library to sort N 32-bit unsigned integers whose values all utilize only k of their 32 bits, starti ...

CUB sum reduction with 2D pitched arrays

I am trying to perform a sum reduction using CUB and 2D arrays of type float/double. Although it works for certain combinations of rows+columns, for r ...

What is the proper way to enable cub in cupy?

I am trying to figure out the proper way to enable CUB in CuPy, but without success so far. I looked into the documentation and I couldn't find anythi ...

Why does this CUDA reduction fail if I use 31 blocks?

The following CUDA code takes a list of labels (0, 1, 2, 3, ...) and finds the sums of the weights of these labels. To accelerate the calculation, I ...

How to use cub::DeviceReduce::ArgMin()

I am confused about how to use cub::DeviceReduce::ArgMin(). Here I copy the code from the documentation of CUB. And the questions ...

What is the usual way to use a modified C++ header-only library in my own open source project?

I want to use a modified C++ header library in my own open source project, but I am not sure what the usual way to do it is. For example, to use the origi ...

How to compile C++ with CUB library?

I am using the CUB device function just like the example here (https://forums.developer.nvidia.com/t/cub-library/37675/2). I was able to compile the . ...

Is there a way to use CUB::BlockScan on oddly sized data arrays?

All the examples perform scans on arrays sized by some multiple of 32. The quickest examples use 256 or more threads with 4 or more elements assigned ...

CUB sort with iterator

I would like to transform values and sort them in one go, like this: However, SortKeys requires raw pointers instead of the iterators. Is it possib ...

dot_product with CUDA_CUB

I have successfully tested the sum reduction (as shown in the above code snippet) with CUDA CUB. I want to perform the inner product of two vectors based ...

CUB reduction using 2D grid of blocks

I'm trying to make a sum using the CUB reduction method. The big problem is: I'm not sure how to return the values of each block to the Host when usi ...

fatal error: cub/cub.cuh: No such file or directory

I am new to CUDA and CUB. I found the following code and tried to compile it, but I had this error: fatal error: cub/cub.cuh: No such file or director ...

Installing CUB in nvidia nsight

I want to use CUB with NVIDIA Nsight. I looked for tutorials on the internet for doing that, but I didn't find anything, even in the official pages of ...

CUB template similar to thrust

The following is Thrust code: Here, thrust::reduce takes the first and last input iterators, and Thrust returns the value back to the CPU (copied t ...

Incorrect results with CUB ReduceByKey when specifying gencode

In one of my projects, I'm seeing some incorrect results when using CUB's DeviceReduce::ReduceByKey. However, using the same inputs/outputs with thrus ...

Using both CUB and Thrust for parallel sum scan

I am trying to do a parallel sum scan on a test vector. I am using both the Thrust and CUB libraries for this purpose. The error I am getting is I could ...

maximum supported size for cub library

Does anyone know what the maximum supported size for cub::scan is? I got a core dump for input sizes over 500 million. I wanted to make sure I'm not do ...

How to sort an array of CUDA vector types

Specifically, how could I sort an array of float3 such that the .x components are the primary sort criterion, the .y components are the secondary sort ...