
CUDA cuFFT API behavior in concurrent streams

I'm using CUDA 7.0 with an NVIDIA GTX 980 for some image processing. In a particular iteration, multiple tiles are processed independently via 15-20 kernel calls and multiple cuFFT FFT/IFFT API calls.

Because of this, I've placed each tile within its own CUDA stream so each tile executes its string of operations asynchronously with respect to the host. Each tile is the same size within an iteration, so they share a cuFFT plan. The host thread moves through the commands quickly in an attempt to keep the GPU loaded with work. I'm experiencing a periodic race condition while these operations are being processed in parallel, though, and have a question about cuFFT in particular. If I place a cuFFT plan in stream 0 using cufftSetStream() for tile 0, and the FFT for tile 0 hasn't actually been executed on the GPU yet when the host sets the shared cuFFT plan's stream to stream 1 for tile 1 before issuing tile 1's work on the GPU, what is the behavior of cufftExec() for this plan?

More succinctly, does a call to cufftExec() execute in the stream the plan was set to at the time of the cufftExec() call, regardless of whether cufftSetStream() is used to change the stream for subsequent tiles before the previous FFT calls have actually begun or completed?

I apologize for not posting code, but I'm not able to post my actual source.
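Since I can't share the actual source, here is a rough sketch of the per-tile pattern described above: one shared plan whose stream is re-bound with cufftSetStream() just before each tile's FFT is issued. All names (process_tiles, tileData, streams, numTiles) are hypothetical.

// Rough sketch only -- not the actual source; names are hypothetical.
// One cuFFT plan is shared by all tiles; its stream is re-bound before each
// tile's FFT is issued, and the host does not wait for the previous tile's
// work to complete.
#include <cufft.h>
#include <cuda_runtime.h>

void process_tiles(cufftHandle sharedPlan, cufftComplex **tileData,
                   cudaStream_t *streams, int numTiles)
{
  for (int t = 0; t < numTiles; ++t) {
    cufftSetStream(sharedPlan, streams[t]);   // re-bind the shared plan to this tile's stream
    // ... 15-20 per-tile kernels are also launched into streams[t] here ...
    cufftExecC2C(sharedPlan, tileData[t], tileData[t], CUFFT_FORWARD);  // issue FFT; host returns immediately
  }
  // host moves on; all tiles' work proceeds asynchronously on the GPU
}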

EDIT: As pointed out in the comments, if the same plan (same created handle) is used for simultaneous FFT execution on the same device via streams, then the user is responsible for managing separate work areas for each usage of such a plan. The question seemed to have a focus on the stream behavior itself, and my remaining answer focuses on that as well, but this is an important point.
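A minimal sketch of what that work-area management might look like follows, assuming (as with the stream) that the work-area pointer bound at the time of the cufftExec call is the one used for that execution. Auto-allocation is disabled and a distinct work area is attached with cufftSetWorkArea() before each streamed execution; sizes and names here are illustrative only.

// Sketch only: one shared plan used in two streams, with a separate work area
// attached before each execution. Assumes the work area bound at cufftExec
// time is the one used for that execution.
#include <cufft.h>
#include <cuda_runtime.h>
#include <assert.h>

int main(){
  cufftHandle plan;
  size_t ws = 0;
  assert(cufftCreate(&plan) == CUFFT_SUCCESS);
  assert(cufftSetAutoAllocation(plan, 0) == CUFFT_SUCCESS);  // we supply work areas ourselves
  assert(cufftMakePlan1d(plan, 1048576, CUFFT_C2C, 1, &ws) == CUFFT_SUCCESS);

  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);

  void *wa1, *wa2;                                  // one work area per concurrent use of the plan
  cudaMalloc(&wa1, ws);
  cudaMalloc(&wa2, ws);
  cufftComplex *d1, *d2;
  cudaMalloc(&d1, 1048576*sizeof(cufftComplex));
  cudaMalloc(&d2, 1048576*sizeof(cufftComplex));

  cufftSetStream(plan, s1);                         // tile 0: stream s1, work area wa1
  cufftSetWorkArea(plan, wa1);
  cufftExecC2C(plan, d1, d1, CUFFT_FORWARD);

  cufftSetStream(plan, s2);                         // tile 1: stream s2, a *different* work area
  cufftSetWorkArea(plan, wa2);
  cufftExecC2C(plan, d2, d2, CUFFT_FORWARD);

  cudaDeviceSynchronize();
  return 0;
}

With that noted, back to the stream question itself: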

If I place a cuFFT plan in stream 0 using cufftSetStream() for tile 0, and the FFT for tile 0 hasn't actually been executed on the GPU yet when the host sets the shared cuFFT plan's stream to stream 1 for tile 1 before issuing tile 1's work on the GPU, what is the behavior of cufftExec() for this plan?

Let me pretend you said stream 1 and stream 2, just so we can avoid any possible confusion around the NULL stream.

CUFFT should respect the stream that was defined for the plan at the time the plan was passed to CUFFT via cufftExecXXX(). Subsequent changes to the plan via cufftSetStream() should have no effect on the stream used for previously issued cufftExecXXX() calls.

We can verify this with a fairly simple test, using the profiler. Consider the following test code:

$ cat t1089.cu
// NOTE: this code omits independent work-area handling for each plan
// which is necessary for a plan that will be shared between streams
// and executed concurrently
#include <cufft.h>
#include <assert.h>
#include <nvToolsExt.h>

#define DSIZE 1048576
#define BATCH 100

int main(){

  const int nx = DSIZE;
  const int nb = BATCH;
  size_t ws = 0;
  cufftHandle plan;
  cufftResult res = cufftCreate(&plan);
  assert(res == CUFFT_SUCCESS);
  res = cufftMakePlan1d(plan, nx, CUFFT_C2C, nb, &ws);
  assert(res == CUFFT_SUCCESS);
  cufftComplex *d;
  cudaMalloc(&d, nx*nb*sizeof(cufftComplex));
  cudaMemset(d, 0, nx*nb*sizeof(cufftComplex));
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  res = cufftSetStream(plan, s1);
  assert(res == CUFFT_SUCCESS);
  res = cufftExecC2C(plan, d, d, CUFFT_FORWARD);
  assert(res == CUFFT_SUCCESS);
  res = cufftSetStream(plan, s2);
  assert(res == CUFFT_SUCCESS);
  nvtxMarkA("plan stream change");
  res = cufftExecC2C(plan, d, d, CUFFT_FORWARD);
  assert(res == CUFFT_SUCCESS);
  cudaDeviceSynchronize();
  return 0;
}


$ nvcc -o t1089 t1089.cu -lcufft -lnvToolsExt
$ cuda-memcheck ./t1089
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$

We're just doing two forward FFTs in a row, switching streams in between the two. We'll use an nvtx marker to clearly identify the point at which the plan's stream-association change request occurs. Now let's look at the nvprof --print-api-trace output (removing the lengthy start-up preamble):

983.84ms  617.00us  cudaMalloc
984.46ms  21.628us  cudaMemset
984.48ms  37.546us  cudaStreamCreate
984.52ms  121.34us  cudaStreamCreate
984.65ms     995ns  cudaPeekAtLastError
984.67ms     996ns  cudaConfigureCall
984.67ms     517ns  cudaSetupArgument
984.67ms  21.908us  cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [416])
984.69ms     349ns  cudaGetLastError
984.69ms     203ns  cudaPeekAtLastError
984.70ms     296ns  cudaConfigureCall
984.70ms     216ns  cudaSetupArgument
984.70ms  8.8920us  cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [421])
984.71ms     272ns  cudaGetLastError
984.71ms     177ns  cudaPeekAtLastError
984.72ms     314ns  cudaConfigureCall
984.72ms     229ns  cudaSetupArgument
984.72ms  9.9230us  cudaLaunch (void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [426])
984.73ms     295ns  cudaGetLastError
984.77ms         -  [Marker] plan stream change
984.77ms     434ns  cudaPeekAtLastError
984.78ms     357ns  cudaConfigureCall
984.78ms     228ns  cudaSetupArgument
984.78ms  10.642us  cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [431])
984.79ms     287ns  cudaGetLastError
984.79ms     193ns  cudaPeekAtLastError
984.80ms     293ns  cudaConfigureCall
984.80ms     208ns  cudaSetupArgument
984.80ms  7.7620us  cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [436])
984.81ms     297ns  cudaGetLastError
984.81ms     178ns  cudaPeekAtLastError
984.81ms     269ns  cudaConfigureCall
984.81ms     214ns  cudaSetupArgument
984.81ms  7.4130us  cudaLaunch (void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [441])
984.82ms     312ns  cudaGetLastError
984.82ms  152.63ms  cudaDeviceSynchronize
$

We see that each FFT operation requires 3 kernel calls. In between, we see our nvtx marker indicating when the request for a plan stream change was made, and it's no surprise that this takes place after the first 3 kernel launches, but before the last 3. Finally, we note that essentially all of the execution time is absorbed in the final cudaDeviceSynchronize() call. All of the preceding calls are asynchronous and so execute more-or-less "immediately" in the first millisecond of execution. The final synchronize absorbs all the processing time of the 6 kernels, amounting to about 150 milliseconds.

So if the cufftSetStream() call were to have an effect on the first cufftExecC2C() call, we would expect to see some or all of the first 3 kernels launched into the same stream as that used for the last 3 kernels. But when we look at the nvprof --print-gpu-trace output:

$ nvprof --print-gpu-trace ./t1089
==3757== NVPROF is profiling process 3757, command: ./t1089
==3757== Profiling application: ./t1089
==3757== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
974.74ms  7.3440ms                    -               -         -         -         -  800.00MB  106.38GB/s  Quadro 5000 (0)         1         7  [CUDA memset]
982.09ms  23.424ms          (25600 2 1)        (32 8 1)        32  8.0000KB        0B         -           -  Quadro 5000 (0)         1        13  void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [416]
1.00551s  21.172ms          (25600 2 1)        (32 8 1)        32  8.0000KB        0B         -           -  Quadro 5000 (0)         1        13  void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [421]
1.02669s  27.551ms          (25600 1 1)       (16 16 1)        61  17.000KB        0B         -           -  Quadro 5000 (0)         1        13  void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [426]
1.05422s  23.592ms          (25600 2 1)        (32 8 1)        32  8.0000KB        0B         -           -  Quadro 5000 (0)         1        14  void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [431]
1.07781s  21.157ms          (25600 2 1)        (32 8 1)        32  8.0000KB        0B         -           -  Quadro 5000 (0)         1        14  void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [436]
1.09897s  27.913ms          (25600 1 1)       (16 16 1)        61  17.000KB        0B         -           -  Quadro 5000 (0)         1        14  void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [441]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$

we see that in fact the first 3 kernels are issued into the first stream and the last 3 kernels are issued into the second stream, just as requested. (And the total execution time of all kernels is approximately 150 ms, just as suggested by the API trace output.) Since the underlying kernel launches are asynchronous and are issued prior to the return of the cufftExecC2C() call, if you think about this carefully you'll come to the conclusion that it has to be this way. The stream to launch a kernel into is specified at kernel launch time. (And of course I think this is considered "preferred" behavior.)
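The same point can be illustrated in plain CUDA (an analogy, not part of the cuFFT test above): the stream is captured when the kernel is enqueued, so nothing done with stream handles afterwards can retroactively move a launch.

// Analogy in plain CUDA: the stream is bound at launch time, so later
// host-side changes cannot affect a kernel that has already been enqueued.
#include <cuda_runtime.h>

__global__ void dummy(int *p) { if (threadIdx.x == 0) atomicAdd(p, 1); }

int main(){
  int *d;
  cudaMalloc(&d, sizeof(int));
  cudaMemset(d, 0, sizeof(int));
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  dummy<<<1, 32, 0, s1>>>(d);   // enqueued into s1; this association is now fixed
  dummy<<<1, 32, 0, s2>>>(d);   // a subsequent launch independently targets s2
  cudaDeviceSynchronize();
  cudaFree(d);
  return 0;
}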
