How to pass a matrix of strings from C++ to a CUDA kernel

Question

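Problem:

I have a matrix in C++ filled with strings, and I want to pass it to a CUDA kernel function. I know that CUDA can't handle strings, so after some research I tried the solutions listed below.

Attempts:

Define an array of pointers in C++ where each cell holds a pointer to chars (for simplicity, tmp[i] is filled with the strings contained in the matrix mentioned above).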

C++ part

  char *tmp[3];
  int text_length, array_length;
  text_length = 4;
  array_length = 3;
  tmp[0] = (char*) malloc(text_length*sizeof(char));
  tmp[1] = (char*) malloc(text_length*sizeof(char));
  tmp[2] = (char*) malloc(text_length*sizeof(char));
  tmp[0] = "some";
  tmp[1] = "rand";
  tmp[2] = "text";
  char *a[3];
  for(int i=0;i<array_length;i++) {
    cudaMalloc((void**) &a[i],text_length*sizeof(char));
    cudaMemcpy(&a[i],&tmp[i],text_length*sizeof(char),cudaMemcpyHostToDevice);
  }
  func<<<blocksPerGrid, threadsPerBlock>>>(a);

CUDA part

  __global__ void func(char* a[]){
    for(int i=0;i<3;i++)
      printf("value[%d] = %s \n",i, a[i]);
  }

Output

  value[0] = (null)
  value[1] = (null)
  value[2] = (null)

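Flatten the matrix of strings into a single char pointer, pass it to the CUDA kernel, and try to retrieve the strings there (the C++ code is again simplified).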

C++ part

  char *a;
  int index[6];
  a = "somerandtext";
  index[0] = 0; // first word start
  index[1] = 3; // first word end
  index[2] = 4; // same as first word
  index[3] = 7;
  index[4] = 8;
  index[5] = 1;
  func<<<blocksPerGrid, threadsPerBlock>>>(a,index);

CUDA part

  __global__ void func(char* a,int index[]){
    int first_word_start = index[0];
    int first_word_end = index[1];
    // print first word
    for(int i=first_word_start;i<=first_word_end;i++)
      printf("%c",a[i]);
  }

Output

  no output produced

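I've tried many other solutions, but none of them worked for me. The problem can also be restated: how can I pass n strings to a CUDA kernel and print (and compare) all of them there (keep in mind that I can't pass 'n' separate variables)?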

Answer 1

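All of the code you've shown is incomplete, and what you're leaving out may be important. You'll make it easier for others to help you if you show complete code. Also, any time you're having trouble with CUDA code, it's good practice to use proper CUDA error checking, which will often point out what isn't working (I suspect it would have helped with your second attempt). Running your code with cuda-memcheck is also frequently instructive.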
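As a minimal sketch of what such error checking can look like: the macro name `CUDA_CHECK`, the kernel `touch`, and the wrapper `run_checked` are hypothetical names chosen for illustration, while `cudaGetErrorString`, `cudaGetLastError`, and `cudaDeviceSynchronize` are standard CUDA runtime calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: report file/line and abort if a runtime call fails.
#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                    \
              cudaGetErrorString(err_), __FILE__, __LINE__);          \
      exit(EXIT_FAILURE);                                             \
    }                                                                 \
  } while (0)

// Trivial kernel used only to have something to launch and check.
__global__ void touch(char *buf, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) buf[i] = 'x';
}

// Returns 0 when allocation, launch, execution, and copy-back all succeed.
// Assumes n >= 1.
int run_checked(int n) {
  char *d_buf = nullptr;
  CUDA_CHECK(cudaMalloc((void **)&d_buf, n));
  touch<<<(n + 127) / 128, 128>>>(d_buf, n);
  CUDA_CHECK(cudaGetLastError());       // reports launch-configuration errors
  CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran
  char first = 0;
  CUDA_CHECK(cudaMemcpy(&first, d_buf, 1, cudaMemcpyDeviceToHost));
  CUDA_CHECK(cudaFree(d_buf));
  return first == 'x' ? 0 : 1;
}
```

In the second attempt from the question, `a` and `index` are host pointers, so the check after the kernel launch would likely have reported an error instead of the program silently printing nothing.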

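In your first attempt, you've run into a classic problem with CUDA and nested pointers (a is a pointer to an array of pointers). This problem occurs pretty much any time a pointer is buried in some other data structure. Copying such a data structure from host to device requires a "deep copy" operation, which has multiple steps. To learn more about this, search for "CUDA 2D array"; there is a canonical answer on that topic, and I have written other answers on it as well.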
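The multi-step deep copy can be sketched as follows for an array of NUL-terminated strings. This is an illustration under stated assumptions, not the code from the question: the names `string_lengths` and `device_string_lengths` are hypothetical, error checking is omitted for brevity, and at least one string is assumed.

```cuda
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Each thread measures one NUL-terminated string through the nested pointer.
__global__ void string_lengths(char **strs, int *lens, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  int len = 0;
  while (strs[i][len] != '\0') ++len;
  lens[i] = len;
}

// Deep-copies n host strings to the device and returns the lengths the
// device computed. Assumes n >= 1; error checking omitted for brevity.
std::vector<int> device_string_lengths(const char *const *h_strs, int n) {
  // Step 1: give every string its own device buffer and remember the
  // device addresses in a HOST-side array.
  std::vector<char *> h_ptrs(n);
  for (int i = 0; i < n; ++i) {
    size_t bytes = strlen(h_strs[i]) + 1;  // include the terminator
    cudaMalloc((void **)&h_ptrs[i], bytes);
    cudaMemcpy(h_ptrs[i], h_strs[i], bytes, cudaMemcpyHostToDevice);
  }

  // Step 2: copy the array of device pointers itself to the device.
  char **d_ptrs = nullptr;
  cudaMalloc((void **)&d_ptrs, n * sizeof(char *));
  cudaMemcpy(d_ptrs, h_ptrs.data(), n * sizeof(char *),
             cudaMemcpyHostToDevice);

  // Step 3: pass the DEVICE copy of the pointer array to the kernel.
  int *d_lens = nullptr;
  cudaMalloc((void **)&d_lens, n * sizeof(int));
  string_lengths<<<(n + 127) / 128, 128>>>(d_ptrs, d_lens, n);

  std::vector<int> lens(n);
  cudaMemcpy(lens.data(), d_lens, n * sizeof(int), cudaMemcpyDeviceToHost);

  for (int i = 0; i < n; ++i) cudaFree(h_ptrs[i]);
  cudaFree(d_ptrs);
  cudaFree(d_lens);
  return lens;
}
```

Compared with the first attempt in the question, the copy destination here is the device buffer itself (`h_ptrs[i]`, not the host address of the pointer), and the kernel receives a pointer array that lives in device memory rather than a host array.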

Also note that with CUDA 6, "deep copies" can be conceptually a lot easier for the programmer if you're able to use Unified Memory.
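A rough sketch of the same nested-pointer layout using Unified Memory is shown below, assuming a device and toolkit that support `cudaMallocManaged`. The names `first_chars` and `managed_first_chars` are hypothetical, error checking is omitted, and at least one non-empty string is assumed.

```cuda
#include <cstring>
#include <string>
#include <cuda_runtime.h>

// Each thread follows one nested pointer and records the string's first char.
__global__ void first_chars(char **strs, char *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = strs[i][0];
}

// Builds the array of strings in managed memory, so the host fills it with
// ordinary writes and the kernel dereferences the very same pointers.
// Assumes n >= 1; error checking omitted for brevity.
std::string managed_first_chars(const char *const *h_strs, int n) {
  char **strs = nullptr;
  char *out = nullptr;
  cudaMallocManaged((void **)&strs, n * sizeof(char *));
  cudaMallocManaged((void **)&out, n);
  for (int i = 0; i < n; ++i) {
    size_t bytes = strlen(h_strs[i]) + 1;
    cudaMallocManaged((void **)&strs[i], bytes);
    memcpy(strs[i], h_strs[i], bytes);  // plain host copy, no cudaMemcpy
  }

  first_chars<<<(n + 127) / 128, 128>>>(strs, out, n);
  cudaDeviceSynchronize();  // finish the kernel before the host reads out

  std::string result(out, n);
  for (int i = 0; i < n; ++i) cudaFree(strs[i]);
  cudaFree(strs);
  cudaFree(out);
  return result;
}
```

The separate host-side pointer array and the explicit per-string `cudaMemcpy` calls are no longer needed, at the cost of one managed allocation per string.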

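Your second attempt appears to be heading down the path of "flattening" your 2D or pointer-to-pointer array of char. That is a typical solution to the "problem" of deep copies, and it leads to less code complexity and probably higher performance. Here's a fully worked example, blending ideas from your first and second attempts, that seems to work for me: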

$ cat t389.cu
#include <stdio.h>

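// Prints every string. String i occupies the half-open range
// [indexes[2*i], indexes[2*i+1]) of the flattened buffer a.
__global__ void func(char* a, int *indexes, int num_strings){

 for(int i=0;i<num_strings;i++){
   printf("string[%d]: ", i);
   for (int j=indexes[2*i]; j < indexes[2*i+1]; j++)
     printf("%c", a[j]);
   printf("\n");
 }
}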

int main(){

 const int num_str = 3;
 const int max_text_length = 12;  // upper bound on the length of any one string

 // String literals are read-only, so point at them directly instead of
 // overwriting (and leaking) malloc'ed buffers.
 const char *tmp[num_str] = {"some text", "rand txt", "text"};

 int stridx[2*num_str];
 int *d_stridx;
 stridx[0] = 0;
 stridx[1] = 9;
 stridx[2] = 9;
 stridx[3] = 17;
 stridx[4] = 17;
 stridx[5] = 21;

 char *a, *d_a;
 a = (char *)malloc(num_str*max_text_length*sizeof(char));
 //flatten
 int subidx = 0;
 for(int i=0;i<num_str;i++)
 {
   for (int j=stridx[2*i]; j<stridx[2*i+1]; j++)
     a[j] = tmp[i][subidx++];
   subidx = 0;
 }

 cudaMalloc((void**)&d_a,num_str*max_text_length*sizeof(char));
 cudaMemcpy(d_a, a,num_str*max_text_length*sizeof(char),cudaMemcpyHostToDevice);
 cudaMalloc((void**)&d_stridx,num_str*2*sizeof(int));
 cudaMemcpy(d_stridx, stridx,2*num_str*sizeof(int),cudaMemcpyHostToDevice);


 func<<<1,1>>>(d_a, d_stridx, num_str);
 cudaDeviceSynchronize();

}
$ nvcc -arch=sm_20 -o t389 t389.cu
$ cuda-memcheck ./t389
========= CUDA-MEMCHECK
string[0]: some text
string[1]: rand txt
string[2]: text
========= ERROR SUMMARY: 0 errors
$
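The start/end offsets in the example above are hard-coded. As a hedged sketch of how they might be derived from the strings themselves, and of how the comparison the question asks about could run with one thread per string, consider the following. The names `flatten`, `equals_ref`, and `compare_to` are hypothetical, error checking is omitted, and at least one non-empty string is assumed.

```cuda
#include <string>
#include <vector>
#include <cuda_runtime.h>

// Thread i sets equal[i] to 1 if string i matches string ref, else 0.
// String k occupies [idx[2*k], idx[2*k+1]) in the flattened buffer.
__global__ void equals_ref(const char *flat, const int *idx, int n, int ref,
                           int *equal) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  int len_i = idx[2 * i + 1] - idx[2 * i];
  int len_r = idx[2 * ref + 1] - idx[2 * ref];
  int same = (len_i == len_r);
  for (int j = 0; same && j < len_i; ++j)
    same = (flat[idx[2 * i] + j] == flat[idx[2 * ref] + j]);
  equal[i] = same;
}

// Concatenates the strings and records a start/end pair for each one.
void flatten(const std::vector<std::string> &strs, std::string &flat,
             std::vector<int> &idx) {
  flat.clear();
  idx.clear();
  for (const std::string &s : strs) {
    idx.push_back((int)flat.size());  // start offset
    flat += s;
    idx.push_back((int)flat.size());  // one past the last character
  }
}

// Compares every string against strs[ref] on the device.
// Assumes a non-empty input with 0 <= ref < strs.size().
std::vector<int> compare_to(const std::vector<std::string> &strs, int ref) {
  std::string flat;
  std::vector<int> idx;
  flatten(strs, flat, idx);
  int n = (int)strs.size();

  char *d_flat = nullptr;
  int *d_idx = nullptr, *d_equal = nullptr;
  cudaMalloc((void **)&d_flat, flat.size());
  cudaMalloc((void **)&d_idx, idx.size() * sizeof(int));
  cudaMalloc((void **)&d_equal, n * sizeof(int));
  cudaMemcpy(d_flat, flat.data(), flat.size(), cudaMemcpyHostToDevice);
  cudaMemcpy(d_idx, idx.data(), idx.size() * sizeof(int),
             cudaMemcpyHostToDevice);

  equals_ref<<<(n + 127) / 128, 128>>>(d_flat, d_idx, n, ref, d_equal);

  std::vector<int> equal(n);
  cudaMemcpy(equal.data(), d_equal, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_flat);
  cudaFree(d_idx);
  cudaFree(d_equal);
  return equal;
}
```

Only three device allocations are needed regardless of how many strings there are, which is the main practical advantage of the flattened layout over a per-string deep copy.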

Solution 1: 2, accepted, 2014-04-17 15:24:04
