How do I copy part of a 4D array from host memory to device memory?

Question

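I have a 4-D array flattened into a host array.
I want to copy part of the 4-D array (the red region), as shown in the figure below.

I don't know how to copy a part of an array that is not contiguous in memory.
I only want to copy part of the array because the original array is larger than 10 GB and I need only about 10% of it.
So at first I tried it with for loops, but that took too much time.
Is there a better way?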

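#include <cstdlib>
#include <cuda_runtime.h>

int main(){
    // Full array extents; x is the fastest-varying dimension.
    size_t nx = 100, ny = 200, nz = 300, nch = 400;
    // Sub-region bounds; each end index is exclusive.
    size_t idx_x_beg = 50,   idx_x_end = 100;
    size_t idx_y_beg = 100,  idx_y_end = 200;
    size_t idx_z_beg = 150,  idx_z_end = 300;
    size_t idx_ch_beg = 200, idx_ch_end = 400;
    size_t idx_x_size = idx_x_end - idx_x_beg;
    size_t idx_y_size = idx_y_end - idx_y_beg;
    size_t idx_z_size = idx_z_end - idx_z_beg;
    size_t idx_ch_size = idx_ch_end - idx_ch_beg;

    double *h_4dArray = (double *)malloc(sizeof(double)*nx*ny*nz*nch);
    double *d_4dArray;  // holds only the sub-region, packed contiguously
    cudaMalloc((void**)&d_4dArray, sizeof(double)*idx_x_size*idx_y_size*idx_z_size*idx_ch_size);

    // One cudaMemcpy per contiguous x-row of the sub-region (3,000,000 calls here).
    for (size_t temp_ch = 0; temp_ch < idx_ch_size; temp_ch++) {
        for (size_t temp_z = 0; temp_z < idx_z_size; temp_z++) {
            for (size_t temp_y = 0; temp_y < idx_y_size; temp_y++) {
                cudaMemcpy(d_4dArray + temp_ch*idx_z_size*idx_y_size*idx_x_size + temp_z*idx_y_size*idx_x_size + temp_y*idx_x_size
                         , h_4dArray + (idx_ch_beg + temp_ch)*nz*ny*nx + (idx_z_beg + temp_z)*ny*nx + (idx_y_beg + temp_y)*nx + idx_x_beg
                         , sizeof(double)*idx_x_size, cudaMemcpyHostToDevice);
            }
        }
    }

    cudaFree(d_4dArray);
    free(h_4dArray);
    return 0;
}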

Answer 1

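For copying a subset of an array, CUDA provides cudaMemcpy2D (which can copy a single 2D section of a multidimensional array) and cudaMemcpy3D (which can copy a single 3D section of a multidimensional array). You can find many questions under the cuda tag that show how to use them.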
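As a rough illustration of the cudaMemcpy3D route, the sketch below copies the 3D block of each selected channel with one call and loops over the channels. It is not code from the original answer: the small extents, the sub-region values, and the names h_full and d_sub are assumptions made for the example. With plain linear memory on both sides, the x position, the extent width, and the pitched-pointer width are given in bytes here.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main() {
    // Small illustrative extents (assumed values); x varies fastest.
    const size_t nx = 8, ny = 6, nz = 4, nch = 3;
    // Sub-region [beg, beg + size) in each dimension (assumed values).
    const size_t x_beg = 2, y_beg = 1, z_beg = 1, ch_beg = 1;
    const size_t sx = 4, sy = 3, sz = 2, sch = 2;

    std::vector<double> h_full(nx * ny * nz * nch);
    for (size_t i = 0; i < h_full.size(); ++i) h_full[i] = (double)i;

    const size_t sub_count = sx * sy * sz * sch;
    double *d_sub = nullptr;  // packed sub-region on the device
    if (cudaMalloc((void **)&d_sub, sub_count * sizeof(double)) != cudaSuccess) return 1;

    cudaError_t err = cudaSuccess;
    for (size_t c = 0; c < sch && err == cudaSuccess; ++c) {
        cudaMemcpy3DParms p = {};  // zero-initialize every field
        // Source: the full 3D volume of one channel, rows of nx doubles.
        p.srcPtr = make_cudaPitchedPtr(h_full.data() + (ch_beg + c) * nz * ny * nx,
                                       nx * sizeof(double), nx * sizeof(double), ny);
        p.srcPos = make_cudaPos(x_beg * sizeof(double), y_beg, z_beg);  // x in bytes
        // Destination: the packed block for this channel, rows of sx doubles.
        p.dstPtr = make_cudaPitchedPtr(d_sub + c * sz * sy * sx,
                                       sx * sizeof(double), sx * sizeof(double), sy);
        p.dstPos = make_cudaPos(0, 0, 0);
        p.extent = make_cudaExtent(sx * sizeof(double), sy, sz);  // width in bytes
        p.kind = cudaMemcpyHostToDevice;
        err = cudaMemcpy3D(&p);
    }
    if (err != cudaSuccess) {
        printf("cudaMemcpy3D failed: %s\n", cudaGetErrorString(err));
        cudaFree(d_sub);
        return 1;
    }

    // Copy the packed block back and compare it with the host sub-region.
    std::vector<double> h_sub(sub_count);
    cudaMemcpy(h_sub.data(), d_sub, sub_count * sizeof(double), cudaMemcpyDeviceToHost);
    size_t k = 0, mismatches = 0;
    for (size_t c = 0; c < sch; ++c)
        for (size_t z = 0; z < sz; ++z)
            for (size_t y = 0; y < sy; ++y)
                for (size_t x = 0; x < sx; ++x) {
                    size_t src = (((ch_beg + c) * nz + (z_beg + z)) * ny + (y_beg + y)) * nx + (x_beg + x);
                    if (h_sub[k++] != h_full[src]) ++mismatches;
                }
    printf("mismatches: %zu\n", mismatches);
    cudaFree(d_sub);
    return mismatches == 0 ? 0 : 1;
}
```

The loop over channels is still needed because cudaMemcpy3D describes at most three dimensions per call.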

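These approaches have two problems:

They don't necessarily extend to the 4D case, i.e. you may still need a loop.
The performance of these operations (host<->device transfer speed) is often noticeably lower than that of a cudaMemcpy operation that copies the same total number of bytes.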

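So there is no free lunch here. I believe the best suggestion is to create an additional "contiguous" buffer on the host, copy all the slices into that buffer, and then copy the buffer from host to device in a single cudaMemcpy call. After that, if you still need the 4D representation on the device, you will need to write a device kernel to "scatter" the data for you. Conceptually, that is the reverse of the code you have shown.

Sorry, I'm not going to write all of that code for you. However, I will roughly work through the first part of it (getting everything copied into a single contiguous buffer on the device), using the code you have shown: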

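#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

int main(){
    // Full array extents; x is the fastest-varying dimension.
    size_t nx = 100, ny = 200, nz = 300, nch = 400;
    // Sub-region bounds; each end index is exclusive.
    size_t idx_x_beg = 50,   idx_x_end = 100;
    size_t idx_y_beg = 100,  idx_y_end = 200;
    size_t idx_z_beg = 150,  idx_z_end = 300;
    size_t idx_ch_beg = 200, idx_ch_end = 400;
    size_t idx_x_size = idx_x_end - idx_x_beg;
    size_t idx_y_size = idx_y_end - idx_y_beg;
    size_t idx_z_size = idx_z_end - idx_z_beg;
    size_t idx_ch_size = idx_ch_end - idx_ch_beg;

    double *h_4dArray = (double *)malloc(sizeof(double)*nx*ny*nz*nch);
    double *d_4dArray, *h_temp, *d_temp;
    size_t temp_sz = idx_x_size*idx_y_size*idx_z_size*idx_ch_size;
    h_temp = (double *)malloc(temp_sz*sizeof(double));   // contiguous staging buffer
    cudaMalloc((void**)&d_temp, temp_sz*sizeof(double));
    // Full-size device array; needed only if the 4D layout is required on the device.
    cudaMalloc((void**)&d_4dArray, sizeof(double)*nx*ny*nz*nch);
    size_t size_tr = 0;   // number of elements packed so far
    for (size_t temp_ch = 0; temp_ch < idx_ch_size; temp_ch++) {
        for (size_t temp_z = 0; temp_z < idx_z_size; temp_z++) {
            for (size_t temp_y = 0; temp_y < idx_y_size; temp_y++) {
                // Pack one contiguous x-row of the sub-region into the staging buffer.
                memcpy(h_temp+size_tr
                         , h_4dArray + (idx_ch_beg + temp_ch)*nz*ny*nx + (idx_z_beg + temp_z)*ny*nx + (idx_y_beg + temp_y)*nx + idx_x_beg
                         , sizeof(double)*idx_x_size);
                size_tr += idx_x_size;
            }
        }
    }
    // A single host-to-device transfer of the packed sub-region.
    cudaMemcpy(d_temp, h_temp, temp_sz*sizeof(double), cudaMemcpyHostToDevice);
    // if necessary, put cuda kernel here to scatter data from d_temp to d_4dArray
    cudaFree(d_temp); cudaFree(d_4dArray);
    free(h_temp); free(h_4dArray);
    return 0;
}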

After that, as mentioned, if you need the 4D representation on the device, you will need a CUDA kernel to scatter the data for you.
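One possible shape for that scatter step is sketched below. It is a minimal illustration, not the answerer's code: the Region struct, the scatter4d kernel, and the small extents are hypothetical names and values. It assumes the packed buffer is ordered with x fastest, then y, z, and channel (the order a row-by-row memcpy loop produces), and that a full-size array fits in device memory.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical description of where the sub-region sits inside the full array.
struct Region {
    size_t nx, ny, nz;                   // full extents (x varies fastest)
    size_t x_beg, y_beg, z_beg, ch_beg;  // origin of the sub-region
    size_t sx, sy, sz, sch;              // sub-region sizes
};

// Each thread moves packed elements to their positions in the full 4D array.
__global__ void scatter4d(const double *packed, double *full, Region r) {
    size_t total = r.sx * r.sy * r.sz * r.sch;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t k = (size_t)blockIdx.x * blockDim.x + threadIdx.x; k < total; k += stride) {
        // Decode the packed index into sub-region coordinates.
        size_t x = k % r.sx;
        size_t y = (k / r.sx) % r.sy;
        size_t z = (k / (r.sx * r.sy)) % r.sz;
        size_t c = k / (r.sx * r.sy * r.sz);
        size_t dst = (((r.ch_beg + c) * r.nz + (r.z_beg + z)) * r.ny + (r.y_beg + y)) * r.nx
                     + (r.x_beg + x);
        full[dst] = packed[k];
    }
}

int main() {
    Region r;
    r.nx = 8; r.ny = 6; r.nz = 4;                        // assumed small extents
    r.x_beg = 2; r.y_beg = 1; r.z_beg = 1; r.ch_beg = 1;
    r.sx = 4; r.sy = 3; r.sz = 2; r.sch = 2;
    const size_t nch = 3;
    const size_t total = r.sx * r.sy * r.sz * r.sch;
    const size_t full_count = r.nx * r.ny * r.nz * nch;

    std::vector<double> h_packed(total);
    for (size_t k = 0; k < total; ++k) h_packed[k] = (double)(k + 1);  // nonzero markers

    double *d_packed = nullptr, *d_full = nullptr;
    cudaMalloc((void **)&d_packed, total * sizeof(double));
    cudaMalloc((void **)&d_full, full_count * sizeof(double));
    cudaMemset(d_full, 0, full_count * sizeof(double));  // all-zero bytes are 0.0
    cudaMemcpy(d_packed, h_packed.data(), total * sizeof(double), cudaMemcpyHostToDevice);

    const unsigned int threads = 256;
    const unsigned int blocks = (unsigned int)((total + threads - 1) / threads);
    scatter4d<<<blocks, threads>>>(d_packed, d_full, r);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) { printf("scatter failed: %s\n", cudaGetErrorString(err)); return 1; }

    // Check: inside the region the marker k + 1 is expected, elsewhere 0.
    std::vector<double> h_full(full_count);
    cudaMemcpy(h_full.data(), d_full, full_count * sizeof(double), cudaMemcpyDeviceToHost);
    size_t mismatches = 0;
    for (size_t c = 0; c < nch; ++c)
        for (size_t z = 0; z < r.nz; ++z)
            for (size_t y = 0; y < r.ny; ++y)
                for (size_t x = 0; x < r.nx; ++x) {
                    bool inside = x >= r.x_beg && x < r.x_beg + r.sx && y >= r.y_beg && y < r.y_beg + r.sy
                               && z >= r.z_beg && z < r.z_beg + r.sz && c >= r.ch_beg && c < r.ch_beg + r.sch;
                    double expected = 0.0;
                    if (inside) {
                        size_t k = (((c - r.ch_beg) * r.sz + (z - r.z_beg)) * r.sy + (y - r.y_beg)) * r.sx
                                   + (x - r.x_beg);
                        expected = (double)(k + 1);
                    }
                    if (h_full[((c * r.nz + z) * r.ny + y) * r.nx + x] != expected) ++mismatches;
                }
    printf("mismatches: %zu\n", mismatches);
    cudaFree(d_packed); cudaFree(d_full);
    return mismatches == 0 ? 0 : 1;
}
```

If later kernels can index the packed buffer directly using the sub-region sizes, the scatter step and the full-size device allocation can be skipped altogether.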

Solution 1
2 accepted 2021-04-28 17:35:50
