
How could I parallelize the following loop using C++ AMP?

I have the following loop in C++:

dword result = 0;
for ( int i = 0; i < 16; i++ ) {
    result |= ( value[i] << (unsigned int)( i << 1 ) );
}

I would like to parallelize it with C++ AMP. I know it might run slower than the non-parallelized version above, but I want to do it to learn more about AMP.

My idea was to loop through the value array in parallel:

[Image: diagram of the tree-structured OR reduction]

and fill a new array with newarray[0] = value[0] << (unsigned int)(0 << 1), newarray[1] = value[1] << (unsigned int)(1 << 1), and so on. Then I would OR the values in that array together in parallel, in a tree structure (see image).
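The two phases described above can be sketched in plain C++ before any AMP is involved. This is only an illustration of the idea, not AMP code: the function names pack_serial and pack_tree are made up for this sketch, and std::uint32_t stands in for dword. The inner loops are written serially, but within each pass every iteration touches different slots, which is what would let each pass become one parallel launch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Serial reference: the loop from the question.
std::uint32_t pack_serial(const std::array<std::uint32_t, 16>& value) {
    std::uint32_t result = 0;
    for (std::size_t i = 0; i < value.size(); ++i) {
        result |= value[i] << (i << 1);
    }
    return result;
}

// Two-phase version: independent shifts, then a pairwise OR tree.
std::uint32_t pack_tree(const std::array<std::uint32_t, 16>& value) {
    std::array<std::uint32_t, 16> newarray{};

    // Phase 1 (map): each iteration writes only its own slot.
    for (std::size_t i = 0; i < newarray.size(); ++i) {
        newarray[i] = value[i] << (i << 1);
    }

    // Phase 2 (reduce): 8, 4, 2, then 1 active slots.
    // In one pass, slot i reads slot i + stride and writes slot i,
    // so no two iterations of the inner loop touch the same element.
    for (std::size_t stride = newarray.size() / 2; stride > 0; stride /= 2) {
        for (std::size_t i = 0; i < stride; ++i) {
            newarray[i] |= newarray[i + stride];
        }
    }
    return newarray[0];
}
```

For 16 elements the reduce phase takes 4 passes instead of 16 sequential ORs. Both functions should return the same value for any input, which makes the serial loop a convenient check for a parallel version.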

I have tried to put this idea into some simple C++ AMP code, but I have not succeeded, so any help would be appreciated.

Thank you for your consideration of this matter; I look forward to a response.

The following code is part of what I think you need. It takes a number of elements as input and prepares the vector on the CPU, then performs the bit-shift operations in parallel on the GPU. Then I set av[elements] back to 0 because I am using that element to store your final result. It's rough, but AMP is pretty restrictive about what data types can be processed on the GPU, so I just use an extra element of the existing array for it.

After the bit shifting is done, I run another parallel_for_each for the bitwise OR. This one also happens on the GPU, but it is less satisfactory because every operation ORs a given element of the array into the same av[elements] element, which creates a bottleneck. Your tree structure would make this part run much more quickly, but I was unable to figure out how to do that part easily. As it is, this program can process 100 million elements in a couple of seconds on a fairly old computer. Apologies in advance for any best-practice violations in the code; I am a novice as well. The code follows:

#include <conio.h>
#include <amp.h>
#include <iostream>
#include <vector>

using namespace concurrency;
using namespace std;

unsigned int doParallel(unsigned int);

unsigned int elements;

int main()
{
    int ch=0;
    cout<<"\nHow many elements to populate: ";
    cin>>elements;
    cout<<"The result is: "<<doParallel(elements);
    cout<<"\nPress 'X' to exit.";
    do
    {
        ch=_getch();
    } while (ch!='X' && ch!='x');
    return 0;
}

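unsigned int doParallel(unsigned int elements)
{
    // One extra slot at index 'elements' holds the final result.
    vector<unsigned int> v(elements+1);

    for (unsigned int i = 0; i<elements+1;i++)
    {
        v[i]=i;
    }
    array_view<unsigned int,1> av(elements+1,v);

    // Shift every element left by one bit on the accelerator.
    // (The question's loop shifts element i by i << 1 instead.)
    parallel_for_each(av.extent,[=](index<1> idx)
        restrict(amp)
        {
            av[idx] = av[idx]<<1;
        });
    av[elements]=0;
    // OR every element into the shared result slot. Many threads update
    // the same location, so use an atomic OR rather than a plain |=.
    parallel_for_each(av.extent,[=](index<1> idx)
        restrict(amp)
        {
            atomic_fetch_or(&av[elements], av[idx]);
        });

    return av[elements];
}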
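For the tree-structured OR that the answer leaves open, here is a rough sketch in standard C++ rather than AMP, so treat it as an outline of the access pattern only. The helper name or_reduce_tree is hypothetical. It works for any element count, not just powers of two, and finishes in about log2(n) passes; the assumption is that each pass could be issued as one parallel_for_each over the active indices, with the result read from element 0 at the end.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: OR-reduce a vector of any length with a pairwise tree.
// The vector is taken by value so the caller's data is left untouched.
unsigned int or_reduce_tree(std::vector<unsigned int> data) {
    if (data.empty()) {
        return 0u;
    }
    const std::size_t n = data.size();

    // Passes use stride 1, 2, 4, ... until the stride covers the whole array.
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        // Within one pass, writes go to multiples of 2 * stride and reads
        // come from those positions plus stride, so iterations are independent.
        for (std::size_t i = 0; i + stride < n; i += 2 * stride) {
            data[i] |= data[i + stride];
        }
    }
    return data[0];
}
```

Because no two iterations in a pass write the same element, this pattern avoids having every thread update a single shared slot.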


 