Parallel version is slower than the sequential one

Question

My program should compute all distinct rotations of words and texts in parallel.

If you do not know what this means, the rotations of "BANANA" are:

BANANA
ANANAB
NANABA
ANABAN
NABANA
ABANAN

(Simply move the first letter to the end.)

vector<string> rotate_sequentiell( string* word )
{
vector<string> all_rotations;

for ( unsigned int i = 0; i < word->size(); i++ )
{
    string rotated = word->substr( i ) + word->substr( 0,i );
    all_rotations.push_back( rotated );
}

if ( verbose ) { printVec(&all_rotations, "Rotations"); }


return all_rotations;
}

We should be able to make this parallel. Instead of moving just one letter to the end, I want to move two letters at once. For example, taking "BA" from BANANA and moving it to the end gives NANABA, which is the third entry in the list above.

I implemented it like this:

vector<string> rotate_parallel( string* word )
{
vector<string> all_rotations( word->size() );

#pragma omp parallel for
for ( unsigned int i = 0; i < word->size(); i++ )
{
    string rotated = word->substr( i ) + word->substr( 0,i );
    all_rotations[i] = rotated;
}

if ( verbose ) { printVec(&all_rotations, "Rotations"); }

return all_rotations;
}

I preallocate the vector with the number of possible rotations and use #pragma omp parallel for, so it should do what I think it does.

To test these functions, I have a 40 KB text file that is meant to be "rotated". I want all the distinct rotations of a large text.

What happens now is that the sequential procedure takes about 4.3 seconds and the parallel one takes about 6.5 seconds.

Why is that? What am I doing wrong?

This is how I measure time:

clock_t start, finish;
start = clock();
bwt_encode_parallel( &glob_word, &seperator );
finish = clock();
cout << "Time (seconds): "
     << ((double)(finish - start))/CLOCKS_PER_SEC;

I compile my code with:

g++ -O3 -g -Wall -lboost_regex -fopenmp -fmessage-length=0
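
One thing worth checking in the timing code above: `clock()` reports processor time used by the whole process, and on Linux with glibc that is summed over all threads, so a parallel region can look slower even when it finishes sooner. A minimal sketch of timing by wall clock instead, using `std::chrono::steady_clock`. The names `elapsed_seconds` and `all_rotations_of` are hypothetical helpers for illustration, not part of the original program:

```cpp
#include <chrono>
#include <string>
#include <vector>

// Hypothetical helper: runs f() once and returns the elapsed wall-clock seconds.
template <typename F>
double elapsed_seconds(F&& f)
{
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto finish = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(finish - start).count();
}

// Hypothetical stand-in for the work being timed: all rotations of a word.
std::vector<std::string> all_rotations_of(const std::string& word)
{
    std::vector<std::string> out(word.size());
    #pragma omp parallel for
    for (int i = 0; i < (int)word.size(); i++) {
        out[i] = word.substr(i) + word.substr(0, i);
    }
    return out;
}
```

Calling `elapsed_seconds([&] { result = all_rotations_of(word); })` then gives a number that can be compared directly between the sequential and parallel runs. `omp_get_wtime()` from `<omp.h>` would serve the same purpose when compiling with `-fopenmp`.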

Answer 1

The parallel version has 2 sources of additional work compared to the sequential version: (1) the overhead of starting the threads, and (2) coordination and locking between the threads.

The impact of (1) should diminish as the data set grows, and it is probably not worth 2 seconds anyway, but it does set a lower limit on how small a job it makes sense to parallelize.

(2) is in your case probably caused mostly by OpenMP assigning tasks to the threads, and by the different threads doing memory allocation for the 2 intermediate substrings and the final string "rotated". The memory allocation routine probably has to take a global lock before it can reserve a piece of the heap for you.

Preallocating the final storage in a single thread and telling OpenMP to run the parallel loop in large blocks (2048 iterations per thread) tilts the result in favor of the parallel execution. I get about 700 ms for the single-threaded version and 330 ms for the multithreaded version with the code below:

 enum {SZ = 40960};
 std::string word;
 word.resize(SZ);
 for (int i = 0; i < SZ; i++) {
   word[i] = (i & 127) + 1;  // put stuff into the word
 }
 std::vector<std::string> all_rotations(SZ);
 clock_t start, finish;
 start = clock();
 for (int i = 0; i < (int)word.size(); i++) {
   all_rotations[i].reserve(SZ);
 }
 #pragma omp parallel for schedule (static, 2048)
 for (int i = 0; i < (int)word.size(); i++) {
   std::string rotated = word.substr(i) + word.substr(0, i);
   all_rotations[i] = rotated;
 }
 finish = clock();
 printf("Time (seconds): %0.3lf\n", ((double)(finish - start))/CLOCKS_PER_SEC);
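
The loop above still builds two temporary substrings and a temporary concatenation per iteration. One possible variation, offered as an untested assumption rather than something measured in this answer, writes each rotation directly into its preallocated slot with `std::string::assign` and `std::string::append`. The name `fill_rotations` is hypothetical:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: fills out[i] with the rotation of word that starts at
// offset i, without creating intermediate substr() temporaries.
void fill_rotations(const std::string& word, std::vector<std::string>& out)
{
    const int n = (int)word.size();
    out.resize(n);
    // Reserve in a single thread so the parallel loop should not need to allocate.
    for (int i = 0; i < n; i++) {
        out[i].reserve(n);
    }
    #pragma omp parallel for schedule(static, 2048)
    for (int i = 0; i < n; i++) {
        out[i].assign(word, i, std::string::npos);  // characters word[i..n)
        out[i].append(word, 0, i);                  // characters word[0..i)
    }
}
```

Whether this actually helps depends on the allocator and the standard library in use, so it is worth timing both variants on the real input.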

Last, when you need the result of the Burrows-Wheeler transform, you don't necessarily want N copies of a string that contains N characters. It would save space and processing to treat the string as a ring buffer and read each rotation from a different offset in the buffer.
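
A minimal sketch of that ring-buffer idea, assuming the goal is the last column of the sorted rotations: only the start offsets are stored and sorted, and characters are read from the original string modulo its length. The function names are hypothetical, and the separator handling of the original `bwt_encode_parallel` is not modeled here:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Character k of the rotation starting at offset `start`, read from the
// original string as if it were a ring buffer.
inline char rotation_char(const std::string& word, std::size_t start, std::size_t k)
{
    return word[(start + k) % word.size()];
}

// Hypothetical helper: start offsets of all rotations, ordered so that the
// rotations they denote are in lexicographic order. No rotation is copied.
std::vector<std::size_t> sorted_rotation_offsets(const std::string& word)
{
    const std::size_t n = word.size();
    std::vector<std::size_t> offsets(n);
    std::iota(offsets.begin(), offsets.end(), std::size_t{0});
    std::sort(offsets.begin(), offsets.end(),
              [&](std::size_t a, std::size_t b) {
                  for (std::size_t k = 0; k < n; k++) {
                      const char ca = rotation_char(word, a, k);
                      const char cb = rotation_char(word, b, k);
                      if (ca != cb) return ca < cb;
                  }
                  return false;  // identical rotations compare equal
              });
    return offsets;
}

// Hypothetical helper: last character of each rotation, in sorted order.
std::string last_column(const std::string& word)
{
    std::string out;
    out.reserve(word.size());
    for (std::size_t start : sorted_rotation_offsets(word)) {
        out.push_back(rotation_char(word, start, word.size() - 1));
    }
    return out;
}
```

This keeps memory at N offsets plus one copy of the text instead of N strings of N characters each. The comparison is still character by character, so a dedicated suffix-sorting algorithm would be the next step for large inputs.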

Answer 1 (2 votes, accepted, 2016-01-24 18:48:46)
