Mergesort pthread implementation takes the same time as single-threaded

Question

(I have tried to simplify this as much as I could to find out where I'm doing something wrong.)

The idea of the code is that I have a global array *v (I hope using this array isn't slowing things down; the threads should never access the same value because they all work on different ranges), and I try to create 2 threads, each one sorting one half of the array (the first and the second, respectively) by calling the function merge_sort() with the respective parameters.

On the threaded run, I see the process going to 80-100% CPU usage (on a dual-core CPU), while on the run without threads it stays at only 50%, yet the run times are very close.

This is the (relevant) code:

//These are the 2 sorting functions; each thread will call merge_sort(..). Is this a problem, both threads calling the same (normal) function?

void merge (int *v, int start, int middle, int end) {
    //dynamically creates 2 new arrays for the v[start..middle] and v[middle+1..end]
    //copies the original values into the 2 halves
    //then sorts them back into the v array
}

void merge_sort (int *v, int start, int end) {
    //recursively calls merge_sort(start, (start+end)/2) and merge_sort((start+end)/2+1, end) to sort them
    //calls merge(start, middle, end) 
}

//here I'm expecting each thread to be created and to call merge_sort on its specific range (this is a simplified version of the original code, to make the bug easier to find)

void* mergesort_t2(void * arg) {
    t_data* th_info = (t_data*)arg;
    merge_sort(v, th_info->a, th_info->b);
    return (void*)0;
}

//in main I simply create 2 threads calling the above function

int main (int argc, char* argv[])
{
    //some stuff

    //getting the clock to calculate run time
    clock_t t_inceput, t_sfarsit;
    t_inceput = clock();

    //ignore crt_depth for this example (in the full code i'm recursively creating new threads and i need this to know when to stop)
    //the a and b are the range of values the created thread will have to sort
    pthread_t thread[2];
    t_data next_info[2];
    next_info[0].crt_depth = 1;
    next_info[0].a = 0;
    next_info[0].b = n/2;
    next_info[1].crt_depth = 1;
    next_info[1].a = n/2+1;
    next_info[1].b = n-1;

    for (int i=0; i<2; i++) {
        if (pthread_create (&thread[i], NULL, &mergesort_t2, &next_info[i]) != 0) {
            cerr<<"error\n;";
            return err;
        }
    }

    for (int i=0; i<2; i++) {
        if (pthread_join(thread[i], &status) != 0) {
            cerr<<"error\n;";
            return err;
        }
    }

    //now i merge the 2 sorted halves
    merge(v, 0, n/2, n-1);

    //calculate end time
    t_sfarsit = clock();

    cout<<"Sort time (s): "<<double(t_sfarsit - t_inceput)/CLOCKS_PER_SEC<<endl;
    delete [] v;
}

Output (on 1 million values):

Sort time (s): 1.294

Output when calling merge_sort directly, no threads:

Sort time (s): 1.388

Output (on 10 million values):

Sort time (s): 12.75

Output when calling merge_sort directly, no threads:

Sort time (s): 13.838

Solution:

I'd like to thank WhozCraig and Adam too, as they hinted at this from the beginning.

I've used the inplace_merge(..) function instead of my own, and the program run times are now as they should be.

Here's my initial merge function (I'm not really sure it is the initial one; I've probably modified it a few times since. The array indices might also be wrong right now, as I went back and forth between [a,b] and [a,b). This was just the last commented-out version):

void merge (int *v, int a, int m, int c) { //sorts v[a,m] - v[m+1,c] in v[a,c]

    //create the 2 new arrays
    int *st = new int[m-a+1];
    int *dr = new int[c-m+1];
    //copy the values
    for (int i1 = 0; i1 <= m-a; i1++)
        st[i1] = v[a+i1];
    for (int i2 = 0; i2 <= c-(m+1); i2++)
        dr[i2] = v[m+1+i2];

    //merge them back together in sorted order
    int is=0, id=0;
    for (int i=0; i<=c-a; i++)  {
        if (id+m+1 > c || (a+is <= m && st[is] <= dr[id])) {
            v[a+i] = st[is];
            is++;
        }
        else {
            v[a+i] = dr[id];
            id++;
        }
    }
    //arrays allocated with new[] must each be released with delete[]
    delete[] st;
    delete[] dr;
}

All this was replaced with:

inplace_merge(v+a, v+m, v+c);
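For readers adapting that one-liner: std::inplace_merge takes half-open iterator ranges, [first, middle) and [middle, last). The sketch below assumes the inclusive convention from the comment on the original merge (sorted v[a..m] and v[m+1..c]); under that assumption the middle and end pointers each need a +1. The wrapper name merge_inclusive is hypothetical and not from the question. If a, m and c are already half-open bounds, the call shown above is the right form.

```cpp
#include <algorithm>

// Merges the sorted inclusive ranges v[a..m] and v[m+1..c] into sorted v[a..c].
// std::inplace_merge expects [first, middle) and [middle, last), so the
// inclusive bounds m and c are each shifted by one.
void merge_inclusive(int *v, int a, int m, int c) {
    std::inplace_merge(v + a, v + m + 1, v + c + 1);
}
```

With this wrapper the existing call merge(v, 0, n/2, n-1) in main could become merge_inclusive(v, 0, n/2, n-1) without changing any index arithmetic.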

Edit: some times on my 3 GHz dual-core CPU:

1 million values:
1 thread: 7.236 s
2 threads: 4.622 s
4 threads: 4.692 s

10 million values:
1 thread: 82.034 s
2 threads: 46.189 s
4 threads: 47.36 s

Answer 1

Note: since the OP uses Windows, my answer below (which incorrectly assumed Linux) might not apply. I left it for the sake of those who might find the information useful.

clock() is the wrong interface for measuring time on Linux: it measures the CPU time used by the program (see http://linux.die.net/man/3/clock), which in the case of multiple threads is the sum of the CPU time of all threads. You need to measure elapsed, or wall-clock, time. See more details in this SO question: "C: using clock() to measure time in multi-threaded programs", which also tells what API can be used instead of clock().

In the MPI-based implementation that you try to compare with, two different processes are used (that's how MPI typically enables concurrency), and the CPU time of the second process is not included, so the CPU time is close to wall-clock time. Nevertheless, it's still wrong to use CPU time (and so clock()) for performance measurement, even in serial programs: for one thing, if a program waits for, e.g., a network event or a message from another MPI process, it still spends time, but not CPU time.

Update: in Microsoft's implementation of the C run-time library, clock() returns wall-clock time, so it is OK to use for your purpose. It's unclear, though, whether you use Microsoft's toolchain or something else, like Cygwin or MinGW.

Answer 2

There's one thing that struck me: "dynamically creates 2 new arrays[...]". Since both threads will need memory from the system, they need to acquire a lock for that, which could well be your bottleneck. In particular, the idea of doing microscopic array allocations sounds horribly inefficient. Someone suggested an in-place sort that doesn't need any additional storage, which is much better for performance.
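One way to keep the copy-based merge but avoid an allocation on every call is to allocate a single scratch buffer up front and pass it down the recursion. This is a different technique from the in-place sort mentioned above, shown only as a sketch; the function names are hypothetical, and the inclusive [a..c] index convention is assumed from the question's merge.

```cpp
#include <algorithm>
#include <vector>

// Merges sorted v[a..m] and v[m+1..c] (inclusive bounds) via caller-provided scratch.
// Only scratch[a..c] is touched, so calls on disjoint ranges do not interfere.
void merge_with_scratch(int *v, int *scratch, int a, int m, int c) {
    std::copy(v + a, v + c + 1, scratch + a);
    std::merge(scratch + a, scratch + m + 1,      // left half
               scratch + m + 1, scratch + c + 1,  // right half
               v + a);                            // merged output back into v
}

// Recursive merge sort over v[a..c] that performs no allocation of its own.
void merge_sort_scratch(int *v, int *scratch, int a, int c) {
    if (a >= c) return;
    const int m = a + (c - a) / 2;
    merge_sort_scratch(v, scratch, a, m);
    merge_sort_scratch(v, scratch, m + 1, c);
    merge_with_scratch(v, scratch, a, m, c);
}

// Allocates the scratch buffer exactly once, then sorts.
void sort_with_one_buffer(std::vector<int> &data) {
    if (data.empty()) return;
    std::vector<int> scratch(data.size());
    merge_sort_scratch(data.data(), scratch.data(), 0, int(data.size()) - 1);
}
```

Because each call only reads and writes scratch[a..c], two threads sorting disjoint halves could share the same buffer without any locking. Whether the allocator lock is really the bottleneck depends on the run-time library in use, so this is worth measuring rather than assuming.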

Another thing is the often-forgotten opening half-sentence of any big-O complexity statement: "There is an n0 so that for all n>n0...". In other words, maybe you haven't reached n0 yet? I recently saw a video (hopefully someone else will remember it) where some people tried to determine this limit for some algorithms, and their results were that these limits are surprisingly high.
