
Pthreads unexplained segmentation fault

I implemented a parallel merge sort algorithm from Cormen's well-known text. I wrote it in C using pthreads and compiled it with MinGW on Win7 x64 (also tested later with GCC on Ubuntu, with the same results). My first approach at the parallelization was naive: I spawned a new thread at every recursion level (which is actually what Cormen's pseudocode implies). However, this usually ends up either taking far too long or crashing with a segmentation fault (I assume there is some hard limit on how many threads the system can handle). This seems to be a common newbie mistake for recursive parallelization; in fact, I found a similar discussion on this site. So I instead followed the recommendation in that thread, namely setting a threshold on problem size: if the function that spawns new threads is given a set smaller than the threshold (say 10,000 elements), it operates on the elements directly rather than creating a new thread for such a small set.

Now everything seemed to be working fine. I tabulated some of my results below. N is the problem size (a set of integers [1, 2, 3, ..., N], thoroughly scrambled) and threshold is the value below which my parallel sort and parallel merge functions refuse to spawn new threads. The first table shows sort times in ms; the second shows how many sort/merge worker threads were spawned in each case. Looking at the N=1E6 and N=1E7 rows in the bottom table, you can see that any time I lower the threshold enough to allow more than ~8000 merge workers, I get a segmentation fault. Again, I assume that is due to some limit the system places on threads, and I'd be happy to hear more about that, but it's not my main question.

The main question is why the final row segfaults when trying to use a fairly high threshold, which should have spawned an expected 15/33 worker threads (following the pattern from the previous rows). Surely that is not too many threads for my system to handle. The one instance that did complete (lower-right cell in the table) used about 1.2GB of RAM (my system has 6GB), and the threaded versions never seem to take more RAM than the single-threaded ones (0 threads, at the right of each row).

  • I don't think I am hitting any sort of heap limit... there is plenty of RAM available, and it should only take ~1GB even if it were allowed to spawn the 15/33 threads.
  • I also don't think it is a stack problem. I designed the program to use minimal stack, and I don't think each thread's stack footprint would be related to the problem size N at all, only the heap. I'm pretty inexperienced with this... but I did a core-dump stack backtrace in gdb, and the addresses from the top to the bottom of the stack seem close enough to rule out an overflow there.
  • I tried reading the return values of pthread_create... on Windows I got a value of 11 a few times before the crash (but it didn't seem to trigger the crash, since there were a few 11's, then a few 0's, i.e. no error, then another 11). That error code is EAGAIN, resources unavailable... but I am not sure what it really means here. Moreover, on Ubuntu the error code was 0 every time, right up to the crash. (A sketch of checking and handling this return value follows this list.)
  • I tried Valgrind and got a lot of messages about memory leaks, but I am not sure those are legitimate, since I know Valgrind requires extra resources, and I was able to get those kinds of errors on other problem sizes that worked fine without Valgrind.
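
For reference, here is a minimal sketch (not part of my actual program; sort_half and the fall-back strategy are hypothetical) of how a failing pthread_create could be handled by degrading to a plain recursive call instead of ignoring the error:

static void sort_half(pmergesort_args *args) {
    pthread_t worker;
    int rc = pthread_create(&worker, NULL, p_merge_sort, args);
    if (rc != 0) {                    /* e.g. EAGAIN: no resources for another thread */
        fprintf(stderr, "pthread_create failed (%d); running sequentially\n", rc);
        p_merge_sort(args);           /* run this half in the current thread instead */
    } else {
        pthread_join(worker, NULL);   /* real code would do other work before joining */
    }
}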

It's pretty obvious that it's related to problem size and system resources... I'm hoping there's some piece of general knowledge I'm missing that makes the answer really clear.

Any ideas? Sorry for the long wall of text... thanks if you've read this far! I can post the source if it seems relevant.

[Image: two tables — sort times in ms, and the number of sort/merge worker threads spawned, for each N and threshold]

EDIT: Source added for reference:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>      /* for time(), used to seed rand() */
#include <sys/time.h>
#include <pthread.h>

const int               N = 100000000;
const int  SORT_THRESHOLD = 10000000;
const int MERGE_THRESHOLD = 10000000;

int  sort_thread_count = 0;
int merge_thread_count = 0;

typedef struct s_pmergesort_args {
    int *vals_in, p, r, *vals_out, s;
} pmergesort_args;

typedef struct s_pmerge_args {
    int *temp, p1, r1, p2, r2, *vals_out, p3;
} pmerge_args;

void *p_merge_sort(void *v_pmsa);
void *p_merge(void *v_pma);
int binary_search(int val, int *temp, int p, int r);

int main() {
    int *values, i, rand1, rand2, temp, *sorted;
    long long rand1a, rand1b, rand2a, rand2b;
    struct timeval start, end;

    /* allocate values on heap and initialize */
    values = malloc(N * sizeof(int));
    sorted = malloc(N * sizeof(int));
    for (i = 0; i < N; i++) {
        values[i] = i + 1;
        sorted[i] = 0;
    }

    /* scramble
     *  - complicated logic to maximize swapping
     *  - lots of testing (not shown) was done to verify optimal swapping */
    srand(time(NULL));
    for (i = 0; i < N/10; i++) {
        rand1a = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand1b = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand1 = (int)((rand1a * rand1b + rand()) % N);
        rand2a = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand2b = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand2 = (int)((rand2a * rand2b + rand()) % N);
        temp = values[rand1];
        values[rand1] = values[rand2];
        values[rand2] = temp;
    }

    /* set up args for p_merge_sort */
    pmergesort_args pmsa;
    pmsa.vals_in = values;
    pmsa.p = 0;
    pmsa.r = N-1;
    pmsa.vals_out = sorted;
    pmsa.s = 0;

    /* sort */
    gettimeofday(&start, NULL);
    p_merge_sort(&pmsa);
    gettimeofday(&end, NULL);

    /* verify sorting */
    for (i = 1; i < N; i++) {
        if (sorted[i] < sorted[i-1]) {
            fprintf(stderr, "Error: array is not sorted.\n");
            exit(0);
        }
    }
    printf("Success: array is sorted.\n");
    printf("Sorting took %dms.\n", (int)(((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec))/1000));

    free(values);
    free(sorted);

    printf("(  sort threads created: %d )\n", sort_thread_count);
    printf("( merge threads created: %d )\n", merge_thread_count);

    return 0;
}

void *p_merge_sort(void *v_pmsa) {
    pmergesort_args pmsa = *((pmergesort_args *) v_pmsa);
    int *vals_in = pmsa.vals_in;
    int p = pmsa.p;
    int r = pmsa.r;
    int *vals_out = pmsa.vals_out;
    int s = pmsa.s;

    int n = r - p + 1;
    pthread_t worker;

    if (n > SORT_THRESHOLD) {
        sort_thread_count++;
    }

    if (n == 1) {
        vals_out[s] = vals_in[p];
    } else {
        int *temp = malloc(n * sizeof(int));
        int q = (p + r) / 2;
        int q_ = q - p + 1;

        pmergesort_args pmsa_l;
        pmsa_l.vals_in = vals_in;
        pmsa_l.p = p;
        pmsa_l.r = q;
        pmsa_l.vals_out = temp;
        pmsa_l.s = 0;

        pmergesort_args pmsa_r;
        pmsa_r.vals_in = vals_in;
        pmsa_r.p = q+1;
        pmsa_r.r = r;
        pmsa_r.vals_out = temp;
        pmsa_r.s = q_;

        if (n > SORT_THRESHOLD) {
            pthread_create(&worker, NULL, p_merge_sort, &pmsa_l);
        } else {
            p_merge_sort(&pmsa_l);
        }
        p_merge_sort(&pmsa_r);

        if (n > SORT_THRESHOLD) {
            pthread_join(worker, NULL);
        }

        pmerge_args pma;
        pma.temp = temp;
        pma.p1 = 0;
        pma.r1 = q_ - 1;
        pma.p2 = q_;
        pma.r2 = n - 1;
        pma.vals_out = vals_out;
        pma.p3 = s;
        p_merge(&pma);
        free(temp);
    }
    return NULL;   /* a pthread start routine must return a void* */
}

void *p_merge(void *v_pma) {
    pmerge_args pma = *((pmerge_args *) v_pma);
    int *temp = pma.temp;
    int p1 = pma.p1;
    int r1 = pma.r1;
    int p2 = pma.p2;
    int r2 = pma.r2;
    int *vals_out = pma.vals_out;
    int p3 = pma.p3;

    int n1 = r1 - p1 + 1;
    int n2 = r2 - p2 + 1;
    int q1, q2, q3, t;
    pthread_t worker;

    if (n1 < n2) {
        t = p1; p1 = p2; p2 = t;
        t = r1; r1 = r2; r2 = t;
        t = n1; n1 = n2; n2 = t;
    }
    if (n1 > MERGE_THRESHOLD) {
        merge_thread_count++;
    }

    if (n1 == 0) {
        return NULL;   /* empty range: nothing to merge */
    } else {

        q1 = (p1 + r1) / 2;
        q2 = binary_search(temp[q1], temp, p2, r2);
        q3 = p3 + (q1 - p1) + (q2 - p2);
        vals_out[q3] = temp[q1];

        pmerge_args pma_l;
        pma_l.temp = temp;
        pma_l.p1 = p1;
        pma_l.r1 = q1-1;
        pma_l.p2 = p2;
        pma_l.r2 = q2-1;
        pma_l.vals_out = vals_out;
        pma_l.p3 = p3;

        if (n1 > MERGE_THRESHOLD) {
            pthread_create(&worker, NULL, p_merge, &pma_l);
        } else {        
            p_merge(&pma_l);
        }        

        pmerge_args pma_r;
        pma_r.temp = temp;
        pma_r.p1 = q1+1;
        pma_r.r1 = r1;
        pma_r.p2 = q2;
        pma_r.r2 = r2;
        pma_r.vals_out = vals_out;
        pma_r.p3 = q3+1;

        p_merge(&pma_r);

        if (n1 > MERGE_THRESHOLD) {
            pthread_join(worker, NULL);
        }
    }
    return NULL;
}

int binary_search(int val, int *temp, int p, int r) {
    int low = p;
    int mid;
    int high = (p > r+1)? p : r+1;

    while (low < high) {
        mid = (low + high) / 2;
        if (val <= temp[mid]) {
            high = mid;
        } else {
            low = mid + 1;
        }
    }
    return high;
}

EDIT 2: Added a new image below showing the "max" and "total" RAM used by each version (max meaning the highest simultaneous allocation/usage, total meaning the sum of all allocation requests over the program's life). These suggest that with N=1E8 and threshold=1E7 I should see a max usage of 3.2GB (which my system should be able to support). But again... I guess it is related to some other limitation in the pthread library, not my actual system resources.

[Image: tables of "max" and "total" RAM usage for each N and threshold]

Looks like it is running out of memory. In your example, if the code is run sequentially, then the most memory it has allocated at one time is 1.6GB. When using threads, it is using more than 3GB. I put some wrappers around the malloc/free functions and got this result:

Allocation of 12500000 bytes failed with 3074995884 bytes already allocated.

It's easy to see that memory usage is higher when threaded. In that case, it is sorting both the left and right halves of the overall array simultaneously, allocating two large temp buffers to do it. When run sequentially, the temp buffer for the left half is freed before the right half is sorted.
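
Rough arithmetic (my estimate, assuming 4-byte ints and N=1E8): each p_merge_sort call allocates a temp buffer of n ints, so the deepest chain of simultaneously live sequential allocations is N + N/2 + N/4 + ... ≈ 2N ints ≈ 800MB; add the two N-sized values/sorted arrays (another 800MB) and you arrive at the ~1.6GB sequential peak. With threads, both halves at a given recursion level are in flight at once, so each parallel level keeps roughly another N ints (~400MB) alive, which pushes the total past 3GB after a few levels.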

Here are the wrappers I used:

#include <assert.h>   /* the wrappers below use assert() */

static size_t total_allocated = 0;
static size_t max_allocated = 0;
static pthread_mutex_t total_allocated_mutex = PTHREAD_MUTEX_INITIALIZER;

static void *allocate(int n)
{
  void *result = 0;
  pthread_mutex_lock(&total_allocated_mutex);
  result = malloc(n);
  if (!result) {
    fprintf(stderr,"Allocation of %d bytes failed with %u bytes already allocated\n",n,total_allocated);
  }
  assert(result);
  total_allocated += n;
  if (total_allocated>max_allocated) {
    max_allocated = total_allocated;
  }
  pthread_mutex_unlock(&total_allocated_mutex);
  return result;
}


static void deallocate(void *p,int n)
{
  pthread_mutex_lock(&total_allocated_mutex);
  total_allocated -= n;
  free(p);
  pthread_mutex_unlock(&total_allocated_mutex);
}
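
To hook these in, the malloc/free calls in the sort are swapped for the wrappers; a hypothetical example for the temp buffer in p_merge_sort (the size must be passed to deallocate so the running total stays accurate):

/* in p_merge_sort, replacing the original malloc/free pair: */
int *temp = allocate(n * sizeof(int));   /* was: malloc(n * sizeof(int)) */
/* ... sort and merge into temp ... */
deallocate(temp, n * sizeof(int));       /* was: free(temp) */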

I ran it and got:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 7120.0x14dc]
0x004017df in p_merge (v_pma=0x7882c120) at t.c:177
177             vals_out[q3] = temp[q1];
(gdb) p q3
$1 = 58
(gdb) p vals_out
$2 = (int *) 0x0
(gdb) 

This is a NULL pointer dereference. I would put an assertion after you allocate temp to make sure the allocation succeeded (this requires #include <assert.h>):

    int *temp = malloc(n * sizeof(int));
    assert(temp);

Analyzing your algorithm a bit, it seems you are pre-allocating the memory needed for the merges as you recurse downward. You might want to consider altering the algorithm to do the allocation at the time you actually perform the merge.

But, if I recall correctly, merge sort allocates the second array at the very top of the algorithm, before any merging occurs; then, as the recursive calls unwind, they flip back and forth between the two arrays as the merge runs get longer. That way, there is only one malloc call in the whole algorithm. In addition to using less memory, it performs much better.

My SWAG at modifying your code to use a single temporary array, allocated once at the top of the algorithm, is shown below.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>      /* for time(), used to seed rand() */
#include <sys/time.h>
#include <pthread.h>

const int               N = 100000000;
const int  SORT_THRESHOLD = 10000000;
const int MERGE_THRESHOLD = 10000000;

int  sort_thread_count = 0;
int merge_thread_count = 0;

typedef struct s_pmergesort_args {
    int *vals_in, p, r, *vals_out, s, *temp;
} pmergesort_args;

typedef struct s_pmerge_args {
    int *temp, p1, r1, p2, r2, *vals_out, p3;
} pmerge_args;

void *p_merge_sort(void *v_pmsa);
void *p_merge(void *v_pma);
int binary_search(int val, int *temp, int p, int r);

int main() {
    int *values, i, rand1, rand2, temp, *sorted, *scratch;
    long long rand1a, rand1b, rand2a, rand2b;
    struct timeval start, end;

    /* allocate values on heap and initialize */
    values = malloc(N * sizeof(int));
    sorted = malloc(N * sizeof(int));
    scratch = malloc(N * sizeof(int));
    for (i = 0; i < N; i++) {
        values[i] = i + 1;
        sorted[i] = 0;
    }

    /* scramble
     *  - complicated logic to maximize swapping
     *  - lots of testing (not shown) was done to verify optimal swapping */
    srand(time(NULL));
    for (i = 0; i < N/10; i++) {
        rand1a = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand1b = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand1 = (int)((rand1a * rand1b + rand()) % N);
        rand2a = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand2b = (long long)(N*((double)rand()/(1+(double)RAND_MAX)));
        rand2 = (int)((rand2a * rand2b + rand()) % N);
        temp = values[rand1];
        values[rand1] = values[rand2];
        values[rand2] = temp;
    }

    /* set up args for p_merge_sort */
    pmergesort_args pmsa;
    pmsa.vals_in = values;
    pmsa.p = 0;
    pmsa.r = N-1;
    pmsa.vals_out = sorted;
    pmsa.s = 0;
    pmsa.temp = scratch;

    /* sort */
    gettimeofday(&start, NULL);
    p_merge_sort(&pmsa);
    gettimeofday(&end, NULL);

    /* verify sorting */
    for (i = 1; i < N; i++) {
        if (sorted[i] < sorted[i-1]) {
            fprintf(stderr, "Error: array is not sorted.\n");
            exit(0);
        }
    }
    printf("Success: array is sorted.\n");
    printf("Sorting took %dms.\n", (int)(((end.tv_sec * 1000000 + end.tv_usec) - (start.tv_sec * 1000000 + start.tv_usec))/1000));

    free(values);
    free(sorted);
    free(scratch);

    printf("(  sort threads created: %d )\n", sort_thread_count);
    printf("( merge threads created: %d )\n", merge_thread_count);

    return 0;
}

void *p_merge_sort(void *v_pmsa) {
    pmergesort_args pmsa = *((pmergesort_args *) v_pmsa);
    int *vals_in = pmsa.vals_in;
    int p = pmsa.p;
    int r = pmsa.r;
    int *vals_out = pmsa.vals_out;
    int s = pmsa.s;
    int *scratch = pmsa.temp;

    int n = r - p + 1;
    pthread_t worker;

    if (n > SORT_THRESHOLD) {
        sort_thread_count++;
    }

    if (n == 1) {
        vals_out[s] = vals_in[p];
    } else {
        int q = (p + r) / 2;
        int q_ = q - p + 1;

        pmergesort_args pmsa_l;
        pmsa_l.vals_in = vals_in;
        pmsa_l.p = p;
        pmsa_l.r = q;
        pmsa_l.vals_out = scratch;
        pmsa_l.s = p;
        pmsa_l.temp = vals_out;

        pmergesort_args pmsa_r;
        pmsa_r.vals_in = vals_in;
        pmsa_r.p = q+1;
        pmsa_r.r = r;
        pmsa_r.vals_out = scratch;
        pmsa_r.s = q+1;
        pmsa_r.temp = vals_out;

        if (n > SORT_THRESHOLD) {
            pthread_create(&worker, NULL, p_merge_sort, &pmsa_l);
        } else {
            p_merge_sort(&pmsa_l);
        }
        p_merge_sort(&pmsa_r);

        if (n > SORT_THRESHOLD) {
            pthread_join(worker, NULL);
        }

        pmerge_args pma;
        pma.temp = scratch + p;
        pma.p1 = 0;
        pma.r1 = q_ - 1;
        pma.p2 = q_;
        pma.r2 = n - 1;
        pma.vals_out = vals_out + p;
        pma.p3 = s - p;
        p_merge(&pma);
    }
    return NULL;   /* a pthread start routine must return a void* */
}

void *p_merge(void *v_pma) {
    pmerge_args pma = *((pmerge_args *) v_pma);
    int *temp = pma.temp;
    int p1 = pma.p1;
    int r1 = pma.r1;
    int p2 = pma.p2;
    int r2 = pma.r2;
    int *vals_out = pma.vals_out;
    int p3 = pma.p3;

    int n1 = r1 - p1 + 1;
    int n2 = r2 - p2 + 1;
    int q1, q2, q3, t;
    pthread_t worker;

    if (n1 < n2) {
        t = p1; p1 = p2; p2 = t;
        t = r1; r1 = r2; r2 = t;
        t = n1; n1 = n2; n2 = t;
    }
    if (n1 > MERGE_THRESHOLD) {
        merge_thread_count++;
    }

    if (n1 == 0) {
        return NULL;   /* empty range: nothing to merge */
    } else {

        q1 = (p1 + r1) / 2;
        q2 = binary_search(temp[q1], temp, p2, r2);
        q3 = p3 + (q1 - p1) + (q2 - p2);
        vals_out[q3] = temp[q1];

        pmerge_args pma_l;
        pma_l.temp = temp;
        pma_l.p1 = p1;
        pma_l.r1 = q1-1;
        pma_l.p2 = p2;
        pma_l.r2 = q2-1;
        pma_l.vals_out = vals_out;
        pma_l.p3 = p3;

        if (n1 > MERGE_THRESHOLD) {
            pthread_create(&worker, NULL, p_merge, &pma_l);
        } else {
            p_merge(&pma_l);
        }

        pmerge_args pma_r;
        pma_r.temp = temp;
        pma_r.p1 = q1+1;
        pma_r.r1 = r1;
        pma_r.p2 = q2;
        pma_r.r2 = r2;
        pma_r.vals_out = vals_out;
        pma_r.p3 = q3+1;

        p_merge(&pma_r);

        if (n1 > MERGE_THRESHOLD) {
            pthread_join(worker, NULL);
        }
    }
    return NULL;
}

int binary_search(int val, int *temp, int p, int r) {
    int low = p;
    int mid;
    int high = (p > r+1)? p : r+1;

    while (low < high) {
        mid = (low + high) / 2;
        if (val <= temp[mid]) {
            high = mid;
        } else {
            low = mid + 1;
        }
    }
    return high;
}

You're stressing your system way too much; as a parallelization for speedup, this implementation does not make much sense. Parallelization incurs a cost, and your system as a whole has to do a lot of work when you flood it with threads like that; threads are not free.

In particular, regarding your "problem" that the program crashes if you ask for too many threads: this is entirely your fault. Read the manual page for pthread_create. It states that the function returns a value, and it does so for a reason.

To gain speedup (which is what I suppose you are looking for), you can't expect to use more threads than you have physical cores in your system. Sometimes it is good to have a few more threads than cores (say twice as many), but the overhead the threads create soon outweighs anything you can gain.
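
As a rough illustration (my addition, POSIX-specific; MinGW would need a different query), the number of online cores can be read with sysconf and used to cap the number of worker threads:

#include <stdio.h>
#include <unistd.h>

int main() {
    long cores = sysconf(_SC_NPROCESSORS_ONLN);  /* processors currently online */
    if (cores < 1) cores = 1;                    /* fall back if the query fails */
    printf("cap workers at roughly %ld to %ld threads\n", cores, 2 * cores);
    return 0;
}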

Mergesort is an algorithm that is typically bound by access to RAM, not by the comparisons. RAM access (even streaming access, as in mergesort) is orders of magnitude slower than the CPU. In addition, your memory bus is not a parallel device; the only parallelism you have in memory access is the caches (if any). Blowing up your memory footprint by a factor of two may kill all performance gains. In your code you make things even worse by allocating memory deep down in the individual thread invocations, since allocation itself has a cost and the system has to coordinate those allocations.

To give it another start, first write a recursive mergesort that has decent memory handling and access patterns. Allocate only one large buffer at the top node of the recursion and hand parts of it down to the recursive calls.
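
A minimal sketch of that structure (my own illustration, not code from this thread): the scratch buffer is allocated once and indexed into as the recursion descends.

#include <stdlib.h>
#include <string.h>

/* Sort a[lo..hi]; scratch is a caller-provided buffer of at least the same size. */
static void msort(int *a, int *scratch, int lo, int hi) {
    int mid, i, j, k;
    if (hi - lo < 1) return;
    mid = lo + (hi - lo) / 2;
    msort(a, scratch, lo, mid);
    msort(a, scratch, mid + 1, hi);
    /* merge the two sorted halves into scratch, then copy back */
    i = lo; j = mid + 1; k = lo;
    while (i <= mid && j <= hi)
        scratch[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i <= mid) scratch[k++] = a[i++];
    while (j <= hi)  scratch[k++] = a[j++];
    memcpy(a + lo, scratch + lo, (hi - lo + 1) * sizeof(int));
}

void merge_sort(int *a, int n) {
    int *scratch = malloc(n * sizeof(int));  /* the single allocation for the whole sort */
    if (scratch) {
        msort(a, scratch, 0, n - 1);
        free(scratch);
    }
}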

Create a separate merging routine that merges two sorted buffers into a third. Benchmark it and compute the microseconds per item that your algorithm spends. With the speed of your CPU, you can then compute the number of cycles you waste per sorted item. Read the assembler that your compiler produces for the merge, and if you find it looks too complicated, try to find out how to improve it.
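
For example, a standalone merge with a rough timing harness might look like this (a sketch; the input sizes and the gettimeofday-based timing are my choices):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Merge sorted a[0..na-1] and b[0..nb-1] into out[0..na+nb-1]. */
void merge(const int *a, int na, const int *b, int nb, int *out) {
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];
    while (j < nb) out[k++] = b[j++];
}

int main() {
    int n = 10000000, i;
    int *a = malloc(n * sizeof(int));
    int *b = malloc(n * sizeof(int));
    int *out = malloc(2 * n * sizeof(int));
    struct timeval start, end;
    double us;

    for (i = 0; i < n; i++) { a[i] = 2*i; b[i] = 2*i + 1; }  /* two sorted inputs */

    gettimeofday(&start, NULL);
    merge(a, n, b, n, out);
    gettimeofday(&end, NULL);

    us = (end.tv_sec - start.tv_sec) * 1e6 + (end.tv_usec - start.tv_usec);
    printf("%.4f microseconds per merged item\n", us / (2.0 * n));

    free(a); free(b); free(out);
    return 0;
}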

After that, start adding parallelism to your recursive function.
