C++/Debugging (g++ on AIX) Recursive Quicksort Causing Segmentation Faults

I have a program where I need to sort a large number of large numeric distributions. To reduce the time this takes, I am trying to multi-thread it.

I wrote a small, simple abstraction of my program to try to isolate the issue. I believe I am encountering a stack overflow, or hitting the operating system's stack limit, because my test program reproduces the segmentation fault when:

  • The distributions are all the same value (meaning the quicksort degrades to its worst case)
  • Threading is enabled.


#include <boost/bind.hpp>           // for boost::bind
#include <boost/thread/thread.hpp>
#include <iostream>                 // for std::clog, std::cout
#include <vector>
#include <stdlib.h> // for rand()

void swapvals(double *distribution, const size_t &d1, const size_t &d2)
{
    double temp = 0;
    temp = distribution[d2];
    distribution[d2] = distribution[d1];
    distribution[d1] = temp;
    //std::swap(distribution[d1], distribution[d2]);

}

size_t partition(double *distribution,  size_t left, size_t right)
{
        const double pivot = distribution[right];

        while (left < right) {

                while ((left < right) && distribution[left] <= pivot)
                        left++;

                while ((left < right) && distribution[right] > pivot)
                        right--;

                if (left < right)
                {
                        swapvals(distribution, left, right);
                }
        }
        return right;
}

void quickSort(double *distribution, const size_t left, const size_t right)
{
        if (left >= right) {
                return;
        }
        size_t part = partition(distribution, left, right);
        quickSort(distribution, left, part - 1);
        quickSort(distribution, part + 1, right);
}
void processDistribution(double *distributions, const size_t distribution_size)
{

       std::clog << "beginning qsorting." << std::endl;
       quickSort(distributions, 0, distribution_size - 1);
       std::clog << "done qsorting." << std::endl;

}

int main(int argc, char* argv[])
{
    size_t distribution_size = 65000;
    size_t num_distributions = 10;

    std::vector<double *> distributions;

    // Create num_distributions distributions.
    for (int i = 0; i < num_distributions; i++)
    {
        double * new_dist = new double[distribution_size];
        for (int k = 0; k < distribution_size; k++)
        {
            // Works when I have actual numbers in the distributions.
            // Seg faults when all the numbers are the same.
            new_dist[k] =1;
            //new_dist[k] = rand() % 1000 + 1; // uncomment this, and it works.
        }

        distributions.push_back(new_dist);
    }

    // Submit each distribution to a quicksort thread.
    boost::thread_group threads;
    for (std::vector<double *>::const_iterator it=distributions.begin(); it != distributions.end(); ++it)
    {
         // It works when I run processDistribution directly. Segfaults when I run it via threads.
         //processDistribution(*it, distribution_size);
         threads.create_thread(boost::bind(&processDistribution, *it, distribution_size)); 
    }
    threads.join_all();

    // Show the results of the sort for all the distributions.
    for (std::vector<double *>::const_iterator it=distributions.begin(); it != distributions.end(); ++it)
    {
        for (size_t i = 0; i < distribution_size; i++)
        {
            // print first and last 20 results.
            if (i < 20 || i > (distribution_size - 20))
                std::cout << (*it)[i] << ",";
        }
        std::cout << std::endl;
    }

}

GDB analysis of the core file yields:

Error in re-setting breakpoint -1: aix-thread: ptrace (52, 18220265) returned -1 (errno = 3 The process does not exist.)
Error in re-setting breakpoint -1: aix-thread: ptrace (52, 18220265) returned -1 (errno = 3 The process does not exist.)
Error in re-setting breakpoint -2: aix-thread: ptrace (52, 18220265) returned -1 (errno = 3 The process does not exist.)
Error in re-setting breakpoint -3: aix-thread: ptrace (52, 18220265) returned -1 (errno = 3 The process does not exist.)
Core was generated by `testthreads'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000001000056bc in partition (distribution=0x1101d1430, left=0, right=63626) at testthreads.cpp:18

warning: Source file is more recent than executable.
18
(gdb) bt 7
#0  0x00000001000056bc in partition (distribution=0x1101d1430, left=0, right=63626) at testthreads.cpp:18
#1  0x0000000100005834 in quickSort (distribution=0x1101d1430, left=0, right=63626) at testthreads.cpp:42
#2  0x0000000100005850 in quickSort (distribution=0x1101d1430, left=0, right=63627) at testthreads.cpp:43
#3  0x0000000100005850 in quickSort (distribution=0x1101d1430, left=0, right=63628) at testthreads.cpp:43
#4  0x0000000100005850 in quickSort (distribution=0x1101d1430, left=0, right=63629) at testthreads.cpp:43
#5  0x0000000100005850 in quickSort (distribution=0x1101d1430, left=0, right=63630) at testthreads.cpp:43
#6  0x0000000100005850 in quickSort (distribution=0x1101d1430, left=0, right=63631) at testthreads.cpp:43
(More stack frames follow...)
(gdb) frame 0
#0  0x00000001000056bc in partition (distribution=0x1101d1430, left=0, right=63626) at testthreads.cpp:18
18
(gdb) info locals
pivot = 1
(gdb) info args
distribution = 0x1101d1430
left = 0
right = 63626
(gdb)

Also, my actual program deals with many more threads and distributions, and the GDB inspections there often show far more bizarre stack traces that look like memory corruption (notice how partition passes mid = 12119 as d1, but inside the swapvals stack frame d1 comes through as 4568618016):

(gdb) bt 3
#0  0x00000001002aa0b8 in ScenRankReplacer<double>::swapvals (this=0xfffffffffffdfc8, distribution=..., d1=@0x1104c8178: 4568618016, d2=@0x1104c8140: 4568416720, ranking_values=0x1104c81d0,
    r1=@0x1104c8170: 1152921504606838728, r2=@0x1002a16c8: 6917529029728344952) at ScenRankReplacer.h:96
#1  0x00000001002a7120 in ScenRankReplacer<double>::partition (this=0xfffffffffffdfc8, distribution=..., ranking_values=0x11069ae50, left=1, right=24237) at ScenRankReplacer.h:122
#2  0x00000001002a16c8 in ScenRankReplacer<double>::quickSort (this=0xfffffffffffdfc8, distribution=..., ranking_values=0x11069ae50, left=1, right=24237) at ScenRankReplacer.h:91
(More stack frames follow...)
(gdb) frame 1
#1  0x00000001002a7120 in ScenRankReplacer<double>::partition (this=0xfffffffffffdfc8, distribution=..., ranking_values=0x11069ae50, left=1, right=24237) at ScenRankReplacer.h:122
122             swapvals(distribution, mid, left, ranking_values, mid - 1, left - 1);
(gdb) p mid
$1 = 12119
(gdb) p left
$2 = 1

So... my questions:

  1. Am I correct? Am I hitting a stack limit?
  2. How on earth do I ascertain that this is the case (other than the deduction I've done above)? Is there an easy way to detect these? A GDB clue or something?
  3. Why does the threading matter? Do all the threads share the same stack limit?
  4. Most important: How do I get this working?! Is a recursive quicksort on massive datasets just not feasible?

The error occurs at optimization level -O2. Thread model: aix; gcc version 4.8.3 (GCC).

This looks like it could be stack space related. Threading matters because, while all threads have their own stacks, those stacks all share the same memory pool. Stacks will normally grow as needed until they run into memory that is already used, which in this case would likely be the stack from another thread. A single-threaded program won't have that issue and can grow its stack larger. (Also, with multiple threads you're running several sorts at the same time, which requires more stack space in total.)

One way to fix this is to remove the recursion and replace it with loops and local storage. Something like this (uncompiled and untested) code:

#include <utility>
#include <vector>

void quickSort(double *distribution, size_t left, size_t right) {
    // Pending ranges are stored in a heap-backed vector, not on the call stack.
    std::vector<std::pair<size_t, size_t> > ranges;
    for (;;) {
        for (;;) {
            // A range of zero or one element is already sorted.
            if (left >= right)
                break;
            size_t part = partition(distribution, left, right);

            // save range for later to replace the second recursive call
            ranges.push_back(std::make_pair(part + 1, right));

            // set right == part - 1, then loop, to replace the first recursive call
            right = part - 1;
        }
        if (ranges.empty())
            break;

        // Take top off of ranges for the next loop, replacing the second recursive call
        left = ranges.back().first;
        right = ranges.back().second;
        ranges.pop_back();
    }
}

So after a bit more hair pulling I've figured out the answers to my questions.

  1. Am I correct? Am I hitting a stack limit? AND

  2. How on earth do I ascertain that this is the case (other than the deduction I've done above)? Is there an easy way to detect these? A GDB clue or something?

A: Yes. The program was overflowing the stack. I could not find a direct way to confirm this on AIX. However, when I put the code into Visual Studio 2015 on Windows and ran it, the program crashed with a clear "Stack Overflow" error.

I was hoping there was a way to get a clear 'Stack Overflow' error on AIX, similar to the VS result. I could not find a way. Even compiling with -fstack-check didn't give me a Storage Error :(
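Lacking a direct diagnostic, one indirect check is to measure how deep the recursion actually goes. The sketch below is an illustration only, not part of the original program: it copies the partition scheme from the question and adds a hypothetical depth counter (the names partitionQ, DepthProbe, quickSortProbed and deepestRecursionAllEqual are made up for this example). On all-equal input the partition always returns right, so each call peels off a single element and the depth grows linearly with the input size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same partition scheme as in the question, reproduced so the block is self-contained.
static std::size_t partitionQ(double *d, std::size_t left, std::size_t right)
{
    const double pivot = d[right];
    while (left < right) {
        while (left < right && d[left] <= pivot)
            left++;
        while (left < right && d[right] > pivot)
            right--;
        if (left < right) {
            const double tmp = d[left];
            d[left] = d[right];
            d[right] = tmp;
        }
    }
    return right;
}

// Hypothetical instrumentation: tracks how many quicksort calls are live at once.
struct DepthProbe {
    std::size_t current;
    std::size_t deepest;
    DepthProbe() : current(0), deepest(0) {}
};

static void quickSortProbed(double *d, std::size_t left, std::size_t right,
                            DepthProbe &probe)
{
    ++probe.current;
    if (probe.current > probe.deepest)
        probe.deepest = probe.current;
    if (left < right) {
        const std::size_t part = partitionQ(d, left, right);
        quickSortProbed(d, left, part - 1, probe);
        quickSortProbed(d, part + 1, right, probe);
    }
    --probe.current;
}

// Sorts n identical values and reports the deepest nesting of calls seen.
std::size_t deepestRecursionAllEqual(std::size_t n)
{
    assert(n > 0);
    std::vector<double> same(n, 1.0);
    DepthProbe probe;
    quickSortProbed(&same[0], 0, n - 1, probe);
    return probe.deepest;
}
```

For the 65000-element test distributions this means roughly 65000 nested frames. Against the 96 KB stack size quoted below, that would leave about 98304 / 65000 ≈ 1.5 bytes per frame, far less than any real stack frame needs. Keep n small when trying this out, since the probe itself recurses just as deeply as the original.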

  3. Why does the threading matter? Do all the threads share the same stack limit?

A: The default stack size for threads on AIX is surprisingly small!

From this IBM developerWorks blog post:

For a 32-bit compiled application on AIX, the default pthread stacksize is 96 KB; and for a 64-bit compiled application on AIX,

  4. Most important: How do I get this working?! Is a recursive quicksort on massive datasets just not feasible?

I can only conceive of two ways.

A1: The first would be to increase the stack size.

From the IBM Debugging Guidelines for Threads:

The minimum stack size for a thread is 96KB. It is also the default stack size. This number can be retrieved at compilation time using the PTHREAD_STACK_MIN symbolic constant defined in the pthread.h header file.

Note that the maximum stack size is 256MB, the size of a segment. This limit is indicated by the PTHREAD_STACK_MAX symbolic constant in the pthread.h header file.

So one could increase the stack size up to a maximum of 256MB, which is quite a lot.
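For completeness, here is a minimal sketch of what A1 could look like. It uses plain POSIX threads rather than boost::thread_group, which is an assumption made so the example needs nothing beyond pthread.h (newer Boost.Thread releases offer boost::thread::attributes with set_stack_size for the same purpose). The helper names defaultThreadStackSize, runWithStackSize and markDone are hypothetical, and the stack size passed in is whatever you decide your worst-case recursion needs.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>

// Returns the stack size a default-initialized attribute object reports,
// or 0 if the query fails.
std::size_t defaultThreadStackSize()
{
    pthread_attr_t attr;
    if (pthread_attr_init(&attr) != 0)
        return 0;
    std::size_t bytes = 0;
    if (pthread_attr_getstacksize(&attr, &bytes) != 0)
        bytes = 0;
    pthread_attr_destroy(&attr);
    return bytes;
}

// Runs fn(arg) on a new thread with a stack of stack_bytes and waits for it.
// Returns 0 on success, otherwise the first pthread error code encountered.
int runWithStackSize(void *(*fn)(void *), void *arg, std::size_t stack_bytes)
{
    assert(fn != NULL);
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0) {
        pthread_t tid;
        rc = pthread_create(&tid, &attr, fn, arg);
        if (rc == 0)
            rc = pthread_join(tid, NULL);
    }
    pthread_attr_destroy(&attr);
    return rc;
}

// Trivial example worker: sets the int that arg points to.
void *markDone(void *arg)
{
    *static_cast<int *>(arg) = 1;
    return NULL;
}
```

A real worker would unpack a small struct holding the distribution pointer and its size, then call the sort. Printing defaultThreadStackSize() on the target machine is a quick way to see which default actually applies there.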

A2: The other way is to simply avoid potentially unbounded recursion. My datasets are incredibly large. Probably not large enough to spend 256MB of stack, but it was reasonably straightforward to rewrite the quicksort function iteratively.

#include <stack>
#include <utility>

void quickSort_iter(double *distribution, size_t left, size_t right)
{
        if (left >= right)
                return;

        // Explicit stack of [left, right] ranges still waiting to be partitioned.
        std::stack<std::pair<size_t, size_t> > partition_stack;
        partition_stack.push(std::pair<size_t, size_t>(left, right));

        while (!partition_stack.empty())
        {

                left = partition_stack.top().first;
                right = partition_stack.top().second;
                partition_stack.pop();

                size_t pivot = partition(distribution, left, right);

                // Only push ranges that still hold at least two elements.
                if (pivot > left + 1)
                        partition_stack.push(std::pair<size_t, size_t>(left, pivot - 1));

                if (pivot + 1 < right)
                        partition_stack.push(std::pair<size_t, size_t>(pivot + 1, right));
        }
}

std::stack is being created using the default std::allocator, so internally it is using heap allocations to store the stack of sorting partitions and therefore will not run afoul of the stack limit.
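If you would rather keep the recursive shape, another common way to bound the depth is to recurse only into the smaller side of each partition and loop on the larger one. Each recursive call then handles at most half of the current range, so the depth stays logarithmic even on all-equal input. This is a sketch under assumptions, not the code from the question: it works on half-open ranges [lo, hi) and uses its own Lomuto-style partition (lomutoPartition and quickSortBounded are illustrative names).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Lomuto-style partition over the half-open range [lo, hi), hi - lo >= 2.
// Returns the final index of the pivot (taken from the last element).
static std::size_t lomutoPartition(double *a, std::size_t lo, std::size_t hi)
{
    const double pivot = a[hi - 1];
    std::size_t store = lo;
    for (std::size_t j = lo; j + 1 < hi; ++j) {
        if (a[j] <= pivot) {
            std::swap(a[store], a[j]);
            ++store;
        }
    }
    std::swap(a[store], a[hi - 1]);
    return store;
}

// Sorts [lo, hi). Recurses into the smaller side only and loops on the
// larger side, so the recursion depth is bounded by about log2(hi - lo).
void quickSortBounded(double *a, std::size_t lo, std::size_t hi)
{
    assert(lo <= hi);
    while (hi - lo > 1) {
        const std::size_t p = lomutoPartition(a, lo, hi);
        if (p - lo < hi - (p + 1)) {
            quickSortBounded(a, lo, p);      // left side is smaller
            lo = p + 1;                      // continue with the right side
        } else {
            quickSortBounded(a, p + 1, hi);  // right side is smaller (or equal)
            hi = p;                          // continue with the left side
        }
    }
}
```

Note that this only bounds the stack usage. The running time on all-equal input is still quadratic, because the partition remains as lopsided as before.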
