How to fix the segmentation fault?

Question

(Edit: I have just fixed the getpid cache problem and rerun gdb and valgrind.)

(Edit: I just increased the stack size for the child from 200 bytes to 2000 bytes.)

// test.c
#define _GNU_SOURCE
#include <stdio.h>
#include <assert.h>
#include <syscall.h>  // For syscall to call getpid
#include <signal.h>   // For SIGCHLD
#include <sys/types.h>// For getppid
#include <unistd.h>   // For getppid and sleep
#include <sched.h>    // For clone
#include <stdlib.h>   // For calloc and free

#define STACK_SIZE 2000

void Puts(const char *str)
{
    assert(fputs(str, stderr) != EOF);
}

void Sleep(unsigned int sec)
{
    do {
        sec = sleep(sec);
    } while(sec > 0);
}

int child(void *useless)
{
    Puts("The new process is created.\n");
    assert(fprintf(stderr, "pid = %d, ppid = %d\n", (pid_t) syscall(SYS_getpid), getppid()) > 0);

    Puts("sleep for 120 secs\n");
    Sleep(120);

    return 0;
}

int main(int argc, char* argv[])
{
    Puts("Allocate stack for new process\n");
    void *stack = calloc(STACK_SIZE, sizeof(char));
    void *stack_top = (void*) ((char*) stack + STACK_SIZE - 1);
    assert(fprintf(stderr, "stack = %p, stack top = %p\n", stack, stack_top) > 0);

    Puts("clone\n");
    int ret = clone(child, stack_top, CLONE_VM | CLONE_VFORK | CLONE_PARENT | SIGCHLD, NULL);
    Puts("clone returns\n");

    Puts("Free the stack\n");
    free(stack);

    if (ret == -1)
        perror("clone(child, stack, CLONE_VM | CLONE_VFORK, NULL)");
    else {
        ret = 0;
        Puts("Child dies...\n");
    }

    return ret;
}

I compiled the program with clang-7 test.c and ran ./a.out in bash. It returned instantly with the following output:

Allocate stack for new process
stack = 0x492260, stack top = 0x492a2f
clone
The new process is created.
Segmentation fault

It returned 139, meaning that signal SIGSEGV was sent to my process.
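The step from 139 to SIGSEGV relies on the bash convention that a command killed by signal n reports exit status 128 + n. A minimal sketch of that arithmetic follows; the function name signal_from_shell_status is a hypothetical helper introduced only for illustration.

```c
#include <signal.h>

/* Hypothetical helper: recover the signal number from a bash-style exit
 * status. bash reports 128 + n for a command terminated by signal n.
 * Returns 0 when the status does not encode a signal. */
int signal_from_shell_status(int status)
{
    return status > 128 ? status - 128 : 0;
}
```

On Linux, signal_from_shell_status(139) yields 11, which is the value of SIGSEGV.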

Then I recompiled it with -g and ran valgrind --trace-children=yes ./a.out to debug it:

|| ==14494== Memcheck, a memory error detector
|| ==14494== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
|| ==14494== Using Valgrind-3.12.0.SVN and LibVEX; rerun with -h for copyright info
|| ==14494== Command: ./a.out
|| ==14494== 
|| Allocate stack for new process
|| stack = 0x51f3040, stack top = 0x51f380f
|| clone
|| clone returns
|| Free the stack
|| Child dies...
|| ==14495== Invalid write of size 4
|| ==14495==    at 0x201322: ??? (in /home/nobodyxu/a.out)
|| ==14495==    by 0x4F2FCBE: clone (clone.S:95)
|| ==14495==  Address 0xffffffffffffffdc is not stack'd, malloc'd or (recently) free'd
|| ==14495== 
|| ==14495== 
|| ==14495== Process terminating with default action of signal 11 (SIGSEGV)
|| ==14495==  Access not within mapped region at address 0xFFFFFFFFFFFFFFDC
|| ==14495==    at 0x201322: ??? (in /home/nobodyxu/a.out)
|| ==14495==    by 0x4F2FCBE: clone (clone.S:95)
|| ==14495==  If you believe this happened as a result of a stack
|| ==14495==  overflow in your program's main thread (unlikely but
|| ==14495==  possible), you can try to increase the size of the
|| ==14495==  main thread stack using the --main-stacksize= flag.
|| ==14495==  The main thread stack size used in this run was 8388608.
|| ==14495== 
|| ==14495== HEAP SUMMARY:
|| ==14495==     in use at exit: 2,000 bytes in 1 blocks
|| ==14495==   total heap usage: 1 allocs, 0 frees, 2,000 bytes allocated
|| ==14495== 
|| ==14495== LEAK SUMMARY:
|| ==14495==    definitely lost: 0 bytes in 0 blocks
|| ==14495==    indirectly lost: 0 bytes in 0 blocks
|| ==14495==      possibly lost: 0 bytes in 0 blocks
|| ==14495==    still reachable: 2,000 bytes in 1 blocks
|| ==14495==         suppressed: 0 bytes in 0 blocks
|| ==14495== Rerun with --leak-check=full to see details of leaked memory
|| ==14495== 
|| ==14495== For counts of detected and suppressed errors, rerun with: -v
|| ==14495== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
|| ==14494== 
|| ==14494== HEAP SUMMARY:
|| ==14494==     in use at exit: 0 bytes in 0 blocks
|| ==14494==   total heap usage: 1 allocs, 1 frees, 2,000 bytes allocated
|| ==14494== 
|| ==14494== All heap blocks were freed -- no leaks are possible
|| ==14494== 
|| ==14494== For counts of detected and suppressed errors, rerun with: -v
|| ==14494== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

It also returned instantly and printed the output above.

I checked the generated assembly around 0x201322 and found that it belongs to int main(int argc, char* argv[]):

||   20131d:    e8 8e 01 00 00          callq  2014b0 <clone@plt>
||   201322:    89 45 dc                mov    %eax,-0x24(%rbp)
||   201325:    48 bf 54 09 20 00 00    movabs $0x200954,%rdi
||   20132c:    00 00 00 
||   20132f:    e8 dc fd ff ff          callq  201110 <Puts>
||   201334:    48 bf ad 08 20 00 00    movabs $0x2008ad,%rdi
||   20133b:    00 00 00
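The faulting address in the valgrind report can be related to the instruction at 0x201322, which stores %eax at -0x24(%rbp). The sketch below only reproduces the address arithmetic; store_address is a hypothetical name, and the reading that %rbp held 0 in the child at that point is an inference from the numbers, not something the thread confirms.

```c
#include <stdint.h>

/* Hypothetical helper: effective address of `mov %eax,-0x24(%rbp)`
 * for a given value of %rbp, using 64-bit wraparound arithmetic. */
uint64_t store_address(uint64_t rbp)
{
    return rbp - 0x24;
}
```

With rbp equal to 0 the result is 0xffffffffffffffdc, the address valgrind reports for the invalid write.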

I also tried set follow-fork-mode child in gdb to debug it, but that did not work.

How can I fix the segmentation fault?

Answer 1

The functions printf and fprintf do not seem to be thread-safe without various guard rails. This is detailed in "segfault with clone() and printf".

I found the problem by brute force: noting where the last print happened, then commenting out the lines after it until the error went away.

Answer 2

This segfault might be specific to glibc. I built this code snippet with musl libc, and it works fine. It does not seem to be related to the thread safety of fprintf either, because clone is called with CLONE_VFORK, which suspends the parent process.
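The CLONE_VFORK behaviour mentioned here can be observed directly. The following is a minimal sketch, assuming Linux; the names shared_flag, set_flag, and flag_after_vfork_clone, and the 64 KiB stack size, are illustrative choices and not part of the question's code. Because CLONE_VM shares the address space and CLONE_VFORK keeps the caller suspended until the child exits, the caller should see the child's write as soon as clone returns.

```c
#define _GNU_SOURCE
#include <sched.h>     /* clone */
#include <signal.h>    /* SIGCHLD */
#include <stddef.h>
#include <sys/mman.h>  /* mmap, munmap */
#include <sys/wait.h>  /* waitpid */

static volatile int shared_flag = 0;

/* Child body: writes to memory shared with the caller (CLONE_VM). */
static int set_flag(void *arg)
{
    (void)arg;
    shared_flag = 42;
    return 0;
}

/* Returns the value of shared_flag observed right after clone() returns,
 * or -1 on failure. */
int flag_after_vfork_clone(void)
{
    const size_t size = 64 * 1024; /* assumed size, illustrative only */
    char *stack = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    shared_flag = 0;
    /* The stack grows downward, so pass the address just past the end. */
    pid_t pid = clone(set_flag, stack + size,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, NULL);
    if (pid == -1) {
        munmap(stack, size);
        return -1;
    }

    int seen = shared_flag;  /* the child has already exited here */
    waitpid(pid, NULL, 0);   /* reap the child */
    munmap(stack, size);
    return seen;
}
```

The sketch omits CLONE_PARENT, which the question uses, so that the caller remains the child's parent and can reap it with waitpid.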

Answer 3

I used gdb to debug your program. The error messages are reproduced below.

The stack you allocated for the child may have been freed before fprintf actually executes in the child function.

In the child function, adding fflush(stdout); after the assert may solve your problem.

Continuing.
Allocate stack for new process
stack = 0x602010, stack top = 0x6027df
clone
The new process is created.
sleep for 20 secs
clone returns
Free the stack
*** Error in `test': double free or corruption (out): 0x0000000000602010 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7ffff7a847e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7ffff7a8d37a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7ffff7a9153c]
/***/***/tmp/test[0x400969]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7ffff7a2d830]
/***/***/tmp/test[0x400729]
======= Memory map: ========
00400000-00401000 r-xp 00000000 08:21 12848672                           /***/***/tmp/test
00600000-00601000 r--p 00000000 08:21 12848672                           /***/***/tmp/test
00601000-00602000 rw-p 00001000 08:21 12848672                           /***/***/tmp/test
00602000-00623000 rw-p 00000000 00:00 0                                  [heap]
7ffff0000000-7ffff0021000 rw-p 00000000 00:00 0
7ffff0021000-7ffff4000000 ---p 00000000 00:00 0
7ffff77f7000-7ffff780d000 r-xp 00000000 08:01 786957                     /lib/x86_64-linux-gnu/libgcc_s.so.1
7ffff780d000-7ffff7a0c000 ---p 00016000 08:01 786957                     /lib/x86_64-linux-gnu/libgcc_s.so.1
7ffff7a0c000-7ffff7a0d000 rw-p 00015000 08:01 786957                     /lib/x86_64-linux-gnu/libgcc_s.so.1
7ffff7a0d000-7ffff7bcd000 r-xp 00000000 08:01 791529                     /lib/x86_64-linux-gnu/libc-2.23.so
7ffff7bcd000-7ffff7dcd000 ---p 001c0000 08:01 791529                     /lib/x86_64-linux-gnu/libc-2.23.so
7ffff7dcd000-7ffff7dd1000 r--p 001c0000 08:01 791529                     /lib/x86_64-linux-gnu/libc-2.23.so
7ffff7dd1000-7ffff7dd3000 rw-p 001c4000 08:01 791529                     /lib/x86_64-linux-gnu/libc-2.23.so
7ffff7dd3000-7ffff7dd7000 rw-p 00000000 00:00 0
7ffff7dd7000-7ffff7dfd000 r-xp 00000000 08:01 791311                     /lib/x86_64-linux-gnu/ld-2.23.so
7ffff7fd3000-7ffff7fd6000 rw-p 00000000 00:00 0
7ffff7ff7000-7ffff7ff8000 rw-p 00000000 00:00 0
7ffff7ff8000-7ffff7ffa000 r--p 00000000 00:00 0                          [vvar]
7ffff7ffa000-7ffff7ffc000 r-xp 00000000 00:00 0                          [vdso]
7ffff7ffc000-7ffff7ffd000 r--p 00025000 08:01 791311                     /lib/x86_64-linux-gnu/ld-2.23.so
7ffff7ffd000-7ffff7ffe000 rw-p 00026000 08:01 791311                     /lib/x86_64-linux-gnu/ld-2.23.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 00:00 0
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Program received signal SIGSEGV, Segmentation fault.
__GI_abort () at abort.c:125
125     abort.c: No such file or directory.
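The stack concerns raised in this thread can be made concrete. In the question, stack_top is stack + STACK_SIZE - 1, which is an odd address (0x492a2f in the first run), whereas the x86-64 ABI expects a 16-byte-aligned stack pointer, and 2000 bytes leaves little room for stdio calls. The following is a minimal sketch under stated assumptions, not a confirmed fix for the asker's case: it assumes Linux, maps a 64 KiB stack with mmap (the size and the names child_fn and run_child_on_private_stack are illustrative), passes the address just past the end of the mapping as the stack top, drops CLONE_PARENT so the caller can reap the child, and releases the stack only after the child has exited.

```c
#define _GNU_SOURCE
#include <sched.h>       /* clone */
#include <signal.h>      /* SIGCHLD */
#include <stdio.h>       /* fprintf */
#include <sys/mman.h>    /* mmap, munmap */
#include <sys/syscall.h> /* SYS_getpid */
#include <sys/types.h>   /* pid_t */
#include <sys/wait.h>    /* waitpid */
#include <unistd.h>      /* syscall, getppid */

#define CHILD_STACK_SIZE (64 * 1024) /* assumed size, illustrative only */

/* Child body: same kind of output as in the question, without the sleep. */
static int child_fn(void *arg)
{
    (void)arg;
    fprintf(stderr, "pid = %d, ppid = %d\n",
            (int)syscall(SYS_getpid), (int)getppid());
    return 0;
}

/* Runs child_fn on its own mapped stack.
 * Returns the child's exit code, or -1 on any failure. */
int run_child_on_private_stack(void)
{
    char *stack = mmap(NULL, CHILD_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    /* The stack grows downward: the top is one past the end of the
     * mapping, which is page-aligned and therefore 16-byte aligned. */
    char *stack_top = stack + CHILD_STACK_SIZE;

    pid_t pid = clone(child_fn, stack_top,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, NULL);

    int result = -1;
    int status = 0;
    if (pid != -1 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

    /* Unmap only after the child has exited and been reaped. */
    munmap(stack, CHILD_STACK_SIZE);
    return result;
}
```

Using mmap rather than calloc keeps a child stack overflow from silently overwriting neighbouring heap blocks, which is one way a "double free or corruption" report like the one above can arise.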

Answer summary

Answer 1: score 1, accepted, 2019-01-18 09:25:42
Answer 2: score 1, 2019-01-20 13:07:57
Answer 3: score 0, 2019-01-19 08:47:11
