How to debug a segmentation fault while the gdb stack trace is full of '??'?

My executable contains a symbol table, but it seems that the stack trace has been overwritten.

How can I get more information out of that core, please? For instance, is there a way to inspect the heap, to see the object instances populating it and get some clues? Any idea is appreciated.

I am a C++ programmer for a living and I have encountered this issue more times than I like to admit. Your application is smashing a HUGE part of the stack. Chances are the function that is corrupting the stack is also crashing on return. The reason is that the return address has been overwritten, and this is why GDB's stack trace is messed up.

This is how I debug this issue:

1) Step through the application until it crashes. (Look for a function that is crashing on return.)

2) Once you have identified the function, declare a variable at the VERY FIRST LINE of the function:

int canary=0;

(The reason why it must be the first line is that this value must be at the very top of the stack. This "canary" will be overwritten before the function's return address.)

3) Put a variable watch on canary and step through the function; when canary!=0, you have found your buffer overflow! Another possibility is to set a variable breakpoint for when canary!=0 and just run the program normally. This is a little easier, but not all IDEs support variable breakpoints.

EDIT: I have talked to a senior programmer at my office, and in order to understand the core dump you need to resolve the memory addresses it contains. One way to figure out these addresses is to look at the MAP file for the binary, which is human readable. Here is an example of generating a MAP file using gcc:

gcc -o foo -Wl,-Map,foo.map foo.c

This is a piece of the puzzle, but it will still be very difficult to obtain the address of the function that is crashing. If you are running this application on a modern platform, then ASLR will probably make the addresses in the core dump useless. Some implementations of ASLR randomize the function addresses of your binary, which makes the core dump absolutely worthless.

  1. Use a debugging tool to detect the problem; valgrind works well.
  2. When compiling your code, add the -Wall option so the compiler tells you about likely mistakes, and make sure your code builds without any warnings.

     ex: gcc -Wall -g -c -o oke.o oke.c
  3. Make sure you also pass the -g option to produce debugging information. You can also print location information yourself using some macros. The following are very useful for me:

     __LINE__ : tells you the current line

     __FILE__ : tells you the current source file

     __func__ : tells you the current function

  4. I don't think using the debugger alone is enough; you should get used to making the most of what the compiler can do.

Hope this helps.

TL;DR: extremely large local variables declared in functions are allocated on the stack, which, on certain platform and compiler combinations, can overrun and corrupt the stack.

Just to add another potential cause of this issue: I was recently debugging a very similar problem. Running gdb with the application and core file would produce results such as:

Core was generated by `myExecutable myArguments'.
Program terminated with signal 6, Aborted.
#0  0x00002b075174ba45 in ?? ()
(gdb)

That was extremely unhelpful and disappointing. After hours of scouring the internet, I found a forum that talked about how the particular compiler we were using (the Intel compiler) had a smaller default stack size than other compilers, and that large local variables could overrun and corrupt the stack. Looking at our code, I found the culprit:

void MyClass::MyMethod() {
   ...
   char charBuffer[MAX_BUFFER_SIZE];
   ...
}

Bingo! I found MAX_BUFFER_SIZE was set to 10000000, so a 10MB local variable was being allocated on the stack! After changing the implementation to use a shared_ptr and create the buffer dynamically, the program suddenly started working perfectly.

Try running it under the Valgrind memory debugger.

To confirm, was your executable compiled in release mode, i.e. without debug symbols? That could explain why there are '??' entries. Try recompiling with the -g switch, which includes debugging information and embeds it into the executable. Other than that, I am out of ideas as to why you have '??'.

Not really. Sure, you can dig around in memory and look at things. But without a stack trace you don't know how you got to where you are or what the parameter values were.

However, the very fact that your stack is corrupt tells you that you need to look for code that writes into the stack.

  • Overwriting a stack array. This can be done the obvious way, or by calling a function or system call with bad size arguments or pointers of the wrong type.
  • Using a pointer or reference to a function's local stack variables after that function has returned.
  • Casting a pointer to a stack value to a pointer of the wrong size and using it.

If you have a Unix system, "valgrind" is a good tool for finding some of these problems.

I assume that, since you say "My executable contains symbol table", you compiled and linked with -g and that your binary wasn't stripped.

We can confirm this: strings -a your_executable | grep function_name_you_know_should_exist

Also try using pstack on the core and see if it does a better job of picking up the call stack. If it does, it sounds like your gdb is out of date compared to your gcc/g++ version.

Sounds like you're not using the same glibc version on your machine as the one that was in use when the program crashed in production and produced the core file. Get the files listed in the output of "ldd ./appname" and copy them onto your machine, then tell gdb where to look:

set solib-absolute-prefix /path/to/libs
