C++ application crashes on production machine due to segmentation fault

Question

We are facing a C++ application crash caused by a segmentation fault on Red Hat Linux. The application uses embedded Python.

Our constraints are listed below.

I don't have access to the production machine where the application crashes. The client sends us core dump files when it crashes.
The problem is not reproducible on our test machine, which has exactly the same configuration as the production machine.
The application sometimes crashes after 1 hour, 4 hours, 1 day, or 1 week. We have not found a time frame or any specific pattern in the crashes.
The application is complex, and embedded Python code is called from many places within it. We have done extensive code reviews but could not find the cause that way.
According to the stack trace in the core dump, it crashes around a multiplication operation. We reviewed our code for such an operation and found none. Such operations might be performed by Python scripts executed through the embedded interpreter, which we do not control and cannot review.
We can't use any profiling tool such as Valgrind in the production environment.
We use gdb on our local machine to analyze the core dumps. We can't run gdb on the production machine.

Here is what we have tried so far.

We analyzed the logs and continuously replayed the requests that reach our application against our test environment to reproduce the problem.
The logs do not show a crash point; they differ every time. I think this is because memory is corrupted somewhere else and the application crashes some time later.
We checked the load on the application at every point, and it never exceeded the application's limit.
The application's memory utilization is also normal.
We profiled the application with Valgrind on our test machine and fixed the errors it reported, but the application still crashes.

I would appreciate any help that guides us toward solving the problem.

Version details:

Red Hat Linux Server 5.6 (Tikanga), Python 2.6.2, GCC 4.1

Following is the stack trace I get (on my machine) from the core dump files they shared. FYI, we don't have access to the production machine to run gdb on the core dump files.

0  0x00000033c6678630 in ?? ()
1  0x00002b59d0e9501e in PyString_FromFormatV (format=0x2b59d0f2ab00 "can't multiply sequence by non-int of type '%.200s'", vargs=0x46421f20) at Objects/stringobject.c:291
2  0x00002b59d0ef1620 in PyErr_Format (exception=0x2b59d1170bc0, format=<value optimized out>) at Python/errors.c:548
3  0x00002b59d0e4bf1c in PyNumber_Multiply (v=0x2aaaac080600, w=0x2b59d116a550) at Objects/abstract.c:1192
4  0x00002b59d0ede326 in PyEval_EvalFrameEx (f=0x732b670, throwflag=<value optimized out>) at Python/ceval.c:1119
5  0x00002b59d0ee2493 in call_function (f=0x7269330, throwflag=<value optimized out>) at Python/ceval.c:3794
6  PyEval_EvalFrameEx (f=0x7269330, throwflag=<value optimized out>) at Python/ceval.c:2389
7  0x00002b59d0ee2493 in call_function (f=0x70983f0, throwflag=<value optimized out>) at Python/ceval.c:3794
8  PyEval_EvalFrameEx (f=0x70983f0, throwflag=<value optimized out>) at Python/ceval.c:2389
9  0x00002b59d0ee2493 in call_function (f=0x6f1b500, throwflag=<value optimized out>) at Python/ceval.c:3794
10 PyEval_EvalFrameEx (f=0x6f1b500, throwflag=<value optimized out>) at Python/ceval.c:2389
11 0x00002b59d0ee2493 in call_function (f=0x2aaab09d52e0, throwflag=<value optimized out>) at Python/ceval.c:3794
12 PyEval_EvalFrameEx (f=0x2aaab09d52e0, throwflag=<value optimized out>) at Python/ceval.c:2389
13 0x00002b59d0ee2d9f in ?? () at Python/ceval.c:2968 from /usr/local/lib/libpython2.6.so.1.0
14 0x0000000000000007 in ?? ()
15 0x00002b59d0e83042 in lookdict_string (mp=<value optimized out>, key=0x46424dc0, hash=40722104) at Objects/dictobject.c:412
16 0x00002aaab09d5458 in ?? ()
17 0x00002aaab09d5458 in ?? ()
18 0x00002aaab02a91f0 in ?? ()
19 0x00002aaab0b2c3a0 in ?? ()
20 0x0000000000000004 in ?? ()
21 0x00000000026d5eb8 in ?? ()
22 0x00002aaab0b2c3a0 in ?? ()
23 0x00002aaab071e080 in ?? ()
24 0x0000000046422bf0 in ?? ()
25 0x0000000046424dc0 in ?? ()
26 0x00000000026d5eb8 in ?? ()
27 0x00002aaab0987710 in ?? ()
28 0x00002b59d0ee2de2 in PyEval_EvalFrame (f=0x0) at Python/ceval.c:538
29 0x0000000000000000 in ?? ()

Answer 1

You are almost certainly doing something bad with pointers in your C++ code, which can be very tough to debug.

Do not assume that the stack trace is relevant. It might be, but pointer misuse can often lead to crashes some time later.
Build with full warnings on. The compiler can point out some non-obvious pointer misuse, such as returning a reference to a local variable.
Investigate your arrays. Try replacing arrays with std::vector (C++03) or std::array (C++11) so you can iterate using begin() and end() and index using at().
Investigate your pointers. Replace them with std::unique_ptr (C++11) or boost::scoped_ptr wherever you can (there should be no overhead in release builds). Replace the rest with shared_ptr or weak_ptr. Any that can't be replaced are probably the source of problematic logic.
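As a rough illustration of the array suggestion, here is a minimal sketch that sticks to C++03 so it should build with an older compiler. The names sample_or and mean_sample are hypothetical and do not come from the application in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper: returns the element at `index`, or `fallback` when the
// index is out of range. With a raw array a bad index reads or corrupts
// neighbouring memory; at() throws std::out_of_range instead.
int sample_or(const std::vector<int>& samples, std::size_t index, int fallback) {
    try {
        return samples.at(index);
    } catch (const std::out_of_range&) {
        return fallback;
    }
}

// Hypothetical helper: averages the samples. Iterating from begin() to end()
// keeps the loop bounds tied to the container's real size.
double mean_sample(const std::vector<int>& samples) {
    assert(!samples.empty());  // precondition: an empty input would divide by zero
    long total = 0;
    for (std::vector<int>::const_iterator it = samples.begin();
         it != samples.end(); ++it) {
        total += *it;
    }
    return static_cast<double>(total) / samples.size();
}
```

In real code it is often more useful to log the std::out_of_range exception and let it propagate than to swallow it, because the log entry then points at the faulty index instead of at a crash somewhere else later.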

Because of the very problems you're seeing, modern C++ allows almost all raw pointer usage to be removed entirely. Try it.
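A minimal sketch of what that can look like with std::unique_ptr, assuming a C++11 compiler is available (the GCC 4.1 named in the question predates C++11, so boost::scoped_ptr would be the option there). Session, open_session, and close_session are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical resource standing in for any heap object the application
// currently manages through a raw pointer.
struct Session {
    static int live_count;  // number of Session objects currently alive
    std::string name;
    explicit Session(const std::string& n) : name(n) { ++live_count; }
    Session(const Session&) = delete;             // copying would blur ownership
    Session& operator=(const Session&) = delete;
    ~Session() { --live_count; }
};
int Session::live_count = 0;

// Ownership is stated in the signature: the caller receives the only owner.
std::unique_ptr<Session> open_session(const std::string& name) {
    return std::unique_ptr<Session>(new Session(name));
}

// A sink function takes ownership by value. The Session is destroyed when
// the parameter goes out of scope, on every return path.
std::string close_session(std::unique_ptr<Session> session) {
    assert(session);  // an empty or moved-from pointer is a caller bug
    return session->name;
}
```

A caller hands the object over with close_session(std::move(s)); afterwards s is empty, so an accidental second use is a null pointer that can be tested for, not a dangling one.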

Answer 2

First things first: compile both your binary and libpython with debug symbols and push that build out. The stack trace will be much easier to follow.

The relevant argument to g++ is -g.

Answer 3

Suggestions:

As already suggested, provide a complete debug build
Provide a memory test tool and a CPU torture test
Load the debug symbols of the Python library when analyzing the core dump
The stack trace shows something concerning eval(), so I guess you do dynamic code generation and evaluation/execution. If so, the actual error might be in this code or in the arguments passed to it. Assertions at every interface to that code, and code dumps, may help.
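One way to read the last suggestion, sketched under assumptions: frame 1 of the trace is formatting the message "can't multiply sequence by non-int of type", so a check on the C++ side of the boundary could reject such argument pairs before they reach the interpreter and record which script was involved. ArgKind, ScriptArg, and require_multipliable are hypothetical names; a real version would inspect the actual objects passed to the embedded interpreter.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical description of a value about to cross into the interpreter.
enum ArgKind { kInt, kFloat, kSequence };

struct ScriptArg {
    std::string name;
    ArgKind kind;
};

// Throws std::invalid_argument when Python would refuse to multiply the two
// values: repeating a sequence requires an integer on the other side.
void require_multipliable(const ScriptArg& lhs, const ScriptArg& rhs,
                          const std::string& script_name) {
    assert(!script_name.empty());  // precondition: every call site names its script
    const bool lhs_seq = (lhs.kind == kSequence);
    const bool rhs_seq = (rhs.kind == kSequence);
    const bool bad = (lhs_seq && rhs.kind != kInt) ||
                     (rhs_seq && lhs.kind != kInt);
    if (bad) {
        std::ostringstream msg;
        msg << script_name << ": cannot multiply '" << lhs.name
            << "' by '" << rhs.name << "'";
        throw std::invalid_argument(msg.str());
    }
}
```

The check throws instead of relying on assert alone because assertions are usually compiled out of release builds, and the release build is the one that crashes in production. Logging the exception text gives a script name to start from even when the core dump's stack is unreadable.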

3 answers

Answer 1
3 2013-01-24 14:20:11

Answer 2
1 2013-01-24 14:03:03

Answer 3
1 2013-01-24 14:16:50
