Core dump with SIGFPE for non-zero division
I have a qemu-kvm process that suspiciously core dumped with SIGFPE:
Program terminated with signal 8, Arithmetic exception.
#0 bdrv_exceed_io_limits (bs=0x7f75916b7270, is_write=false, nb_sectors=1)
at /usr/src/debug/qemu-kvm-0.12.1.2/block.c:3730
3730 elapsed_time /= (NANOSECONDS_PER_SECOND);
Where elapsed_time is a double (its value is shown in the gdb output below) and NANOSECONDS_PER_SECOND is a macro:
#define NANOSECONDS_PER_SECOND 1000000000.0
I can't think of any way this could cause SIGFPE. Any clue?
Scenario: I'm using RHEL-6.5 as the host and trying to start a Windows guest. It is steadily reproducible with the same command.
Full backtrace:
(gdb) bt
#0 bdrv_exceed_io_limits (bs=0x7ffff86f9270, is_write=false, nb_sectors=1) at /usr/src/debug/qemu-kvm-0.12.1.2/block.c:3730
#1 bdrv_io_limits_intercept (bs=0x7ffff86f9270, is_write=false, nb_sectors=1) at /usr/src/debug/qemu-kvm-0.12.1.2/block.c:181
#2 0x00007ffff7e0bf6d in bdrv_co_do_readv (bs=0x7ffff86f9270, sector_num=0, nb_sectors=1, qiov=0x7fffe8000ab8, flags=<value optimized out>)
at /usr/src/debug/qemu-kvm-0.12.1.2/block.c:2136
#3 0x00007ffff7e0c293 in bdrv_co_do_rw (opaque=0x7fffe8000b00) at /usr/src/debug/qemu-kvm-0.12.1.2/block.c:3880
#4 0x00007ffff7e125eb in coroutine_trampoline (i0=<value optimized out>, i1=<value optimized out>)
at /usr/src/debug/qemu-kvm-0.12.1.2/coroutine-ucontext.c:129
#5 0x00007ffff5718ba0 in ?? () from /lib64/libc.so.6
#6 0x00007fffffffbf60 in ?? ()
#7 0x0000000000000000 in ?? ()
(gdb) disass
0x00007ffff7e0b6ae <+190>: mov 0x8a0(%rbx),%rax
0x00007ffff7e0b6b5 <+197>: test %rax,%rax
=> 0x00007ffff7e0b6b8 <+200>: divsd 0x170660(%rip),%xmm0 # 0x7ffff7f7bd20
0x00007ffff7e0b6c0 <+208>: je 0x7ffff7e0b950 <bdrv_io_limits_intercept+864>
0x00007ffff7e0b6c6 <+214>: mov 0x888(%rbx),%rsi
(gdb) x/gf 0x7ffff7f7bd20
0x7ffff7f7bd20: 1000000000
(gdb) p elapsed_time
$3 = 919718
(gdb) p $_siginfo
$1 = {si_signo = 8, si_errno = 0, si_code = 6, _sifields = {_pad = {-136186690, 32767, 4244976, 0, -560757824, 32767, -
-560757344, 32767, 0, 0, 0, 0, 0, 0, 34884976, 0, -136186690, 32767, 34884976, 0, 4258127, 0, 0, 0, -55876128, 3265
-136186690, si_uid = 32767}, _timer = {si_tid = -136186690, si_overrun = 32767, si_sigval = {sival_int = 4244976, s
_rt = {si_pid = -136186690, si_uid = 32767, si_sigval = {sival_int = 4244976, sival_ptr = 0x40c5f0}}, _sigchld = {s
si_uid = 32767, si_status = 4244976, si_utime = -2408436515056123904, si_stime = -584917379700457473}, _sigfault
0x7ffff7e1f4be}, _sigpoll = {si_band = 140737352168638, si_fd = 4244976}}}
So, what could be wrong with this divsd instruction? Any suggestion on how to debug it?
Answering it myself: this is a kernel bug that accidentally sets MXCSR to a bad value. The Linux kernel raises SIGFPE with the inexact-result code (si_code = 6 in the $_siginfo output above) when the inexact exception bit is not masked properly.
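To make the MXCSR explanation concrete, here is a minimal sketch, assuming an x86-64 build where double arithmetic uses SSE. It repeats the division from block.c:3730 and reports which MXCSR exception flags (bits 0-5) it sets. The helper names flags_raised_by_division and inexact_would_trap are hypothetical and are not part of QEMU or the kernel.

```c
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr, _MM_EXCEPT_*, _MM_MASK_* (x86 SSE only) */

/* Hypothetical helper: perform num / den at run time and return the MXCSR
 * exception flags (bits 0-5) that the division raised. */
static unsigned int flags_raised_by_division(double elapsed_time, double divisor)
{
    volatile double num = elapsed_time;  /* volatile: prevent constant folding */
    volatile double den = divisor;
    volatile double result;
    unsigned int saved = _mm_getcsr();

    /* Clear the sticky exception flags; the mask bits (7-12) stay untouched. */
    _mm_setcsr(saved & ~(unsigned int)_MM_EXCEPT_MASK);
    result = num / den;
    (void)result;

    unsigned int raised = _mm_getcsr() & _MM_EXCEPT_MASK;
    _mm_setcsr(saved);  /* restore the caller's MXCSR */
    return raised;
}

/* Hypothetical helper: an inexact result traps (and Linux then delivers
 * SIGFPE) only if the precision mask bit (PM, bit 12) is clear. */
static int inexact_would_trap(unsigned int mxcsr)
{
    return (mxcsr & _MM_MASK_INEXACT) == 0;
}
```

With these helpers, flags_raised_by_division(919718.0, 1000000000.0) returns only _MM_EXCEPT_INEXACT, because the quotient is not exactly representable as a double. With the default MXCSR value 0x1F80 every exception is masked, so inexact_would_trap(0x1F80) is 0 and the flag is set silently. If something clears bit 12, the same divsd traps instead.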
The SIGFPE in your code is not due to division by zero, but to one of the following reasons:
FPE_FLTOVF_TRAP: Floating overflow trap.
FPE_FLTUND_TRAP: Floating underflow trap. (Trapping on floating underflow is not normally enabled.)
Macro: int SIGFPE
The SIGFPE signal reports a fatal arithmetic error. Although the name is derived from "floating-point exception", this signal actually covers all arithmetic errors, including division by zero and overflow. If a program stores integer data in a location which is then used in a floating-point operation, this often causes an "invalid operation" exception, because the processor cannot recognize the data as a floating-point number.
Actual floating-point exceptions are a complicated subject because there are many types of exceptions with subtly different meanings, and the SIGFPE signal doesn't distinguish between them. The IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985) defines various floating-point exceptions and requires conforming computer systems to report their occurrences. However, this standard does not specify how the exceptions are reported, or what kinds of handling and control the operating system can offer to the programmer.
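Although the signal number alone does not distinguish the exception types, the si_code field that accompanies SIGFPE does. The following is a minimal sketch that maps si_code values to their POSIX FPE_* names; the helper name fpe_code_name is hypothetical. On Linux, FPE_FLTRES has the value 6, which matches the si_code = 6 in the question's $_siginfo output.

```c
#define _POSIX_C_SOURCE 200809L  /* expose the FPE_* si_code constants */
#include <signal.h>

/* Hypothetical helper: return the POSIX name for a SIGFPE si_code value. */
static const char *fpe_code_name(int code)
{
    switch (code) {
    case FPE_INTDIV: return "FPE_INTDIV";  /* integer divide by zero */
    case FPE_INTOVF: return "FPE_INTOVF";  /* integer overflow */
    case FPE_FLTDIV: return "FPE_FLTDIV";  /* floating-point divide by zero */
    case FPE_FLTOVF: return "FPE_FLTOVF";  /* floating-point overflow */
    case FPE_FLTUND: return "FPE_FLTUND";  /* floating-point underflow */
    case FPE_FLTRES: return "FPE_FLTRES";  /* floating-point inexact result */
    case FPE_FLTINV: return "FPE_FLTINV";  /* invalid floating-point operation */
    case FPE_FLTSUB: return "FPE_FLTSUB";  /* subscript out of range */
    default:         return "unknown";
    }
}
```

A handler installed with sigaction and SA_SIGINFO could pass info->si_code to this function to log which arithmetic condition was reported.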
Because:
NANOSECONDS_PER_SECOND = 1000000000.0
and elapsed_time = 919718
so elapsed_time /= (NANOSECONDS_PER_SECOND);
=> 919718 / 1000000000.0 == 0.000919718
I am sure this causes the floating underflow trap, which is the reason for the SIGFPE.
A floating overflow trap can't be the case here because the operation is a division.
SIGFPE may not necessarily be seen until some time after the instruction causing it. This is confusing, of course.
See https://stackoverflow.com/a/2219339/1442050