Core dump with SIGFPE for non-zero division

I have a qemu-kvm process suspiciously core dumped with SIGFPE:

Program terminated with signal 8, Arithmetic exception.
#0  bdrv_exceed_io_limits (bs=0x7f75916b7270, is_write=false, nb_sectors=1)
   at /usr/src/debug/qemu-kvm-0.12.1.2/block.c:3730
3730        elapsed_time  /= (NANOSECONDS_PER_SECOND);

Here elapsed_time is a double (its value is shown in the gdb output below) and NANOSECONDS_PER_SECOND is a macro:

#define NANOSECONDS_PER_SECOND  1000000000.0

I can't think of any reason why this should cause SIGFPE. Any clue?

Scenario: I'm using RHEL-6.5 as the host and trying to start a Windows guest. The crash is reliably reproducible with the same command.

Full backtrace:

(gdb) bt
#0  bdrv_exceed_io_limits (bs=0x7ffff86f9270, is_write=false, nb_sectors=1) at /usr/src/debug/qemu-kvm-0.12.1.2/block.c:3730
#1  bdrv_io_limits_intercept (bs=0x7ffff86f9270, is_write=false, nb_sectors=1) at /usr/src/debug/qemu-kvm-0.12.1.2/block.c:181
#2  0x00007ffff7e0bf6d in bdrv_co_do_readv (bs=0x7ffff86f9270, sector_num=0, nb_sectors=1, qiov=0x7fffe8000ab8, flags=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.12.1.2/block.c:2136
#3  0x00007ffff7e0c293 in bdrv_co_do_rw (opaque=0x7fffe8000b00) at /usr/src/debug/qemu-kvm-0.12.1.2/block.c:3880
#4  0x00007ffff7e125eb in coroutine_trampoline (i0=<value optimized out>, i1=<value optimized out>)
    at /usr/src/debug/qemu-kvm-0.12.1.2/coroutine-ucontext.c:129
#5  0x00007ffff5718ba0 in ?? () from /lib64/libc.so.6
#6  0x00007fffffffbf60 in ?? ()
#7  0x0000000000000000 in ?? ()

(gdb) disass
   0x00007ffff7e0b6ae <+190>:   mov    0x8a0(%rbx),%rax
   0x00007ffff7e0b6b5 <+197>:   test   %rax,%rax
=> 0x00007ffff7e0b6b8 <+200>:   divsd  0x170660(%rip),%xmm0        # 0x7ffff7f7bd20
   0x00007ffff7e0b6c0 <+208>:   je     0x7ffff7e0b950 <bdrv_io_limits_intercept+864>
   0x00007ffff7e0b6c6 <+214>:   mov    0x888(%rbx),%rsi

(gdb) x/gf 0x7ffff7f7bd20
0x7ffff7f7bd20: 1000000000

(gdb) p elapsed_time
$3 = 919718

(gdb) p $_siginfo
$1 = {si_signo = 8, si_errno = 0, si_code = 6, _sifields = {_pad = {-136186690, 32767, 4244976, 0, -560757824, 32767, -
    -560757344, 32767, 0, 0, 0, 0, 0, 0, 34884976, 0, -136186690, 32767, 34884976, 0, 4258127, 0, 0, 0, -55876128, 3265
    -136186690, si_uid = 32767}, _timer = {si_tid = -136186690, si_overrun = 32767, si_sigval = {sival_int = 4244976, s
    _rt = {si_pid = -136186690, si_uid = 32767, si_sigval = {sival_int = 4244976, sival_ptr = 0x40c5f0}}, _sigchld = {s
      si_uid = 32767, si_status = 4244976, si_utime = -2408436515056123904, si_stime = -584917379700457473}, _sigfault 
    0x7ffff7e1f4be}, _sigpoll = {si_band = 140737352168638, si_fd = 4244976}}}
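
The si_code field in the dump above identifies which arithmetic condition raised the signal. As a rough aid for reading it, the sketch below maps a SIGFPE si_code to its symbolic name using the constants from <signal.h>. The function name sigfpe_code_name is made up for illustration and is not part of qemu or glibc.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>

/* Hypothetical helper: translate the si_code of a SIGFPE into a readable name.
 * The FPE_* constants are the ones POSIX defines in <signal.h>. */
const char *sigfpe_code_name(int si_code)
{
    switch (si_code) {
    case FPE_INTDIV: return "FPE_INTDIV (integer divide by zero)";
    case FPE_INTOVF: return "FPE_INTOVF (integer overflow)";
    case FPE_FLTDIV: return "FPE_FLTDIV (floating-point divide by zero)";
    case FPE_FLTOVF: return "FPE_FLTOVF (floating-point overflow)";
    case FPE_FLTUND: return "FPE_FLTUND (floating-point underflow)";
    case FPE_FLTRES: return "FPE_FLTRES (floating-point inexact result)";
    case FPE_FLTINV: return "FPE_FLTINV (invalid floating-point operation)";
    case FPE_FLTSUB: return "FPE_FLTSUB (subscript out of range)";
    default:         return "unknown SIGFPE code";
    }
}
```

On Linux, FPE_FLTRES has the value 6, so sigfpe_code_name(6) reports an inexact result for the si_code = 6 shown in the dump.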

So, what could be wrong with this divsd instruction? Any suggestion on how to debug it?

Answering this myself: this is a kernel bug that accidentally sets mxcsr to a bad value. The Linux kernel delivers SIGFPE with the inexact-result code when that exception bit is left unmasked.
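
To illustrate the effect (not the kernel bug itself), here is a minimal sketch that deliberately unmasks the inexact exception in MXCSR and then performs the same kind of division. It assumes x86-64 Linux with GCC or Clang. The names sigfpe_code_for_division and on_fpe are hypothetical, and the real qemu-kvm process never changed MXCSR on purpose.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <xmmintrin.h>

static sigjmp_buf fpe_jump;
static volatile sig_atomic_t fpe_code;

/* Record why SIGFPE was raised, then jump back past the faulting division. */
static void on_fpe(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    fpe_code = info->si_code;
    siglongjmp(fpe_jump, 1);
}

/* Hypothetical helper: divide with the SSE inexact exception unmasked.
 * Returns the si_code of the resulting SIGFPE, or 0 if no signal arrived. */
int sigfpe_code_for_division(double dividend, double divisor)
{
    struct sigaction sa, old;
    volatile double x = dividend;   /* volatile forces a run-time division */
    volatile double d = divisor;
    unsigned int saved = _mm_getcsr();

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_fpe;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGFPE, &sa, &old);

    fpe_code = 0;
    if (sigsetjmp(fpe_jump, 1) == 0) {
        /* Clear the sticky exception flags (low 6 bits) and clear the
         * inexact mask bit, so an inexact result now traps. */
        _mm_setcsr((saved & ~0x3fu) & ~(unsigned int)_MM_MASK_INEXACT);
        x /= d;
    }

    /* Restore the original masks with the sticky flags cleared. */
    _mm_setcsr(saved & ~0x3fu);
    sigaction(SIGFPE, &old, NULL);
    return fpe_code;
}
```

Under these assumptions, sigfpe_code_for_division(919718.0, 1000000000.0) returns FPE_FLTRES because the quotient cannot be represented exactly, while a division with an exact result such as 1.0 / 2.0 returns 0.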

The SIGFPE in your code is not caused by division by zero; it has one of the following causes:

FPE_FLTOVF_TRAP: Floating overflow trap.
FPE_FLTUND_TRAP: Floating underflow trap. (Trapping on floating underflow is not normally enabled.)

Macro: int SIGFPE:

The SIGFPE signal reports a fatal arithmetic error. Although the name is derived from "floating-point exception", this signal actually covers all arithmetic errors, including division by zero and overflow. If a program stores integer data in a location which is then used in a floating-point operation, this often causes an "invalid operation" exception, because the processor cannot recognize the data as a floating-point number.

Actual floating-point exceptions are a complicated subject because there are many types of exceptions with subtly different meanings, and the SIGFPE signal doesn't distinguish between them. The IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985) defines various floating-point exceptions and requires conforming computer systems to report their occurrences. However, this standard does not specify how the exceptions are reported, or what kinds of handling and control the operating system can offer to the programmer.

Because NANOSECONDS_PER_SECOND = 1000000000.0 and elapsed_time = 919718, the statement elapsed_time /= (NANOSECONDS_PER_SECOND); computes 919718 / 1000000000.0 == 0.000919718. I am sure this causes a floating underflow trap, which is the reason for the SIGFPE.

A floating overflow trap can't be the case here because the operation is a division.

SIGFPE may not necessarily be seen until some time after the instruction causing it. This is confusing of course.

See https://stackoverflow.com/a/2219339/1442050
