
Segmentation fault (core dumped) on tf.Session()

I am new to TensorFlow.

I just installed TensorFlow and, to test the installation, I tried the following code. As soon as I start a TF session, I get the Segmentation fault (core dumped) error.

bafhf@remote-server:~$ python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
/home/bafhf/anaconda3/envs/ismll/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
>>> tf.Session()
2018-05-15 12:04:15.461361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1349] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:04:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
Segmentation fault (core dumped)

My nvidia-smi output is:

Tue May 15 12:12:26 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:04:00.0 Off |                    0 |
| N/A   38C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:05:00.0 Off |                    2 |
| N/A   31C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

nvcc --version is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

Also, gcc --version is:

gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The following is my PATH:

/home/bafhf/bin:/home/bafhf/.local/bin:/usr/local/cuda/bin:/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib:/home/bafhf/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

and my LD_LIBRARY_PATH:

/usr/local/cuda/bin:/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib


I am running this on a server where I do not have root access. Still, I managed to install everything by following the instructions on the official website.

Edit: New observations:

It seems the GPU allocates memory for the process for a moment and then the segmentation fault (core dumped) error is thrown:

Terminal output

Edit 2: Changed the tensorflow version

I downgraded my tensorflow version from v1.8 to v1.5. The issue still remains.


Is there any way to resolve or debug this issue?

This may be happening because you are working with multiple GPUs here. Try setting CUDA visible devices to just one of the GPUs. See this link for instructions on how to do that. In my case, this solved the problem.
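A minimal sketch of how that could look on this machine, assuming you want to expose only GPU 0 and that your_script.py is just a placeholder for whatever script creates the session:

export CUDA_VISIBLE_DEVICES=0   # only the first Tesla K80 is visible to the process
python your_script.py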

If you look at the nvidia-smi output, the second GPU shows an ECC value of 2. This error appears regardless of the CUDA or TF version, usually manifests as a segfault, and sometimes comes with the CUDA_ERROR_ECC_UNCORRECTABLE flag in the stack trace.

I came to this conclusion from this post:

An "uncorrectable ECC error" usually refers to a hardware failure. ECC is Error Correcting Code, a means of detecting and correcting bit errors in stored RAM. A stray cosmic ray can flip one bit stored in RAM every once in a long while, but an "uncorrectable ECC error" indicates that several bits in RAM storage are "wrong" - too many for the ECC to recover the original bit values.

This could mean that there is a bad or marginal RAM cell in your GPU device memory.

Marginal circuits of any kind may not fail 100% of the time, but are more likely to fail under the stress of heavy use - and the associated rise in temperature.

A reboot should normally clear the ECC errors. If not, the only option seems to be to replace the hardware.
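If you want to confirm that a specific card really is reporting uncorrectable ECC errors, rather than just reading the column in the summary table, nvidia-smi can print the full ECC counters; for example, for the second K80 (device index 1):

nvidia-smi -q -d ECC -i 1   # detailed volatile/aggregate ECC error counters for GPU 1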


So what did I do, and how did I finally solve the problem?

  1. I tested my code on a separate machine with an NVIDIA 1050 Ti card, and it ran perfectly fine.
  2. I ran the code only on the first card, the one with the healthy ECC value, just to narrow down the problem. I did this as described in this post, by setting the CUDA_VISIBLE_DEVICES environment variable.
  3. Then I requested a restart of the Tesla K80 server to check whether a reboot would fix the issue. It took them a while, but the server was then restarted.

    Now the problem no longer exists and I can run both cards for my tensorflow implementation.

In case anyone is still interested: I happened to have the same issue, with the "Volatile Uncorr. ECC" output. In my case the problem was incompatible versions, as shown below:

Loaded runtime CuDNN library: 7.1.1 but source was compiled with: 7.2.1. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration. Segmentation fault

After I upgraded the CuDNN library to 7.3.1 (greater than 7.2.1), the segmentation fault error disappeared. To upgrade, I did the following (as also documented here); a quick check of the installed version follows the list.

  1. Download the CuDNN library from the NVIDIA website
  2. sudo tar -xzvf [TAR_FILE]
  3. sudo cp cuda/include/cudnn.h /usr/local/cuda/include
  4. sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
  5. sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
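After copying the files, a quick way to confirm which cuDNN version is now installed (for cuDNN 7.x the version macros live in cudnn.h; this assumes the same /usr/local/cuda prefix as above):

grep -A 2 CUDNN_MAJOR /usr/local/cuda/include/cudnn.h   # prints CUDNN_MAJOR, CUDNN_MINOR, CUDNN_PATCHLEVEL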

Check that you are using the exact CUDA and CuDNN versions that tensorflow requires, and that you are using the graphics driver version that ships with that CUDA version.

I once had a similar issue because the driver was too new. Downgrading it to the version that shipped with the CUDA release required by tensorflow solved the problem for me.

I ran into this issue recently.

The reason was multiple GPUs in a docker container. The solution is fairly simple; you either:

set CUDA_VISIBLE_DEVICES in the host, see https://stackoverflow.com/a/50464695/2091555

or

launch docker with --ipc=host if you need multiple GPUs, e.g.:

docker run --runtime nvidia --ipc host \
  --rm -it \
  nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04

This issue is actually quite nasty: the segfault happens during the cuInit() call inside the container, while everything works fine on the host. I will leave the logs here to make it easier for other people to find this answer via search engines.

(base) root@e121c445c1eb:~# conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
Collecting package metadata (current_repodata.json): / Segmentation fault (core dumped)

(base) root@e121c445c1eb:~# gdb python /data/corefiles/core.conda.572.1569384636
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.

warning: core file may not match specified executable file.
[New LWP 572]
[New LWP 576]

warning: Unexpected size of section `.reg-xstate/572' in core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/conda/bin/python /opt/conda/bin/conda upgrade conda'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/572' in core file.
#0  0x00007f829f0a55fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
[Current thread is 1 (Thread 0x7f82bbfd7700 (LWP 572))]
(gdb) bt
#0  0x00007f829f0a55fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#1  0x00007f829f06e3a5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f829f07002c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f829f0e04f7 in cuInit () from /usr/lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f82b99a1ec0 in ffi_call_unix64 () from /opt/conda/lib/python3.7/lib-dynload/../../libffi.so.6
#5  0x00007f82b99a187d in ffi_call () from /opt/conda/lib/python3.7/lib-dynload/../../libffi.so.6
#6  0x00007f82b9bb7f7e in _call_function_pointer (argcount=1, resmem=0x7ffded858980, restype=<optimized out>, atypes=0x7ffded858940, avalues=0x7ffded858960, pProc=0x7f829f0e0380 <cuInit>, 
    flags=4353) at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callproc.c:827
#7  _ctypes_callproc () at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callproc.c:1184
#8  0x00007f82b9bb89b4 in PyCFuncPtr_call () at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/_ctypes.c:3969
#9  0x000055c05db9bd2b in _PyObject_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:199
#10 0x000055c05dbf7026 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4619
#11 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3124
#12 0x000055c05db9a79b in function_code_fastcall (globals=<optimized out>, nargs=0, args=<optimized out>, co=<optimized out>)
    at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:283
#13 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:408
#14 0x000055c05dbf2846 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4616
#15 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3124
... (stack omitted)
#46 0x000055c05db9aa27 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:433
---Type <return> to continue, or q <return> to quit---q
Quit

Another attempt was installing with pip:

(base) root@e121c445c1eb:~# pip install torch torchvision
(base) root@e121c445c1eb:~# python
Python 3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
Segmentation fault (core dumped)

(base) root@e121c445c1eb:~# gdb python /data/corefiles/core.python.28.1569385311 
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.

warning: core file may not match specified executable file.
[New LWP 28]

warning: Unexpected size of section `.reg-xstate/28' in core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
bt
Core was generated by `python'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/28' in core file.
#0  0x00007ffaa1d995fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0  0x00007ffaa1d995fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffaa1d623a5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ffaa1d6402c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007ffaa1dd44f7 in cuInit () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007ffaee75f724 in cudart::globalState::loadDriverInternal() () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#5  0x00007ffaee760643 in cudart::__loadDriverInternalUtil() () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#6  0x00007ffafe2cda99 in __pthread_once_slow (once_control=0x7ffaeebe2cb0 <cudart::globalState::loadDriver()::loadDriverControl>, 
... (stack omitted)
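For reference, gdb could only be pointed at those core files because core dumps were enabled beforehand; a minimal sketch for turning them on (the /data/corefiles directory is simply the path used in the logs above, and writing core_pattern requires root):

ulimit -c unlimited                                                    # allow core files to be written in this shell
mkdir -p /data/corefiles
echo '/data/corefiles/core.%e.%p.%t' > /proc/sys/kernel/core_pattern   # name cores by executable, pid and timestamp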

I was also facing the same problem. I have a workaround you can try.

I followed these steps:

  1. Reinstall Python 3.5 or above.
  2. Reinstall CUDA and add the cuDNN libraries to it.
  3. Reinstall the Tensorflow 1.8.0 GPU version.

I am using tensorflow in a cloud environment at Paperspace.

Updating to cuDNN 7.3.1 did not work for me.

One approach would be to build Tensorflow from source with proper GPU and CPU support; a rough sketch of the usual steps follows.
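As a sketch only, the typical TF 1.x source-build sequence from the official build-from-source instructions, assuming bazel and a matching CUDA/cuDNN toolchain are already installed (your flags and paths may differ):

git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
./configure    # answer the prompts, enabling CUDA support
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-*.whl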

This is not the proper solution, but it temporarily solved my problem (downgrading tensorflow to 1.5.0):

pip uninstall tensorflow-gpu
pip install tensorflow==1.5.0
pip install numpy==1.14.0
pip install six==1.10.0
pip install joblib==0.12

Hope this helps!
