
Segmentation fault (core dumped) on tf.Session()

I am new to TensorFlow.

I just installed TensorFlow and, to test the installation, I tried the following code. As soon as I initiate the TF session, I get the Segmentation fault (core dumped) error.

bafhf@remote-server:~$ python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
/home/bafhf/anaconda3/envs/ismll/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
>>> tf.Session()
2018-05-15 12:04:15.461361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1349] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:04:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
Segmentation fault (core dumped)

My nvidia-smi output is:

Tue May 15 12:12:26 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:04:00.0 Off |                    0 |
| N/A   38C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:05:00.0 Off |                    2 |
| N/A   31C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

And nvcc --version is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

Also, gcc --version is:

gcc (Ubuntu 5.4.0-6ubuntu1~16.04.9) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Following is my PATH:

/home/bafhf/bin:/home/bafhf/.local/bin:/usr/local/cuda/bin:/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib:/home/bafhf/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin

and the LD_LIBRARY_PATH:

/usr/local/cuda/bin:/usr/local/cuda/lib:/usr/local/cuda/extras/CUPTI/lib


I am running this on a server and I don't have root privileges. Still, I managed to install everything as per the instructions on the official website.

Edit: New observations:

It seems like the GPU allocates memory for the process for a second, and then the Segmentation fault (core dumped) error is thrown:

(screenshot: terminal output showing the GPU memory briefly allocated before the segfault)

Edit 2: Changed TensorFlow version

I downgraded my TensorFlow version from v1.8 to v1.5. The issue still remains.


Is there any way to address or debug this issue?

This could possibly occur because you are using multiple GPUs here. Try setting CUDA visible devices to just one of the GPUs. See this link for instructions on how to do that. In my case, this solved the problem.
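
A minimal sketch of this in Python, assuming the first GPU (index 0) is the healthy one; the variable has to be set before TensorFlow initializes CUDA:

import os

# Hide all GPUs except the first one; this must happen before TensorFlow
# touches the driver (i.e. before importing it / creating a Session).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

sess = tf.Session()  # should no longer segfault if the remaining GPU is healthy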

If you look at the nvidia-smi output, the second GPU has an ECC value of 2. This error manifests itself irrespective of the CUDA or TF version, usually as a segfault, and sometimes with the CUDA_ERROR_ECC_UNCORRECTABLE flag in the stack trace.

I came to this conclusion from this post:

"Uncorrectable ECC error" usually refers to a hardware failure. “无法纠正的 ECC 错误”通常是指硬件故障。 ECC is Error Correcting Code, a means to detect and correct errors in bits stored in RAM. ECC 是纠错码,一种检测和纠正存储在 RAM 中的位错误的方法。 A stray cosmic ray can disrupt one bit stored in RAM every once in a great while, but "uncorrectable ECC error" indicates that several bits are coming out of RAM storage "wrong" - too many for the ECC to recover the original bit values.杂散的宇宙射线可能会在很长一段时间内破坏存储在 RAM 中的一位,但“无法纠正的 ECC 错误”表示 RAM 存储中的几位“错误”——太多以至于 ECC 无法恢复原始位值。

This could mean that you have a bad or marginal RAM cell in your GPU device memory.

Marginal circuits of any kind may not fail 100% of the time, but are more likely to fail under the stress of heavy use - and the associated rise in temperature.

A reboot is usually supposed to take away the ECC error. If not, it seems like the only option is to change the hardware.
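
To see which card is reporting the ECC errors before requesting a reboot, one quick check is to dump the ECC counters via nvidia-smi, for example from Python:

import subprocess

# Print the per-GPU ECC error report; the card with non-zero volatile
# uncorrectable errors is the one to exclude or replace.
print(subprocess.check_output(["nvidia-smi", "-q", "-d", "ECC"]).decode())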


So what did I do, and how did I finally fix the issue?

  1. I tested my code on a separate machine with an NVIDIA 1050 Ti, and it executed perfectly fine.
  2. I made the code run only on the first card, for which the ECC value was normal, just to narrow down the issue. I did this by setting the CUDA_VISIBLE_DEVICES environment variable, following this post.
  3. I then requested a restart of the Tesla K80 server to check whether a restart could fix the issue. It took a while, but the server was then restarted.

Now the issue is gone and I can run both cards for my TensorFlow implementations.

In case anyone is still interested: I happened to have the same issue, with "Volatile Uncorr. ECC" in the output. My problem was incompatible versions, as shown below:

Loaded runtime CuDNN library: 7.1.1 but source was compiled with: 7.2.1. CuDNN library major and minor version needs to match or have higher minor version in case of CuDNN 7.0 or later version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration. Segmentation fault

After I upgraded the cuDNN library to 7.3.1 (which is greater than 7.2.1), the segmentation fault error disappeared. To upgrade, I did the following (as also documented here); a small version check is sketched after the list.

  1. Download the cuDNN library from the NVIDIA website
  2. sudo tar -xzvf [TAR_FILE]
  3. sudo cp cuda/include/cudnn.h /usr/local/cuda/include
  4. sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
  5. sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
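
To confirm that the new headers are in place, here is a rough check that reads the version defines from the installed header (the path is an assumption based on the copy commands above; cuDNN 8 and later moved these defines into cudnn_version.h):

import re

# Parse CUDNN_MAJOR / CUDNN_MINOR / CUDNN_PATCHLEVEL from the header.
header = open("/usr/local/cuda/include/cudnn.h").read()
version = dict(re.findall(r"#define CUDNN_(MAJOR|MINOR|PATCHLEVEL)\s+(\d+)", header))
print("cuDNN {MAJOR}.{MINOR}.{PATCHLEVEL}".format(**version))

This should print 7.3.1 after the upgrade described above.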

Check that you are using the exact versions of CUDA and cuDNN required by TensorFlow, and also that you are using the graphics card driver version that comes with that CUDA version.

I once had a similar issue with a driver that was too recent. Downgrading it to the version that ships with the CUDA version required by TensorFlow solved the issue for me.
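
A rough way to gather the three version numbers to compare against the configurations listed by TensorFlow, assuming nvcc and nvidia-smi are on the PATH:

import subprocess
import tensorflow as tf

# TensorFlow version (determines which CUDA/cuDNN combination it expects).
print("tensorflow:", tf.__version__)
# CUDA toolkit version, as reported by nvcc.
print(subprocess.check_output(["nvcc", "--version"]).decode())
# Installed driver version, as reported by nvidia-smi.
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]).decode())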

I encountered this problem recently.

The reason is multiple GPUs in the docker container. The solution is pretty simple; you either:

set CUDA_VISIBLE_DEVICES in the host; see https://stackoverflow.com/a/50464695/2091555

or

use --ipc=host to launch the docker container if you need multiple GPUs, e.g.:

docker run --runtime nvidia --ipc host \
  --rm -it \
  nvidia/cuda:10.0-cudnn7-runtime-ubuntu16.04

This problem is actually pretty nasty: the segfault happens during the cuInit() call inside the docker container, while everything works fine on the host. I will leave the log here so that search engines can find this answer more easily for other people.

(base) root@e121c445c1eb:~# conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
Collecting package metadata (current_repodata.json): / Segmentation fault (core dumped)

(base) root@e121c445c1eb:~# gdb python /data/corefiles/core.conda.572.1569384636
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.

warning: core file may not match specified executable file.
[New LWP 572]
[New LWP 576]

warning: Unexpected size of section `.reg-xstate/572' in core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/opt/conda/bin/python /opt/conda/bin/conda upgrade conda'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/572' in core file.
#0  0x00007f829f0a55fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
[Current thread is 1 (Thread 0x7f82bbfd7700 (LWP 572))]
(gdb) bt
#0  0x00007f829f0a55fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#1  0x00007f829f06e3a5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#2  0x00007f829f07002c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so
#3  0x00007f829f0e04f7 in cuInit () from /usr/lib/x86_64-linux-gnu/libcuda.so
#4  0x00007f82b99a1ec0 in ffi_call_unix64 () from /opt/conda/lib/python3.7/lib-dynload/../../libffi.so.6
#5  0x00007f82b99a187d in ffi_call () from /opt/conda/lib/python3.7/lib-dynload/../../libffi.so.6
#6  0x00007f82b9bb7f7e in _call_function_pointer (argcount=1, resmem=0x7ffded858980, restype=<optimized out>, atypes=0x7ffded858940, avalues=0x7ffded858960, pProc=0x7f829f0e0380 <cuInit>, 
    flags=4353) at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callproc.c:827
#7  _ctypes_callproc () at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/callproc.c:1184
#8  0x00007f82b9bb89b4 in PyCFuncPtr_call () at /usr/local/src/conda/python-3.7.3/Modules/_ctypes/_ctypes.c:3969
#9  0x000055c05db9bd2b in _PyObject_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:199
#10 0x000055c05dbf7026 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4619
#11 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3124
#12 0x000055c05db9a79b in function_code_fastcall (globals=<optimized out>, nargs=0, args=<optimized out>, co=<optimized out>)
    at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:283
#13 _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:408
#14 0x000055c05dbf2846 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:4616
#15 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1553721932202/work/Python/ceval.c:3124
... (stack omitted)
#46 0x000055c05db9aa27 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1553721932202/work/Objects/call.c:433
---Type <return> to continue, or q <return> to quit---q
Quit

Another try is installing with pip:

(base) root@e121c445c1eb:~# pip install torch torchvision
(base) root@e121c445c1eb:~# python
Python 3.7.3 (default, Mar 27 2019, 22:11:17) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
Segmentation fault (core dumped)

(base) root@e121c445c1eb:~# gdb python /data/corefiles/core.python.28.1569385311 
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.

warning: core file may not match specified executable file.
[New LWP 28]

warning: Unexpected size of section `.reg-xstate/28' in core file.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
bt
Core was generated by `python'.
Program terminated with signal SIGSEGV, Segmentation fault.

warning: Unexpected size of section `.reg-xstate/28' in core file.
#0  0x00007ffaa1d995fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0  0x00007ffaa1d995fb in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffaa1d623a5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ffaa1d6402c in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007ffaa1dd44f7 in cuInit () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007ffaee75f724 in cudart::globalState::loadDriverInternal() () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#5  0x00007ffaee760643 in cudart::__loadDriverInternalUtil() () from /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so
#6  0x00007ffafe2cda99 in __pthread_once_slow (once_control=0x7ffaeebe2cb0 <cudart::globalState::loadDriver()::loadDriverControl>, 
... (stack omitted)

I was also facing the same issue. I have a workaround that you can try.

I followed these steps:

  1. Reinstall Python 3.5 or above.
  2. Reinstall CUDA and add the cuDNN libraries to it.
  3. Reinstall the TensorFlow 1.8.0 GPU version.

I am using TensorFlow in a cloud environment from Paperspace.

Updating cuDNN to 7.3.1 did not work for me.

One way is to build TensorFlow from source with proper GPU and CPU support.

This is not a proper solution, but it solved my issue temporarily (downgrading TensorFlow to 1.5.0):

pip uninstall tensorflow-gpu
pip install tensorflow==1.5.0
pip install numpy==1.14.0
pip install six==1.10.0
pip install joblib==0.12
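
After the downgrade, a quick sanity check along these lines (TensorFlow 1.x API) confirms whether the crash at session creation is gone:

import tensorflow as tf

print(tf.__version__)  # expected: 1.5.0 after the downgrade

# Creating the session is exactly the step that used to segfault.
with tf.Session() as sess:
    print(sess.run(tf.constant("session created OK")))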

Hope this helps!
