tensorflow segmentation fault on Nvidia Xavier Jetson when trying to load a model with memory growth enabled

I get a segmentation fault with a very specific code sequence, and only on Xavier Jetson:

import os
import requests
import tensorflow as tf
  
# 1    
print('SET MEMORY GROWTH')
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)  

# 2
print(f'REQUESTS GET')
requests.get('https://speed.hetzner.de/100MB.bin')

# 3
command = 'ls'
print(f'SYSTEM CALL ({command})')
os.system(command)

# 4 
print('MODEL LOAD') 
model = tf.keras.models.load_model('mnv2_xavier.h5')

If I remove any one of these steps, the code runs without issues. I don't know whether other code sequences lead to the same behavior, but I am fairly sure they exist.

I am trying to figure out the reason for the segmentation fault, but so far I have had no luck.

I think it may be related to the tensorflow memory growth policy and to the fact that Xavier Jetson shares memory between the CPU and GPU.
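One way to probe the shared-memory hypothesis is to log how much memory is still available before each of the four steps. The following is a minimal sketch, assuming a Linux system that exposes `/proc/meminfo`; the helper names `parse_meminfo` and `available_mib` are hypothetical and not part of tensorflow or the original script.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int} (most fields are in kB)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(':')
        parts = rest.split()
        # Keep only well-formed "Name:   12345 kB" style lines.
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_mib(path='/proc/meminfo'):
    """Return MemAvailable in MiB, or None if the field is missing."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    kb = info.get('MemAvailable')
    return None if kb is None else kb / 1024
```

Calling `print(available_mib())` right before steps 1 to 4 would show whether the download or the system call noticeably shrinks the memory pool that the GPU also draws from.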

I would like to know whether there is a fix or workaround for this problem, and whether anyone can explain this behavior.

Notes:

Code to create this model:

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.models import Model
from tensorflow.keras import Input

x = Input((224, 224, 3))  # MobileNetV2's default input size is 224x224x3
y = MobileNetV2()(x)
model = Model(x,y)
model.save('mnv2_xavier.h5')

Versions:

Jetpack 4.4
tensorflow 2.3.0
keras 2.4.0
python 3.6.9

Output:

2021-04-15 16:51:22.031610: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
SET MEMORY GROWTH
2021-04-15 16:51:25.349940: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-15 16:51:25.374098: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:949] ARM64 does not support NUMA - returning NUMA node zero
2021-04-15 16:51:25.374309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.377GHz coreCount: 8 deviceMemorySize: 31.18GiB deviceMemoryBandwidth: 82.08GiB/s
2021-04-15 16:51:25.374437: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-04-15 16:51:25.377470: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-04-15 16:51:25.379874: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-04-15 16:51:25.380541: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-04-15 16:51:25.383268: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-04-15 16:51:25.385455: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-04-15 16:51:25.385918: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-04-15 16:51:25.386201: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:949] ARM64 does not support NUMA - returning NUMA node zero
2021-04-15 16:51:25.386633: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:949] ARM64 does not support NUMA - returning NUMA node zero
2021-04-15 16:51:25.386723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
REQUESTS GET
SYSTEM CALL (ls)
code          logs          logs2
bashc.sh      main-log.log  tests
Desktop       Documents     mnv2_xavier.h5
Downloads     model.py      Music
Videos        Pictures      go  
Public        segfault.py 
MODEL LOAD
2021-04-15 16:51:29.542399: W tensorflow/core/platform/profile_utils/cpu_utils.cc:108] Failed to find bogomips or clock in /proc/cpuinfo; cannot determine CPU frequency
2021-04-15 16:51:29.543521: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xcbba840 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-04-15 16:51:29.543595: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
Segmentation fault (core dumped)

This error happens because the process is trying to use more memory than it should. When the system does not allow this, the process is terminated with a segmentation fault. First, run the script under gdb to see where the crash occurs:

$gdb python3
(gdb) run pythonfile.py
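As a complement to gdb, Python's standard `faulthandler` module can write the Python-level traceback of every thread when the process receives a fatal signal such as SIGSEGV. The sketch below assumes it is called at the very top of the script, before `import tensorflow`; the function name `enable_crash_traceback` and the log file name are hypothetical.

```python
import faulthandler


def enable_crash_traceback(log_path='segfault_traceback.log'):
    """Write Python tracebacks of all threads to log_path on a fatal signal.

    The returned file object must stay open while the handler is enabled,
    so the caller should keep a reference to it.
    """
    log_file = open(log_path, 'w')
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

The same handler can be switched on without touching the code by running `python3 -X faulthandler pythonfile.py`, which prints the traceback to stderr. This only shows which Python line was executing; the native frames still require gdb.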

If the error points to libapt-pkg5.0, fix the corresponding packages for your operating system. On Unix-based systems (Xavier, Nano, TX2):

$sudo dpkg --purge --force-depends apt apt-utils libapt-inst2.0:arm64 libapt-pkg5.0:arm64

If the error is still not resolved, edit ~/.bashrc:

$gedit ~/.bashrc

Add the line:

export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1
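The dynamic loader reads `LD_PRELOAD` when a process starts, so setting it from inside an already running Python interpreter has no effect on that interpreter. To try the workaround for a single script without editing ~/.bashrc, the script can be launched in a child process with the variable set. This is a rough sketch: `with_preload` and `run_with_preload` are hypothetical helpers, and it assumes existing `LD_PRELOAD` entries are colon-separated.

```python
import os
import subprocess
import sys

# Library path taken from the export line above.
LIBGOMP = '/usr/lib/aarch64-linux-gnu/libgomp.so.1'


def with_preload(env, lib=LIBGOMP):
    """Return a copy of env with lib placed first in LD_PRELOAD."""
    new_env = dict(env)
    # Keep any existing entries, dropping duplicates of lib and empty items.
    current = [p for p in new_env.get('LD_PRELOAD', '').split(':')
               if p and p != lib]
    new_env['LD_PRELOAD'] = ':'.join([lib] + current)
    return new_env


def run_with_preload(script):
    """Run script in a child interpreter started with libgomp preloaded."""
    result = subprocess.run([sys.executable, script],
                            env=with_preload(os.environ))
    return result.returncode
```

For example, `run_with_preload('segfault.py')` returns the child's exit code; on Linux a negative value such as -11 means the child was killed by that signal number (11 is SIGSEGV).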
