TensorFlow segmentation fault on Nvidia Xavier Jetson when loading a model with memory growth enabled

Question

I get a segmentation fault with a very specific code sequence, and only on Xavier Jetson:

import os
import requests
import tensorflow as tf
  
# 1    
print('SET MEMORY GROWTH')
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)  

# 2
print(f'REQUESTS GET')
requests.get('https://speed.hetzner.de/100MB.bin')

# 3
command = 'ls'
print(f'SYSTEM CALL ({command})')
os.system(command)

# 4 
print('MODEL LOAD') 
model = tf.keras.models.load_model('mnv2_xavier.h5')

If I remove any one of these steps, the code runs without issues. I don't know whether other code sequences lead to the same behavior, but I am pretty sure they exist.
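
One way to confirm which combinations of steps crash is to run each variant in a fresh interpreter and record its exit status. The following is a minimal sketch and not part of the original post: `run_variants` and the step names are hypothetical. On Linux, a negative return code means the child was killed by that signal number (-11 for SIGSEGV).

```python
import subprocess
import sys


def run_variants(steps, timeout=600):
    """Run the full script, then the script with each single step left out.

    `steps` maps a step name to its source code (insertion order is kept).
    Returns {omitted step name, or None for the full script: return code}.
    """
    results = {}
    for omitted in [None, *steps]:
        # Rebuild the script from every step except the omitted one.
        source = "\n".join(
            code for name, code in steps.items() if name != omitted
        )
        proc = subprocess.run(
            [sys.executable, "-c", source],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        results[omitted] = proc.returncode
    return results
```

Each of the four numbered steps of the script above could be passed as one entry of `steps`, with the imports as an extra first entry that is needed by all of them.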

I have been trying to figure out the reason for the segmentation fault but, so far, I have had no luck.

I think it may be related to TensorFlow's memory growth policy and to the fact that Xavier Jetson shares memory between the CPU and GPU.

I would like to know whether there is a fix or workaround for this problem, and whether anyone has an explanation for this behavior.

Notes:

Code to create this model:

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.models import Model
from tensorflow.keras import Input

x = Input((224,244,3))
y = MobileNetV2()(x)
model = Model(x,y)
model.save('mnv2_xavier.h5')

Versions:

Jetpack 4.4
tensorflow 2.3.0
keras 2.4.0
python 3.6.9

Output:

2021-04-15 16:51:22.031610: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
SET MEMORY GROWTH
2021-04-15 16:51:25.349940: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2021-04-15 16:51:25.374098: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:949] ARM64 does not support NUMA - returning NUMA node zero
2021-04-15 16:51:25.374309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:00:00.0 name: Xavier computeCapability: 7.2
coreClock: 1.377GHz coreCount: 8 deviceMemorySize: 31.18GiB deviceMemoryBandwidth: 82.08GiB/s
2021-04-15 16:51:25.374437: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.2
2021-04-15 16:51:25.377470: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10
2021-04-15 16:51:25.379874: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-04-15 16:51:25.380541: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-04-15 16:51:25.383268: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-04-15 16:51:25.385455: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.10
2021-04-15 16:51:25.385918: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-04-15 16:51:25.386201: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:949] ARM64 does not support NUMA - returning NUMA node zero
2021-04-15 16:51:25.386633: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:949] ARM64 does not support NUMA - returning NUMA node zero
2021-04-15 16:51:25.386723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
REQUESTS GET
SYSTEM CALL (ls)
code          logs          logs2
bashc.sh      main-log.log  tests
Desktop       Documents     mnv2_xavier.h5
Downloads     model.py      Music
Videos        Pictures      go  
Public        segfault.py 
MODEL LOAD
2021-04-15 16:51:29.542399: W tensorflow/core/platform/profile_utils/cpu_utils.cc:108] Failed to find bogomips or clock in /proc/cpuinfo; cannot determine CPU frequency
2021-04-15 16:51:29.543521: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xcbba840 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-04-15 16:51:29.543595: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
Segmentation fault (core dumped)

Answer 1

This error happens because the system is trying to use more memory than it should. When the system does not allow this, it raises a segmentation fault. First, run the script under gdb to see where the error occurs:

$gdb python3
(gdb) run pythonfile.py
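
As a complement to gdb, Python's built-in `faulthandler` module can print the Python-level traceback when the interpreter receives a fatal signal such as SIGSEGV. This is a minimal sketch and not part of the original answer: `run_with_faulthandler` is a hypothetical helper name, and it assumes the script is launched in a child process.

```python
import subprocess
import sys


def run_with_faulthandler(script_path):
    """Run a script with `-X faulthandler` and return (return code, stderr).

    With faulthandler enabled, a fatal signal such as SIGSEGV makes the
    interpreter dump the Python traceback to stderr before it dies.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", script_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode stderr as text (works on Python 3.6)
    )
    return proc.returncode, proc.stderr
```

Calling `faulthandler.enable()` at the top of the script itself has the same effect. The traceback shows which Python line was executing at the time of the crash, while gdb shows the native frames.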

If the error points to libapt-pkg5.0, install the appropriate package for your operating system. For Unix-based systems (Xavier, Nano, TX2):

$sudo dpkg --purge --force-depends apt apt-utils libapt-inst2.0:arm64 libapt-pkg5.0:arm64

If the error is still not resolved, open ~/.bashrc:

$gedit ~/.bashrc

Add the following line:

export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libgomp.so.1
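
To try the preload for a single run before changing ~/.bashrc, the variable can be set only in the environment of a child process. This is a minimal sketch and not part of the original answer: `env_with_preload` and `run_script_with_preload` are hypothetical helper names, and the library path is the one given in the answer.

```python
import os
import subprocess
import sys

# Path taken from the answer; adjust it if libgomp lives elsewhere.
LIBGOMP = "/usr/lib/aarch64-linux-gnu/libgomp.so.1"


def env_with_preload(library, base_env=None):
    """Return a copy of the environment with `library` first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # LD_PRELOAD accepts a colon-separated list; keep any existing entries.
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


def run_script_with_preload(script_path, library=LIBGOMP):
    """Run a Python script in a child process that preloads `library`."""
    proc = subprocess.run(
        [sys.executable, script_path],
        env=env_with_preload(library),
    )
    return proc.returncode
```

For example, `run_script_with_preload("segfault.py")` runs the script once with libgomp preloaded and leaves the shell configuration untouched.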

Answer 1 (0), 2022-06-16 15:52:47
