Undefined symbols when using user operator in tensorflow-gpu>=1.15

Hi everybody. I wrote some user operators to extend TensorFlow and tried to use CMake to compile the code into separate shared libraries, one for each TensorFlow version.

It works fine with tensorflow-gpu<=1.14 but not with 1.15 and 2.0. I got the following error when loading the library.

tensorflow.python.framework.errors_impl.NotFoundError: build/lib/libtensorflow_ctext.so: undefined symbol: _ZN10tensorflow12OpDefBuilder4AttrENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

I ran nm build/lib/libtensorflow_ctext.so on the libraries built for the 1.14 version and the 2.0 version. Both shared libraries have this undefined symbol in the middle of the output.

U _ZN10tensorflow12OpDefBuilder4AttrENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

It seems that the program is supposed to find this symbol in the linked TensorFlow framework library, libtensorflow_framework.so. I searched libtensorflow_framework.so.2 for similar symbols and found several of them.

0000000000cacc50 T _ZN10tensorflow12OpDefBuilder10DeprecatedEiSs
0000000000cace00 T _ZN10tensorflow12OpDefBuilder10SetShapeFnESt8functionIFNS_6StatusEPNS_15shape_inference16InferenceContextEEE
0000000000cacb20 T _ZN10tensorflow12OpDefBuilder13ControlOutputESs
0000000000cac980 T _ZN10tensorflow12OpDefBuilder13SetIsStatefulEv
0000000000cac970 T _ZN10tensorflow12OpDefBuilder14SetIsAggregateEv
0000000000cac960 T _ZN10tensorflow12OpDefBuilder16SetIsCommutativeEv
0000000000cac990 T _ZN10tensorflow12OpDefBuilder27SetAllowsUninitializedInputEv
0000000000cacb50 T _ZN10tensorflow12OpDefBuilder3DocESs
0000000000caca90 T _ZN10tensorflow12OpDefBuilder4AttrESs
0000000000cacac0 T _ZN10tensorflow12OpDefBuilder5InputESs
0000000000cacaf0 T _ZN10tensorflow12OpDefBuilder6OutputESs
0000000000cac830 T _ZN10tensorflow12OpDefBuilderC1ESs
0000000000cac830 T _ZN10tensorflow12OpDefBuilderC2ESs
0000000000c702d0 W _ZN10tensorflow12OpDefBuilderD1Ev
0000000000c702d0 W _ZN10tensorflow12OpDefBuilderD2Ev

The symbol _ZN10tensorflow12OpDefBuilder4AttrESs looks very similar but differs in the last several characters. I don't really know what "ESs" and "ENSt7" stand for.

Hints on how I could debug this would be much appreciated. Here is the command that builds my shared library (generated by CMake):

g++ -fPIC   -shared -Wl,-soname,libtensorflow_ctext.so -o lib/libtensorflow_ctext.so src/CMakeFiles/bp_par_2d.dir/bp_par_2d.cc.o src/CMakeFiles/bp_par_2d_sv.dir/bp_par_2d_sv.cc.o src/CMakeFiles/fp_par_2d.dir/fp_par_2d.cc.o src/CMakeFiles/filter.dir/filter.cc.o cuda/CMakeFiles/bp_par_2d_cu.dir/bp_par_2d.cu.o cuda/CMakeFiles/bp_par_2d_sv_cu.dir/bp_par_2d_sv.cu.o cuda/CMakeFiles/fp_par_2d_cu.dir/fp_par_2d.cu.o cuda/CMakeFiles/filter_cu.dir/filter.cu.o tensorflow/CMakeFiles/bp_par_2d_ops.dir/bp_par_2d_ops.cu.o tensorflow/CMakeFiles/bp_par_2d_sv_ops.dir/bp_par_2d_sv_ops.cu.o tensorflow/CMakeFiles/fp_par_2d_ops.dir/fp_par_2d_ops.cu.o tensorflow/CMakeFiles/ramp_filter_ops.dir/ramp_filter_ops.cu.o CMakeFiles/tensorflow_ctext.dir/cmake_device_link.o  -L/usr/lib/x86_64-linux-gnu/stubs -Wl,-rpath,/home/ltl/anaconda3/envs/tf_test/lib/python3.7/site-packages/tensorflow_core /home/ltl/anaconda3/envs/tf_test/lib/python3.7/site-packages/tensorflow_core/libtensorflow_framework.so.2 -lcudadevrt -lcudart_static -lrt -lpthread -ldl 

Well, this problem is solved.

I used nm -C to look inside the .so files and found that in TensorFlow>=1.15.0 the function is defined as

0000000000caca90 T tensorflow::OpDefBuilder::Attr(std::string)

while in TensorFlow<=1.14.0 the function is defined as

0000000000c96ed0 T tensorflow::OpDefBuilder::Attr(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)

So the two TensorFlow libraries were compiled with different settings of _GLIBCXX_USE_CXX11_ABI.
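The mangled names tell the same story. In the Itanium C++ name mangling that g++ uses, Ss is the built-in abbreviation for the old-ABI std::string, while NSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE spells out std::__cxx11::basic_string<char, ...>, the new-ABI string type. The E just before either of them closes the nested name tensorflow::OpDefBuilder::Attr. A minimal sketch of how the nm check could be automated, assuming plain (non-demangled) nm output; the helper names nm_output and undefined_cxx11_symbols are made up for illustration:

```python
import subprocess


def nm_output(library_path):
    """Return the raw text printed by `nm` for a shared library (needs binutils)."""
    result = subprocess.run(
        ["nm", library_path], capture_output=True, text=True, check=True
    )
    return result.stdout


def undefined_cxx11_symbols(nm_text):
    """List undefined ('U') symbols whose mangled name mentions std::__cxx11.

    Such symbols were compiled with _GLIBCXX_USE_CXX11_ABI=1 and can only be
    resolved by a library that was built with the same setting.
    """
    found = []
    for line in nm_text.splitlines():
        parts = line.split()
        # Undefined entries look like "U <name>"; defined ones carry an address first.
        if len(parts) >= 2 and parts[-2] == "U" and "__cxx11" in parts[-1]:
            found.append(parts[-1])
    return found
```

Calling undefined_cxx11_symbols(nm_output("build/lib/libtensorflow_ctext.so")) should list the Attr symbol from the error message if the library was built with the new ABI.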

In order to be consistent and avoid these undefined-symbol problems, I need to define -D_GLIBCXX_USE_CXX11_ABI=1 when building against the earlier versions of TensorFlow and -D_GLIBCXX_USE_CXX11_ABI=0 when building against the later versions.
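Rather than hard-coding the value per version, the flag can probably be read from the installed TensorFlow itself, since tf.sysconfig.get_compile_flags() reports the flags the package was built with. A minimal sketch, where cxx11_abi_define is a hypothetical helper name and the way the result is handed to CMake is left open:

```python
def cxx11_abi_define(compile_flags):
    """Return the -D_GLIBCXX_USE_CXX11_ABI=... entry from a list of flags, or None."""
    for flag in compile_flags:
        if flag.startswith("-D_GLIBCXX_USE_CXX11_ABI="):
            return flag
    return None


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("tensorflow is not installed in this environment")
    else:
        # get_compile_flags() returns a list of strings such as "-I..." and "-D...".
        print(cxx11_abi_define(tf.sysconfig.get_compile_flags()))
```

The printed define could then be passed to the build, for example through a CMake cache variable, so that each environment compiles the operators with the ABI setting its TensorFlow uses.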
