
C++ runtime_error catching not consistent across cluster nodes

Problem

I am trying to set up a C++ program to run on a Red Hat Scientific Linux (v5.11) cluster. I have been able to compile the software, and it runs flawlessly on the head node; however, it crashes when run on any worker node.

I have traced the issue to a part of the code where, if some conditions return false, a std::runtime_error is thrown. This is deliberate: when the software is running correctly, the exception is caught and iteration continues. On the worker nodes, the software aborts as soon as the error is first thrown. The abort output and backtrace are shown below.

As it works on one node but not the others, my guess is that this is an issue of gcc versions. To compile, I had to yum install devtoolset-2 and build the software using gcc 4.8.2 (Red Hat 4.8.2-15), as the system gcc 4.1.2 (Red Hat 4.1.2-55) was too old to compile it correctly. When I launch the application on either node I have the following:

which gcc > /opt/rh/devtoolset-2/root/usr/bin/gcc
which c++ > /opt/rh/devtoolset-2/root/usr/bin/c++
which g++ > /opt/rh/devtoolset-2/root/usr/bin/g++
which gfortran > /opt/rh/devtoolset-2/root/usr/bin/gfortran
$LD_LIBRARY_PATH > /opt/rh/devtoolset-2/root/usr/lib64:/opt/rh/devtoolset-2/root/usr/lib

In terms of the differences between the head and worker nodes, they are identical except for their kernel versions:

  • Head: Linux address.com 2.6.18-419.el5 #1 SMP x86_64 x86_64 x86_64 GNU/Linux
  • Worker: Linux address.com 2.6.18-164.11.1.el5 #1 SMP x86_64 x86_64 x86_64 GNU/Linux

Things I have tried:

  • Running on the worker nodes using SGE queue submission (with -V to pass the environment; see the sketch after this list)
  • Running directly on the worker nodes via ssh and exporting the same environment variables as on the head node
  • Compiling and running directly on the worker nodes
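For reference, the SGE submission was a minimal sketch along these lines (the script name is hypothetical; -V exports the submitting shell's environment, including LD_LIBRARY_PATH, to the job, and -cwd runs it in the current directory):

qsub -V -cwd run_job.sh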

Any help would be greatly appreciated! Here are a couple of questions that, if answered, I think would help me narrow down the cause:

  1. Is it worth pursuing the kernel version difference?
  2. Does this look like an issue with libraries and paths rather than with the code?
  3. Has the way C++ error handling works changed between library versions?
  4. Are there more debugging methods I could try to find the cause of this?

Extra info

The abort is as follows:

terminate called after throwing an instance of 'std::runtime_error'
  what():  'custom error message'

Program received signal SIGABRT, Aborted.
0x00000038b6830265 in raise () from /lib64/libc.so.6

The backtrace is as follows:

#0  0x00000038b6830265 in raise () from /lib64/libc.so.6
#1  0x00000038b6831d10 in abort () from /lib64/libc.so.6
#2  0x00000038bb0bec44 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x00000038bb0bcdb6 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00000038bb0bcde3 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x00000038bb0bceca in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x00002aaaab074bdc in Some::Function::Name() () from path/to/file.so
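(For reference, the output above came from running the program under gdb; to stop at the throw itself rather than only at the abort, a session along these lines can be used, with a hypothetical binary name:)

gdb ./myprogram
(gdb) catch throw
(gdb) run
(gdb) backtrace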

I must admit my knowledge of C++ is pretty limited, though I've been trying to improve it over the last two days I've spent battling this problem. Here is a simplified example of the code that throws and catches the error (this is obviously part of a much larger process that repeatedly calls Func1):

// Simplified from the real code: d, e, f, g, ny, x, log, Func3 and Func4 are
// members/globals of the larger program, declared elsewhere.
double Func1(int a, double b, int c)
  {
  for (bool OK = true ; OK && d > e && f < a ; f++)
    {
    try
      {
      for (d = 0, g = 1 ; g < 10 ; g *= 2)
        {
        Func2() ;
        }
      }
    catch (const runtime_error &problem)   // log the error and stop iterating
      {
      *log << problem.what() ;
      OK = false ;
      }
    if (c > 1)
      {
      *log << f << d ;
      }
    }
  return d ;   // Func1 is declared to return a double
  }

void Func2()
  {
  for (int j = 0 ; j < ny && (x & 5) > 0 ; j++)
    {
    if (Func3(j) <= 0.0)
      {
      throw runtime_error("custom error message") ;
      }
    Func4[j] = j ;
    }
  }
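(One note on the snippet: the exception is caught by const reference, catch (const runtime_error &). Catching by value also compiles, but it copies the exception object and will slice derived exception types, so the reference form is preferred. Either way, this is not what causes the abort above.)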

Running ldd on the compiled program (run on the head node; the first line is missing when run on the worker node):

linux-vdso.so.1 =>  (0x00007fff2b6e7000)
/users/username/software/version/Part1/Part1Extra.so (0x00002b3543587000)
libgfortran.so.3 => /usr/lib64/libgfortran.so.3 (0x00002b354385b000)
libm.so.6 => /lib64/libm.so.6 (0x0000003cc2000000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000315f800000)
libc.so.6 => /lib64/libc.so.6 (0x0000003cc1c00000)
/users/username/software/version/Part2/Part2.so (0x00002b3543b4f000)
/users/username/software/version/Part3/Part3.so (0x00002b3543d9b000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003160000000)
/lib64/ld-linux-x86-64.so.2 (0x0000003cc1800000)
/users/username/software/version/Part3/Part3Extra.so (0x00002b3543fb2000)

Finally managed to work out what was going on...

For those who find unexpected errors in C++ code that runs normally on one part of a multi-node system but not on others, even though the system appears to share a common file structure: that assumption can be very misleading, although I'm sure this is rather obvious to those more clued up on system administration.

Initially I was under the impression that the head and worker/computation nodes shared the entire file structure. This is only partly true: the worker nodes had access to certain parts of the file system, but crucially not to core directories such as /lib and /lib64. Essentially, all packages installed via yum were independent on each computation node. Having updated the head node to the correct gcc version (devtoolset-2 in this instance), I was under the impression that each worker node had also been updated. This was not the case.

The Underlying Problem

On the head node, which had libstdc++.x86_64 (v4.1.2-55.el5), C++ code compiled with gcc 4.8.2 (Red Hat 4.8.2-15) caught a thrown std::runtime_error as expected. The same error was not properly caught when running on the worker nodes.

The problem was that the system libstdc++.x86_64 on the worker nodes was older (unfortunately I can't remember exactly which version), which meant the exceptions were not being caught. It appears that sufficiently old versions of libstdc++ cannot catch exceptions thrown by code compiled with gcc 4.8.2. As far as I can tell, the devtoolset compilers still link against the system libstdc++.so.6 at run time, which is why /usr/lib64/libstdc++.so.6 appears in the backtrace above and why the version installed on each node matters.
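A quick way to check a given node is a minimal program that simply throws and catches the same exception type, compiled with the same devtoolset-2 gcc and copied to each node (a sketch):

#include <iostream>
#include <stdexcept>

int main()
  {
  try
    {
    throw std::runtime_error("test") ;
    }
  catch (const std::runtime_error &e)
    {
    // On a healthy node this prints the message; on an affected node the
    // program instead aborts with the same "terminate called after throwing
    // an instance of 'std::runtime_error'" message as above.
    std::cout << "caught: " << e.what() << std::endl ;
    }
  return 0 ;
  }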

The Solution

Each worker node had to be manually updated using yum so that its libstdc++ version was high enough to resolve this issue (v4.1.2-55.el5 in our case). Updating libstdc++ fixed the problem.
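To confirm what each node actually has, the installed package and the runtime library the binary resolves to can be checked per node (the binary name is hypothetical; the GLIBCXX symbols listed indicate which ABI versions the installed library supports):

rpm -q libstdc++
ldd ./myprogram | grep libstdc++
strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX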

Extra Info

In our case the worker nodes were unable to connect to the internet directly, so yum had to be run via a proxy. Our version of yum was also too old to use the socks5h automatic ssh-tunnelling proxy method, so we had to run the squid package on the head node to provide a connection.
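For reference, pointing yum at such a proxy is a one-line setting on each worker node (host name and port are hypothetical; 3128 is squid's default port):

# in /etc/yum.conf
proxy=http://headnode:3128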

Finally, this took a while to figure out because the file-structure differences were very misleading. While the key directories are not shared between nodes, their contents look exactly the same, as the old and new versions of the packages had the exact same file layout, just with modified contents.

Once again, this is probably all very obvious to a system admin, but there you go.
