Kill all processes started by a Bash script

Question

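I have a number of bash scripts that perform many similar tasks and rely on some external binaries. The problem is that the binaries often do not exit as they should. Since my scripts run thousands of times, idle or nearly dead instances of these processes pile up quickly. I cannot fix the programs, so I need to make sure my bash scripts terminate them.

There are already some topics on SE that deal with terminating the processes of a bash script. I have applied and tested what is written there, and it works to some extent. But it is not good enough for my case, and I do not understand why, so I am opening a new question.

My scripts are hierarchical, shown here in simplified form: script A calls script B, and script B runs multiple instances of script C in parallel to use all CPUs. For example, script B runs 5 instances of script C in parallel, and when one instance finishes it starts a new one, for thousands of runs of script C in total. Script C calls several external binaries/commands that do not terminate well. They run in the background in parallel and communicate with each other.

However, my script C can detect when the external commands have finished their work, even if they have not terminated, and then my bash script exits.

To terminate all external programs when the bash script finishes, I added an exit trap: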

# Exit cleanup
cleanup_exit() {
    # Running the termination in an own process group to prevent it from preliminary termination. Since it will run in the background it will not cause any delays
    setsid nohup bash -c "
        touch /tmp/trace_1  # To see if this code was really executed to this point

        # Trapping signals to prevent that this function is terminated preliminary
        trap '' SIGINT SIGQUIT SIGTERM SIGHUP ERR
        touch /tmp/trace_2  # To see if this code was really executed to this point

        # Terminating the main processes
        kill ${pids[@]} 1>/dev/null 2>&1 || true
        touch /tmp/trace_3
        sleep 5
        touch /tmp/trace_4
        kill -9 ${pids[@]} 1>/dev/null 2>&1 || true
        touch /tmp/trace_5

        # Terminating the child processes of the main processes
        echo "Terminating the child processes"
        pkill -P ${pids[@]} 1>/dev/null 2>&1 || true
        touch /tmp/trace_6
        sleep 1
        pkill -9 -P ${pids[@]} 1>/dev/null 2>&1 || true
        touch /tmp/trace_7

        # Terminating everything else which is still running and which was started by this script
        pkill -P $$ || true
        touch /tmp/trace_8
        sleep 1
        pkill -9 -P $$ || true
        touch /tmp/trace_9
    "
}
trap "cleanup_exit" SIGINT SIGQUIT SIGTERM EXIT

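This seems to work if I run only a few instances of script C in parallel. If I increase the number, e.g. to 10 (the workstation is powerful and should be able to handle dozens of parallel instances of script C and the parallel external programs), it no longer works, and hundreds of instances of the external programs accumulate quickly.

But I do not understand why. For example, one of the accumulated processes has PID 32048. In the log I can see the exit trap being executed: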

+ echo ' * Snapshot 190 completed after 3 seconds.'
 * Snapshot 190 completed after 3 seconds.
+ break
+ cleanup_exit
+ echo

+ echo ' * Cleaning up...'
 * Cleaning up...
+ setsid nohup bash -c '
        touch /tmp/trace_1  # To see if this code was really executed to this point

        # Trapping signals to prevent that this function is terminated preliminary
        trap '\'''\'' SIGINT SIGQUIT SIGTERM SIGHUP ERR
        touch /tmp/trace_2  # To see if this code was really executed to this point

        # Terminating the main processes
        kill 31678' '32048 1>/dev/null 2>&1 || true
        touch /tmp/trace_3
        sleep 5
        touch /tmp/trace_4
        kill -9 31678' '32048 1>/dev/null 2>&1 || true
        touch /tmp/trace_5

        # Terminating the child processes of the main processes
        pkill -P 31678' '32048 1>/dev/null 2>&1 || true
        touch /tmp/trace_6
        sleep 1
        pkill -9 -P 31678' '32048 1>/dev/null 2>&1 || true
        touch /tmp/trace_7

        # Terminating everything else which is still running and which was started by this script
        pkill -P 31623 || true
        touch /tmp/trace_8
        sleep 1
        pkill -9 -P 31623 || true
        touch /tmp/trace_9
    '

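Clearly, the PID of this process was passed to the exit trap, but the process did not exit. As a test, I ran the kill command on this process again by hand, and then it did exit.

Most interestingly, only the trace files up to number 5 appear. None beyond 5, but why?

Update: I just found out that even if I run only one instance of script C at a time (i.e. sequentially), it works well only for a while. At some point the processes suddenly stop being terminated and start to hang around forever and accumulate. A single instance at a time should not overload the machine. In my log files the exit trap is still called correctly, just as before, with no difference. Memory is available as well, and the CPUs are partly free too.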

Answer 1

A good sanity check for any shell script is to run ShellCheck on it:

Line 9:
        kill ${pids[@]} 1>/dev/null 2>&1 || true
             ^-- SC2145: Argument mixes string and array. Use * or separate argument.

Indeed, your xtrace shows something odd on this line:

kill 31678' '32048 1>/dev/null 2>&1 || true
          ^^^--- What is this?

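The problem here is that your ${pids[@]} expands to multiple words, and bash -c interprets only the first word as the command string. Here is a simplified example: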

pids=(2 3 4)
bash -c "echo killing ${pids[@]}"

This ends up writing killing 2 with no mention of 3 or 4. It is equivalent to running

bash -c "echo killing 2" "3" "4"

where the other pids merely become the positional parameters $0 and $1 rather than part of the command that is executed.
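The word splitting described above can be made visible with a small experiment. This is only a sketch: `count_args` is a hypothetical helper introduced for illustration, and the second part spells out the "equivalent" command by hand, with the command string extended to print $0 and $1.

```shell
#!/usr/bin/env bash
pids=(2 3 4)

# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

at_words=$(count_args "echo killing ${pids[@]}")    # one word per array element
star_words=$(count_args "echo killing ${pids[*]}")  # elements joined into one word
echo "with @: $at_words word(s), with *: $star_words word(s)"

# Hand-written form of what bash -c receives in the @ case: only the first
# word is run as a command; 3 and 4 arrive as $0 and $1 and are never executed.
seen=$(bash -c 'echo "killing 2 (\$0=$0, \$1=$1)"' 3 4)
echo "$seen"
```

With the default IFS, the first line should report 3 words for @ and 1 word for *, and the second line shows the stray pids sitting in $0 and $1.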

Instead, as ShellCheck suggests, you want * to join all the pids with spaces and insert them as a single argument:

pids=(2 3 4)
bash -c "echo killing ${pids[*]}"

which prints killing 2 3 4.
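Applied to actual kill calls, the same idea might look like the sketch below. The `sleep` jobs are stand-ins for the external binaries, and the second variant (passing the pids as positional parameters to a single-quoted script) is an alternative that is not part of the answer above; it is shown only as one possible way to avoid interpolating the array into the command string at all.

```shell
#!/usr/bin/env bash
# Stand-in background jobs (hypothetical; the real script starts external binaries).
sleep 300 & pids=($!)
sleep 300 & pids+=($!)

# Fix from the answer: * joins the pids with spaces inside the command string.
bash -c "kill ${pids[*]} 2>/dev/null || true"
wait "${pids[@]}" 2>/dev/null || true   # reap the terminated jobs

# Count how many of the pids still exist (signal 0 only tests for existence).
remaining=0
for pid in "${pids[@]}"; do
    if kill -0 "$pid" 2>/dev/null; then remaining=$((remaining + 1)); fi
done
echo "still running after the * variant: $remaining"

# Alternative (an assumption, not from the answer): keep the script in single
# quotes and hand the pids over as positional parameters; "_" fills $0.
sleep 300 & more=($!)
sleep 300 & more+=($!)
bash -c 'kill "$@" 2>/dev/null || true' _ "${more[@]}"
wait "${more[@]}" 2>/dev/null || true

remaining_alt=0
for pid in "${more[@]}"; do
    if kill -0 "$pid" 2>/dev/null; then remaining_alt=$((remaining_alt + 1)); fi
done
echo "still running after the positional variant: $remaining_alt"
```

In the positional variant the @ expansion is intentional: each pid becomes its own argument to the inner shell, and "$@" inside the single-quoted script passes them all to kill.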

Answer 1 (5, accepted, 2017-10-01 16:56:37)
