Why can't a parent shell process that is waiting for completion reliably receive USR1 signals sent from background jobs in a Bash script?

Question

I have a Bash script running a bunch of background jobs in parallel. Under certain conditions, before a background job completes, it sends a USR1 signal to the spawning Bash process (say, to report that some process run as part of the job terminated with a nonzero exit code).

In simplified form, the script is equivalent to the one shown below. Here, for simplicity, each background job always sends a USR1 signal before completion, unconditionally (via the signalparent() function).

signalparent() { kill -USR1 $$; }
handlesignal() { echo 'USR1 signal caught' >&2; }
trap handlesignal USR1

for i in {1..10}; do
    {
        sleep 1
        echo "job $i finished" >&2
        signalparent
    } &
done
wait

When I run the above script (using Bash 3.2.57 on macOS 11.1, at least), I observe some behavior that I cannot explain, which makes me think that I am overlooking something in the interplay of Bash job management and signal trapping.

Specifically, I would like an explanation for the following behaviors.

Almost always, when I run the script, I see fewer “signal caught” lines in the output (from the handlesignal() function) than there are jobs started in the for loop. Most of the time, only one to four of those lines are printed for the ten jobs started.
Why is it that, by the time the wait call completes, there are still background jobs whose signaling kill commands have not yet been executed?
At the same time, every so often, in some invocations of the script, the kill command (from the signalparent() function) reports an error saying that the originating process running the script (i.e., the one with the $$ PID) is no longer present; see the output below.
How come there are jobs whose signaling kill commands are still running after the parent shell process has already terminated? It was my understanding that it is impossible for the parent process to terminate before all background jobs do, due to the wait call.
```
job 2 finished
job 3 finished
job 5 finished
job 4 finished
job 1 finished
job 6 finished
USR1 signal caught
USR1 signal caught
job 10 finished
job 7 finished
job 8 finished
job 9 finished
bash: line 3: kill: (19207) - No such process
bash: line 3: kill: (19207) - No such process
bash: line 3: kill: (19207) - No such process
bash: line 3: kill: (19207) - No such process
```

Both of these behaviors suggest to me the presence of a race condition of some kind, whose origins I do not quite understand. I would appreciate it if anyone could enlighten me on those, and perhaps even suggest how the script could be changed to avoid such race conditions.

Answer 1

This is explained in the Bash Reference Manual as follows.

When bash is waiting for an asynchronous command via the wait builtin, the reception of a signal for which a trap has been set will cause the wait builtin to return immediately with an exit status greater than 128, immediately after which the trap is executed.

So, you need to repeat wait until it returns 0 to make sure all background jobs have terminated, e.g.:

until wait; do
    :
done

"It was my understanding that it is impossible for the parent process to terminate before all background jobs do, due to the wait call."

That is a misunderstanding; wait may return because a signal with a trap set was received while jobs are still running in the background. The program may then complete normally, with the side effect of leaving those jobs orphaned.
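
The interruption described above can be observed directly. The following is a minimal sketch rather than part of the original answer: it runs a small demo script in a child Bash process (so that `$$` refers to that child) in which a single background job signals its parent halfway through its work. The `demo` and `output` variable names and the messages are illustrative assumptions.

```shell
# Script text for a child Bash process; single quotes keep $$ and $?
# from being expanded by the outer shell.
demo='
trap "echo USR1 signal caught >&2" USR1

# The job signals the parent first, then keeps working for another second.
{ sleep 1; kill -USR1 $$; sleep 1; echo "job done"; } &

# The first wait is cut short by the trapped signal.
wait
status=$?
if [ "$status" -gt 128 ]; then
    echo "wait interrupted by signal"
fi

# Retrying until wait returns 0 ensures the job has really finished.
until wait; do
    :
done
echo "all jobs finished"
'

output=$(bash -c "$demo")
printf '%s\n' "$output"
```

If Bash behaves as the manual excerpt describes, the captured standard output should read "wait interrupted by signal", then "job done", then "all jobs finished", showing that the first wait returned while the job was still running.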

Answer 2

Regarding 'Almost always, when I run the script, I see fewer “signal caught” lines in the output':

According to signal(7):

Standard signals do not queue. If multiple instances of a standard signal are generated while that signal is blocked, then only one instance of the signal is marked as pending (and the signal will be delivered just once when it is unblocked).

One way to change your script so that the signals do not arrive at the same time is as follows:

signalparent() {
    kill -USR1 $$
}

ncaught=0
handlesignal() {
    (( ++ncaught ))
    echo "USR1 signal caught (#=$ncaught)" >&2
}
trap handlesignal USR1

for i in {1..10}; do
    {
        sleep $i
        signalparent
    } &
done

nwaited=0
while (( nwaited < 10 )); do
    wait && (( ++nwaited ))
done

Here is the output of the modified script with Bash 5.1 on macOS 10.15:

USR1 signal caught (#=1)
USR1 signal caught (#=2)
USR1 signal caught (#=3)
USR1 signal caught (#=4)
USR1 signal caught (#=5)
USR1 signal caught (#=6)
USR1 signal caught (#=7)
USR1 signal caught (#=8)
USR1 signal caught (#=9)
USR1 signal caught (#=10)
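
Staggering the sleeps avoids coalesced signals only through timing. If the goal is just to learn which jobs failed, one alternative that neither answer proposes is to skip signals and let each job report through its exit status, which the parent collects with `wait PID`. The sketch below assumes a hypothetical `run_job` function that fails for even-numbered inputs; the `pids` and `failed` names are likewise illustrative.

```shell
# Hypothetical job: succeeds for odd arguments, fails for even ones.
run_job() {
    sleep 1
    [ $(( $1 % 2 )) -eq 1 ]
}

# Start the jobs in parallel and remember each PID.
pids=""
for i in 1 2 3 4; do
    run_job "$i" &
    pids="$pids $!"
done

# wait PID returns that job's exit status, so no failure report can be
# lost to signal coalescing. No trap is set, so wait is not interrupted.
failed=0
for pid in $pids; do
    wait "$pid" || failed=$((failed + 1))
done
echo "failed jobs: $failed"
```

With the four jobs shown, jobs 2 and 4 fail, so the script prints "failed jobs: 2". The count depends only on exit statuses, not on when the jobs happen to finish.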

Answer 1: score 4, accepted, 2020-12-29 08:11:49

Answer 2: score 2, 2020-12-29 09:10:29