Synchronization issue with usage of pthread_kill() to terminate thread blocked for I/O

Previously I asked a question about how to terminate a thread blocked on I/O. I have used pthread_kill() instead of pthread_cancel() or writing to pipes, because it has a few advantages.

I have implemented the code to send a signal (SIGUSR2) to the target thread using pthread_kill(). Below is the skeleton code. Most of the time getTimeRemainedForNextEvent() returns a value that blocks poll() for several hours. Because of this large timeout, even if Thread2 sets terminateFlag (to stop Thread1), Thread2 stays blocked in pthread_join() until Thread1's poll() returns (which might be several hours later if there are no events on the sockets). So I send a signal to Thread1 using pthread_kill() to interrupt the poll() system call (if it is blocked).

static void signalHandler(int signum) {
    //Does nothing
}

// Thread 1 (Does I/O operations and handles scheduler events). 

void* Thread1(void* args) {
    terminateFlag = 0;
    while(!terminateFlag) {
        int millis = getTimeRemainedForNextEvent(); //calculate maximum number of milliseconds poll() can block.

        int ret = poll(fds,numOfFDs,millis);
        if(ret > 0) {
            //handle socket events.
        } else if (ret < 0) {
            if(errno != EINTR)      //EINTR only means a signal interrupted poll()
                perror("Poll Error");
            break;
        }

        handleEvent();  
    }
    return NULL;
}

// Thread 2 (Terminates Thread 1 when Thread 1 needs to be terminated)

void* Thread2(void* args) {
    while(1) {

        /* Do other stuff */

        if(terminateThread1) {
            terminateFlag = 1;
            pthread_kill(ftid,SIGUSR2); //ftid is pthread_t variable of Thread1
            pthread_join( ftid, NULL );
        }
    }

    /* Do other stuff */
} 

The above code works fine if Thread2 sets terminateFlag and sends the signal while Thread1 is blocked in the poll() system call. But if a context switch happens after Thread1's call to getTimeRemainedForNextEvent(), and Thread2 sets terminateFlag and sends the signal at that point, the signal is handled before poll() starts. Nothing is left to interrupt the system call, so Thread1's poll() blocks for several hours.

It seems I cannot use a mutex for synchronization, because Thread1 would hold the lock across poll() until it gets unblocked. Is there any synchronization mechanism I can apply to avoid the issue described above?

Consider having an additional file descriptor in the set of fds passed to poll whose sole job is to make poll return when you want to terminate the thread.

Thus, in thread 2 we would have something like:

    if (terminateThread1) {
        terminateFlag = 1;
        send (terminate_fd, " ", 1, 0);   // wakes up poll() in thread 1
        pthread_join (ftid, NULL);
    }

And terminate_fd would be in the set of fds passed to poll by thread 1.
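A minimal sketch of this wake-up descriptor idea, assuming a pipe() and write() in place of the socket and send() shown above. The names wake_pipe, worker and run_wakeup_demo are hypothetical and exist only for illustration.

```c
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical wake-up pipe: [0] is polled by the worker, [1] is written by the controller. */
static int wake_pipe[2];

/* Worker: blocks in poll() with no timeout until the wake-up pipe becomes readable. */
static void *worker(void *arg)
{
    (void)arg;
    struct pollfd fds[1] = { { .fd = wake_pipe[0], .events = POLLIN } };

    for (;;) {
        int ret = poll(fds, 1, -1);           /* -1: block indefinitely */
        if (ret < 0 && errno == EINTR)
            continue;                         /* unrelated signal: retry */
        if (ret < 0)
            break;                            /* real error */
        if (fds[0].revents & (POLLIN | POLLHUP))
            break;                            /* termination requested */
        /* ... handle the other descriptors here ... */
    }
    return NULL;
}

/* Starts the worker, asks it to stop, and joins it. Returns 0 on success. */
int run_wakeup_demo(void)
{
    pthread_t tid;

    if (pipe(wake_pipe) != 0)
        return -1;
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;

    /* The byte stays in the pipe until it is read, so it does not matter
       whether the worker is already inside poll() or has not reached it yet. */
    ssize_t written = write(wake_pipe[1], "x", 1);

    int rc = pthread_join(tid, NULL);
    close(wake_pipe[0]);
    close(wake_pipe[1]);
    return (written == 1 && rc == 0) ? 0 : -1;
}
```

Because readiness of a descriptor is level-triggered state rather than a one-shot event, the race described in the question does not arise here: a wake-up written "too early" is still visible when poll() is finally entered.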

-- OR --

If the overhead of having an extra fd per thread is too much (as discussed in the comments), then send something to one of the existing fds that thread 1 ignores. This will cause poll to return and then thread 1 will terminate. You can even have this 'special' value act as the terminate flag, which makes the logic a little tidier.

In the first place, access to the shared variable terminateFlag by multiple threads must be protected by a mutex or similar synchronization mechanism, else your program is non-conforming and all bets are off. That might, for instance, look like this:

void *Thread1(void *args) {
    pthread_mutex_lock(&a_mutex);
    terminateFlag = 0;
    while(!terminateFlag) {
        pthread_mutex_unlock(&a_mutex);

        // ...

        pthread_mutex_lock(&a_mutex);
    }
    pthread_mutex_unlock(&a_mutex);
}

void* Thread2(void* args) {
    // ...

    if (terminateThread1) {
        pthread_mutex_lock(&a_mutex);
        terminateFlag = 1;
        pthread_mutex_unlock(&a_mutex);
        pthread_kill(ftid,SIGUSR2); //ftid is pthread_t variable of Thread1
        pthread_join( ftid, NULL );
    }

    // ...
} 

But that does not solve the main problem: a signal sent by thread 2 may be delivered to thread 1 after it tests terminateFlag but before it calls poll(), though it does narrow the window in which that could happen.

The cleanest solution is the one already suggested by @PaulSanders' answer: have thread 2 wake thread 1 via a file descriptor that thread 1 is polling (i.e. by means of a pipe). Inasmuch as you seem to have a plausible reason to seek an alternative approach, however, it should also be possible to make your signaling approach work by appropriate use of signal masking. Expanding on @Shawn's comment, here's how it would work:

  1. The parent thread blocks SIGUSR2 before starting thread 1, so that the latter, which inherits its signal mask from its parent, starts with that signal blocked.

  2. Thread 1 uses ppoll() instead of poll(), so as to be able to specify that SIGUSR2 will be unblocked for the duration of that call. ppoll() does the signal mask handling atomically, so there is no opportunity for a signal to be lost when it is blocked before the call and unblocked within it.

  3. Thread 2 uses pthread_kill() to send SIGUSR2 to thread 1 to make it stop. Because that signal is only unblocked for that thread while it is performing a ppoll() call, it will not be lost (blocked signals remain pending until unblocked). This is precisely the kind of usage scenario for which ppoll() is designed.

  4. You should even be able to do away with the terminateFlag variable and associated synchronization, because you should be able to rely upon the signal being delivered during a ppoll() call and therefore causing the EINTR code path to be exercised. That path does not rely on terminateFlag to make the thread stop.
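The steps above can be sketched in a single thread, which makes the "signal sent too early" case reproducible on demand. This is only an illustration under stated assumptions: the function name ppoll_sees_pending_signal is hypothetical, the thread signals itself instead of being signalled by a second thread, and ppoll() requires a Linux/glibc-style environment with _GNU_SOURCE defined.

```c
#define _GNU_SOURCE            /* ppoll() is declared only with this on glibc */
#include <errno.h>
#include <poll.h>
#include <pthread.h>
#include <signal.h>
#include <stddef.h>
#include <time.h>

/* The handler only has to exist so that SIGUSR2 interrupts ppoll() instead of killing the process. */
static void on_sigusr2(int signum) { (void)signum; }

/* Returns 1 if ppoll() was interrupted by the already pending SIGUSR2,
   0 if it was not, and -1 on a setup error. */
int ppoll_sees_pending_signal(void)
{
    struct sigaction sa;
    sa.sa_handler = on_sigusr2;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR2, &sa, NULL) != 0)
        return -1;

    /* Step 1: block SIGUSR2 (in the real program, before pthread_create()). */
    sigset_t block, during_poll;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR2);
    if (pthread_sigmask(SIG_BLOCK, &block, &during_poll) != 0)
        return -1;
    sigdelset(&during_poll, SIGUSR2);   /* mask in effect only inside ppoll() */

    /* Step 3, deliberately early: the signal arrives before ppoll() is
       entered and stays pending because it is blocked. */
    if (pthread_kill(pthread_self(), SIGUSR2) != 0)
        return -1;

    /* Step 2: ppoll() swaps the mask and waits atomically, so the pending
       signal is delivered inside the call. The 5 s timeout is a safety net. */
    struct timespec timeout = { .tv_sec = 5, .tv_nsec = 0 };
    int ret = ppoll(NULL, 0, &timeout, &during_poll);

    return (ret == -1 && errno == EINTR) ? 1 : 0;
}
```

With plain poll() and an unblocked SIGUSR2, the same early signal would run the handler before the call and poll() would then sleep for the full timeout, which is the lost wake-up described in the question.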

As you say yourself, you could use thread cancellation to solve this. Outside of thread cancellation, I don't think there's a "right" way to solve this within POSIX (waking up the poll call with a write isn't exactly a generic method that would work for all situations in which a thread might get blocked), because POSIX's paradigm for making syscalls and handling signals simply doesn't allow you to close the gap between a flag check and a potentially long blocking call.

void handler(int sig) { dont_enter_a_long_blocking_call_flg=1; }
int main()
{  //...
    if(!dont_enter_a_long_blocking_call_flg)
        //THE GAP; what if the signal arrives here ?
        potentially_long_blocking_call();
    //....
}

The musl libc library uses signals for thread cancellation (because signals can break long-blocking calls that are in kernel mode), and it uses them in conjunction with global assembly labels so that from the flag-setting SIGCANCEL handler, it can do (conceptually, I'm not pasting their actual code):

void sigcancel_handler(int Sig, siginfo_t *Info, void *Uctx)
{
    thread_local_cancellation_flag=1;
    if_interrupted_the_gap_move_Program_Counter_to_start_cancellation(Uctx);
}

Now suppose you changed if_interrupted_the_gap_move_Program_Counter_to_start_cancellation(Uctx); to if_interrupted_the_gap_move_Program_Counter_to_make_the_syscall_fail(Uctx); and exported the if_interrupted_the_gap_move_Program_Counter_to_make_the_syscall_fail function along with the thread_local_cancellation_flag.

Then you can use it to*:

  • solve your problem robustly
  • implement robust signal cancellation with any signal, without having to put any of that pthread_cleanup_{push,pop} stuff into your already working thread-safe single-threaded code
  • ensure an assured normal-context reaction to a signal delivery in your target thread even if the signal is handled.

Basically, without a libc extension like this, if you kill()/pthread_kill() a process/thread with a signal it handles, or if you put a function on a signal-sending timer, you cannot be sure of an assured reaction to the signal delivery, as the target may well receive the signal in a gap like the one above and hang indefinitely instead of responding to it.

I've implemented such a libc extension on top of musl libc and published it at https://github.com/pskocik/musl . The SIGNAL_EXAMPLES directory also shows some kill(), pthread_kill, and setitimer() examples that, under a demonstrated race condition, hang with classical libcs but don't with my extended musl. You can use that extended musl to solve your problem cleanly, and I also use it in my personal project to do robust thread cancellation without having to litter my code with pthread_cleanup_{push,pop}.

The obvious downside of this approach is that it's unportable, and I only have it implemented for x86_64 musl. I've published it today in the hope that somebody (Cygwin, MacOSX?) copies it, because I think it's the right way to do cancellation in C.

In C++ and with glibc, you could utilize the fact that glibc uses exceptions to implement thread cancellation and simply use pthread_cancel (which uses a signal (SIGCANCEL) underneath), but catch it instead of letting it kill the thread.


Note:

I'm really using two thread-local flags: a breaking flag that breaks the next syscall with ECANCELED if set before the syscall is entered (an EINTR returned from a potentially long-blocking syscall gets turned into ECANCELED in the modified libc-provided syscall wrapper iff the breaking flag is set), and a saved breaking flag. The moment a breaking flag has been used, it is saved in the saved breaking flag and zeroed, so that the breaking flag doesn't break further potentially long-blocking syscalls.

The idea is that cancelling signals are handled one at a time (the signal handler can be left with all/most signals blocked; the handler code (if any) can then unblock them) and that correctly checking code starts unwinding, i.e., cleaning up while returning errors, the moment it sees an ECANCELED. Then, the next potentially long-blocking syscall could be in the cleanup code (e.g., code that writes </html> to a socket), and that syscall must be enterable (if the breaking flag stayed on, it wouldn't be). Of course, cleanup code containing e.g. write(1,"</html>",...) could block indefinitely too, but you could write the cleanup code so that the potentially long-blocking syscall there runs under a timer when the cleanup is due to an error (ECANCELED is an error). As I've already mentioned, robust, race-condition-free, signal-driven timers are one of the things this extension allows.

The EINTR => ECANCELED translation happens so that code looping on EINTR knows when to stop looping. Many EINTRs (= a signal interrupted a syscall) cannot be prevented, and the code should simply handle them by retrying the syscall. I'm using ECANCELED as an "EINTR after which you shouldn't retry."
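As a rough illustration of that convention, here is the usual retry-on-EINTR loop. The wrapper name read_retrying is hypothetical, and a stock libc never turns EINTR into ECANCELED for read(); only the extended libc described above is said to do so.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Retries read() while it fails with EINTR. Any other errno ends the loop,
   so under the extended libc an ECANCELED result would stop the retrying
   and let the caller start cleaning up. */
ssize_t read_retrying(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}
```

The loop itself needs no knowledge of cancellation: it only distinguishes "interrupted, try again" from everything else, which is what makes a distinct error code a convenient stop signal.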
