Segfault in pthread_join()

Question

The program below processes nbFiles files using one worker thread per GROUPSIZE files. No more than MAXNBRTHREADS worker threads run in parallel. A watchDog() thread (thread 0) is used to shepherd the identical PTHREAD_CANCEL_DEFERRED workers. If any one of the workers fails, it signals the watchDog with pthread_cond_signal(&errCv) under the protection of the global mutex mtx, passing its thread ID via the errIndc predicate. watchDog then cancels all running threads (the global oldest holds the ID of the oldest thread still alive, to help it do this) and exits the program.

// compile with: gcc -Wall -Wextra -Wconversion -pedantic -std=c99 -g -D_BSD_SOURCE -pthread -o pFiles pFiles.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <stdint.h>
#include "pthread.h"

#define INDIC_ALL_DONE_OK  -1

typedef  int_fast32_t  int32;
typedef uint_fast32_t uint32;

uint32 MAXNBRTHREADS = 10; // no more than this amount of threads running in parallel
uint32 GROUPSIZE = 10;   // how many files per thread

uint32 nbFiles, gThID;  // total #files, group ID for a starting thread

int32 errIndc = 0;  // global thread error indicator

pthread_t *thT;     // pthread table
void **retVals;     // thread ret. val. table, needed in stop_watchDog()
uint32 gThCnt;   // calculated size of thT[]
uint32 thCnt, oldest;  // running threads count (as they are created), oldest thread *alive*

pthread_cond_t  errCv = PTHREAD_COND_INITIALIZER;  // thread-originated error signal
pthread_mutex_t mtx   = PTHREAD_MUTEX_INITIALIZER; // mutex to protect errIndc

// Worker thread
void *processFileGroup(void *arg) {
  int32 err;
  int last_state, last_type;
  uint32 i, thId = (uint32)(intptr_t) arg;

  fprintf(stderr, "th %ld started\n", thId);

  pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &last_state);
  pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &last_type);

  // Artificial error in thread 17
  if(thId==17) {
    pthread_mutex_lock(&mtx);
      errIndc = (int32) thId;
      pthread_cond_signal(&errCv);
    pthread_mutex_unlock(&mtx);
    pthread_exit((void *)(intptr_t)err); }

  for(i = 0; i < GROUPSIZE ; i++) {  // simulate processing GROUPSIZE files
    pthread_testcancel();
    err = 0;
    if(usleep(10000)) { err = 1; break; }
  }

  //fprintf(stderr, "  -- th %ld done with err = %ld\n", thId, err);
  if(err!=0) { // Signal watch dog
    pthread_mutex_lock(&mtx);
      errIndc = (int32) thId;
      pthread_cond_signal(&errCv);
    pthread_mutex_unlock(&mtx);
    pthread_exit((void *)(intptr_t) err);
  }

  pthread_exit((void *)(intptr_t) err);
}

// Mishap : cancel existing threads, exit program
int32 cancel_exit(int32 rc, int32 faultyThId, char *msg) {
  uint32 j; int32 rval;
  void *retVal;
  if(rc==0) return 0;
  if(msg!=NULL && msg[0]=='\0') fprintf(stderr, "\nError in thread %ld. Stoping..\n", faultyThId);
  else                          fprintf(stderr, "\n%s %ld. Stop.\n\n", msg, faultyThId);
  for(j = oldest; j < thCnt ; j++) pthread_cancel(thT[j]);
  for(j = oldest; j < thCnt ; j++){
    pthread_join(thT[j], &retVal); rval = (int)(intptr_t) retVal;
    //if(retVal == PTHREAD_CANCELED || rval==115390242)
    if(retVal == PTHREAD_CANCELED)
         fprintf(stderr, "  cexit: thread %ld canceled\n", j);
    else fprintf(stderr, "  cexit: thread %ld finished, rc = %ld\n", j, rval);
  }
  pthread_join(thT[4], &retVal); rval = (int)(intptr_t) retVal; fprintf(stderr, "  cexit1: thread 4 finished, rc = %ld\n", rval);
  fprintf(stderr, "Processing stopped\n\n");
  exit(EXIT_FAILURE); return rc;
}

// Watch dog thread
// it fires on signal from one of the running threads about a mishap
void *watchDog(void *arg) {
  int32 err;
  pthread_mutex_lock(&mtx);
    while (errIndc == 0) {
      pthread_cond_wait(&errCv,&mtx);
      if(errIndc == INDIC_ALL_DONE_OK){   // main() says we're done with no issues
        pthread_mutex_unlock(&mtx);
        err = 0; pthread_exit((void *)(intptr_t) err);
      }
    }
  pthread_mutex_unlock(&mtx);
  fprintf(stderr, "watch dog: stopping on error indication %ld\n", errIndc);
  cancel_exit(1, errIndc, "");
  exit(EXIT_FAILURE); return arg;// not reached
}

void stop_watchDog() {
  pthread_mutex_lock(&mtx);
    errIndc = INDIC_ALL_DONE_OK;
    pthread_cond_signal(&errCv);
  pthread_mutex_unlock(&mtx);
  pthread_join(thT[0], &retVals[0]);
}

int main() {

  uint32 i, k;
  int32 rc;

  nbFiles = 950;
  gThCnt = 1+nbFiles/GROUPSIZE;

  if(gThCnt > MAXNBRTHREADS)
    fprintf(stderr, "running max %ld threads in parallel\n", MAXNBRTHREADS);
  else fprintf(stderr, "using %ld worker thread(s)\n", gThCnt);

  gThCnt++; // account for watchDog (thread 0)

  thT = (pthread_t *) calloc(gThCnt, sizeof(pthread_t));  if(thT==NULL) { perror("calloc"); exit(EXIT_FAILURE); }
  retVals = (void **) calloc( (nbFiles/GROUPSIZE), sizeof(void *));  if(retVals==NULL) { perror("calloc"); exit(EXIT_FAILURE); }

  // Start watch dog
  rc = pthread_create(&thT[0], NULL, watchDog, NULL);
  if(rc != 0) { fprintf(stderr,"pthread_create() failed for thread 0\n"); exit(EXIT_FAILURE); }
  thCnt = 1;

  i = 0; oldest = 1;
  while(thCnt<gThCnt) {
    pthread_mutex_lock(&mtx);
      if(errIndc != 0){   // watchDog is already tearing down the whole system, no point in creating more threads
        pthread_join(thT[0], &retVals[0]); // wait on WatchDog thread, which never returns (it cancel_exists).
        exit(EXIT_FAILURE);  // not reached
      }
    pthread_mutex_unlock(&mtx);

    gThID = thCnt;
    rc = pthread_create(&thT[thCnt], NULL, processFileGroup, (void *)(intptr_t) gThID);
    if(rc != 0) {
      fprintf(stderr,"pthread_create() failed for thread %ld\n", thCnt);
      stop_watchDog();
      cancel_exit(1, (int32)thCnt, "Could not create thread");
    }
    thCnt++;
    if(thCnt>MAXNBRTHREADS) {  // wait for the oldest thread to finish
      pthread_mutex_lock(&mtx);
        if(errIndc != 0) {   // watchDog is already tearing down the whole system, he'll report the rc of thread "oldest"
          printf("[MAXNBRTHREADS] errIndc=%ld, joining watchDog\n", errIndc);
          pthread_join(thT[0], &retVals[0]); // wait on WatchDog thread, which never returns (it cancel_exists).
          exit(EXIT_FAILURE);  // not reached
        }
      pthread_mutex_unlock(&mtx);
      pthread_join(thT[oldest], &retVals[oldest]); rc = (int)(intptr_t) retVals[oldest];
      fprintf(stderr, "[MAXNBRTHREADS] Thread %ld done with rc = %ld\n", oldest, rc);
      oldest++;
    }
  }
  k = oldest;
  while(k<thCnt) {
    pthread_mutex_lock(&mtx);
      if(errIndc != 0){   // watchDog is already tearing down the whole system, he'll report the rc of thread k
        pthread_join(thT[0], &retVals[0]); // wait on WatchDog thread, which never returns (it cancel_exists).
        exit(EXIT_FAILURE);  // not reached
      }
    pthread_mutex_unlock(&mtx);
    pthread_join(thT[k], &retVals[k]); rc = (int)(intptr_t) retVals[k];
    fprintf(stderr, "Thread %ld done with rc = %ld\n", k, rc);
    oldest = ++k;
  }

  // Signal watch dog to quit
  stop_watchDog();

  exit(EXIT_SUCCESS);

}

Line 82 (the pthread_join(thT[4], &retVal) call near the end of cancel_exit()) causes this program to segfault. Why? Is it illegal to join a canceled thread?

If you comment out line 82, other issues show up. In roughly 3 runs out of 4 you see one of these pathological outcomes:

How can thread 11 have two different exit codes?

.. 
watch dog: stopping on error indication 17

Error in thread 17. Stoping..
th 19 started
  cexit: thread 11 finished, rc = 115390242
[MAXNBRTHREADS] Thread 11 done with rc = -1

Sometimes the program will hang in the MAXNBRTHREADS section:

...
[MAXNBRTHREADS] errIndc=17, joining watchDog

There is apparently a race condition in this section, but I couldn't figure it out.

Any help appreciated.

Answer 1

You ask:

Line 82 causes this program to segfault. Why? Is it illegal to join a canceled thread?

POSIX does not say that in so many words, but it certainly seems to imply so. The specifications for pthread_join() say:

The behavior is undefined if the value specified by the thread argument to pthread_join() does not refer to a joinable thread.

and later, in the RATIONALE,

If an implementation detects use of a thread ID after the end of its lifetime, it is recommended that the function should fail and report an [ESRCH] error.

The segfault you observe is not consistent with the (non-normative) recommendation in the rationale, but the rationale does support the proposition that a thread is no longer a "joinable thread" after its lifetime has ended (e.g. because it has been canceled), for otherwise the recommendation would be inconsistent with the function's specified behavior. Certainly threads that have already been joined are no longer joinable, though the reason to use "joinable" instead of "live" or similar is probably more the provisions for detaching threads.

How can thread 11 have two different exit codes?

It can't, and your output does not demonstrate otherwise. You are joining thread 11 twice, so at least one of those pthread_join() calls must fail. When a call fails, you cannot rely on any result value that it might have stored (not based on POSIX, anyway). You ought to check the return values of your function calls for error indications.

Sometimes the program will hang in MAXNBRTHREADS section

Yes, it appears that it could.

The idea here appears to be that in the failure case, the main thread will call stop_watchDog(), which will set a flag to advise the watchdog thread that it should stop, and then signal the condition variable to make the watchdog wake up and notice it. When it does wake up, the watchdog thread must re-acquire mutex mtx before it can return from pthread_cond_wait().

After returning from stop_watchDog(), the main thread locks mutex mtx and attempts to join the watchdog thread. But signaling a CV is not synchronous. It is therefore possible that the main thread locks the mutex before the watchdog thread reacquires it, in which case you will deadlock: the watchdog cannot return from pthread_cond_wait() and proceed to terminate until it acquires the mutex, but the main thread will not unlock the mutex until the watchdog terminates.

I haven't analyzed the program enough to be sure exactly what state the main thread needs to protect there, though it appears to include at least the errIndc variable. Either way, however, it does not appear to need to hold the mutex locked while trying to join the watchdog thread.
