Strange behavior of clone

Question

This is a fairly simple application that creates a lightweight process (thread) with a clone() call.

#define _GNU_SOURCE

#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
#include <time.h>

#define STACK_SIZE 1024*1024

int func(void* param) {
    printf("I am func, pid %d\n", getpid());    
    return 0;
}

int main(int argc, char const *argv[]) {
    printf("I am main, pid %d\n", getpid());
    void* ptr = malloc(STACK_SIZE);

    printf("I am calling clone\n");             
    int res = clone(func, ptr + STACK_SIZE, CLONE_VM, NULL);
    // works fine with sleep() call
    // sleep(1);

    if (res == -1) {
        printf("clone error: %d", errno);       
    } else {
        printf("I created child with pid: %d\n", res);      
    }

    printf("Main done, pid %d\n", getpid());        
    return 0;
}

Here are the results:

Run 1:

➜  LFD401 ./clone
I am main, pid 10974
I am calling clone
I created child with pid: 10975
Main done, pid 10974
I am func, pid 10975

Run 2:

➜  LFD401 ./clone
I am main, pid 10995
I am calling clone
I created child with pid: 10996
I created child with pid: 10996
I am func, pid 10996
Main done, pid 10995

Run 3:

➜  LFD401 ./clone
I am main, pid 11037
I am calling clone
I created child with pid: 11038
I created child with pid: 11038
I am func, pid 11038
I created child with pid: 11038
I am func, pid 11038
Main done, pid 11037

Run 4:

➜  LFD401 ./clone
I am main, pid 11062
I am calling clone
I created child with pid: 11063
Main done, pid 11062
Main done, pid 11062
I am func, pid 11063

What is going on here? Why is the "I created child" message sometimes printed several times?

I also noticed that adding a delay after the clone call "fixes" the problem.

Answer 1

You have a race condition: with clone you do not get the implied thread safety of stdio.

The problem is even more severe than the question shows. You can also get duplicate "func" messages.

The problem is that clone does not give the same guarantees as pthread_create. In particular, you do not get the thread-safe behavior of printf.

I don't know for sure but, in my opinion, the wording about stdio streams and thread safety only applies in practice when using pthreads.

So, you'll have to handle your own interthread locking.

Here is a version of your program recoded to use pthread_create. It seems to work without incident:

#define _GNU_SOURCE

#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>

#define STACK_SIZE 1024*1024

void *func(void* param) {
    printf("I am func, pid %d\n", getpid());
    return (void *) 0;
}

int main(int argc, char const *argv[]) {
    printf("I am main, pid %d\n", getpid());
    void* ptr = malloc(STACK_SIZE);

    printf("I am calling clone\n");

    pthread_t tid;
    pthread_create(&tid,NULL,func,NULL);
    //int res = clone(func, ptr + STACK_SIZE, CLONE_VM, NULL);
    int res = 0;

    // works fine with sleep() call
    // sleep(1);

    if (res == -1) {
        printf("clone error: %d", errno);
    } else {
        printf("I created child with pid: %d\n", res);
    }

    pthread_join(tid,NULL);
    printf("Main done, pid %d\n", getpid());
    return 0;
}

Here is a test script I've been using to check for errors (it's a little rough, but should be okay). Run it against your version and it will abort quickly. The pthread_create version seems to pass just fine.

#!/usr/bin/perl
# clonetest -- clone test
#
# arguments:
#   "-p0" -- suppress check for duplicate parent messages
#   "-c0" -- suppress check for duplicate child messages
#   1 -- base name for program to test (e.g. for xyz.c, use xyz)
#   2 -- [optional] number of test iterations (DEFAULT: 100000)

master(@ARGV);
exit(0);

# master -- master control
sub master
{
    my(@argv) = @_;
    my($arg,$sym);

    while (1) {
        $arg = $argv[0];
        last unless (defined($arg));

        last unless ($arg =~ s/^-(.)//);
        $sym = $1;

        shift(@argv);

        $arg = 1
            if ($arg eq "");

        $arg += 0;
        ${"opt_$sym"} = $arg;
    }

    $opt_p //= 1;
    $opt_c //= 1;
    printf("clonetest: p=%d c=%d\n",$opt_p,$opt_c);

    $xfile = shift(@argv);
    $xfile //= "clone1";
    printf("clonetest: xfile='%s'\n",$xfile);

    $itermax = shift(@argv);
    $itermax //= 100000;
    $itermax += 0;
    printf("clonetest: itermax=%d\n",$itermax);

    system("cc -o $xfile -O2 $xfile.c -lpthread");
    $code = $? >> 8;
    die("master: compile error\n")
        if ($code);

    $logf = "/tmp/log";

    for ($iter = 1;  $iter <= $itermax;  ++$iter) {
        printf("iter: %d\n",$iter)
            if ($opt_v);
        dotest($iter);
    }
}

# dotest -- perform single test
sub dotest
{
    my($iter) = @_;
    my($parcnt,$cldcnt);
    my($xfsrc,$bf);

    system("./$xfile > $logf");

    open($xfsrc,"<$logf") or
        die("dotest: unable to open '$logf' -- $!\n");

    while ($bf = <$xfsrc>) {
        chomp($bf);

        if ($opt_p) {
            while ($bf =~ /created/g) {
                ++$parcnt;
            }
        }

        if ($opt_c) {
            while ($bf =~ /func/g) {
                ++$cldcnt;
            }
        }
    }

    close($xfsrc);

    if (($parcnt > 1) or ($cldcnt > 1)) {
        printf("dotest: fail on %d -- parcnt=%d cldcnt=%d\n",
            $iter,$parcnt,$cldcnt);
        system("cat $logf");
        exit(1);
    }
}

UPDATE:

Were you able to recreate OP's problem with clone?

Absolutely. Before I created the pthreads version, in addition to testing OP's original version, I also created versions that:

(1) added setlinebuf to the start of main

(2) added fflush just before the clone and __fpurge as the first statement of func

(3) added an fflush in func before the return 0

Version (2) eliminated the duplicate parent messages, but the duplicate child messages remained.

If you'd like to see this for yourself, download OP's version from the question, my version, and the test script. Then, run the test script on OP's version.

I posted enough information and files so that anyone can recreate the problem.

Note that due to differences between my system and OP's, I couldn't at first reproduce the problem in just 3-4 tries. That's why I created the script.

The script does 100,000 test runs, and the problem usually manifests itself within 5000-15000 runs.

Answer 2

Your processes both use the same stdout (that is, the C standard library FILE struct), which includes an accidentally shared buffer. That's undoubtedly causing problems.

Answer 3

I can't recreate OP's issue, but I don't think the printf calls are actually a problem.

glibc docs:

The POSIX standard requires that by default the stream operations are atomic. I.e., issuing two stream operations for the same stream in two threads at the same time will cause the operations to be executed as if they were issued sequentially. The buffer operations performed while reading or writing are protected from other uses of the same stream. To do this each stream has an internal lock object which has to be (implicitly) acquired before any work can be done.

Edit:

Even though the above is true for threads, as rici points out, there is a comment on sourceware:

Basically, there's nothing you can safely do with CLONE_VM unless the child restricts itself to pure computation and direct syscalls (via sys/syscall.h). If you use any of the standard library, you risk the parent and child clobbering each other's internal states. You also have issues like the fact that glibc caches the pid/tid in userspace, and the fact that glibc expects to always have a valid thread pointer which your call to clone is unable to initialize correctly because it does not know (and should not know) the internal implementation of threads.

Apparently, glibc isn't designed to work with clone if CLONE_VM is set but CLONE_THREAD|CLONE_SIGHAND are not.

Answer 4

As everyone suggests, it really seems to be a problem with, how shall I put it in the case of clone(), process safety. With a rough sketch of a locking version of printf (using write(2)), the output is as expected.

#define _GNU_SOURCE

#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
#include <time.h>

#define STACK_SIZE 1024*1024

// VERY rough attempt at a thread-safe printf
#include <stdarg.h>
#define SYNC_REALLOC_GROW 64
int sync_printf(const char *format, ...)
{
  int n, all = 0;
  int size = 256;
  char *p, *np;
  va_list args;

  if ((p = malloc(size)) == NULL)
    return -1;

  for (;;) {
    va_start(args, format);
    n = vsnprintf(p, size, format, args);
    va_end(args);
    if (n < 0) {
      free(p);          // do not leak the buffer on a formatting error
      return -1;
    }
    all = n;            // length of the final text, not a running total across retries
    if (n < size)
      break;
    size = n + SYNC_REALLOC_GROW;
    if ((np = realloc(p, size)) == NULL) {
      free(p);
      return -1;
    } else {
      p = np;
    }
  }
  // write(2) should be thread-safe, so the lock is just in case
  flockfile(stdout);
  n = (int) write(fileno(stdout), p, all);
  fflush(stdout);
  funlockfile(stdout);
  free(p);
  return n;
}


int func(void *param)
{
  sync_printf("I am func, pid %d\n", getpid());
  return 0;
}

int main()
{

  sync_printf("I am main, pid %d\n", getpid());
  void *ptr = malloc(STACK_SIZE);

  sync_printf("I am calling clone\n");
  int res = clone(func, ptr + STACK_SIZE, CLONE_VM, NULL);
  // works fine with sleep() call
  // sleep(1);

  if (res == -1) {
    sync_printf("clone error: %d", errno);
  } else {
    sync_printf("I created child with pid: %d\n", res);
  }
  sync_printf("Main done, pid %d\n\n", getpid());
  return 0;
}

For the third time: it's only a sketch. I had no time for a robust version, but that shouldn't stop you from writing one.

Answer 5

As evaitl points out, printf is documented to be thread-safe by glibc's documentation. BUT this typically assumes that you are using the designated glibc function to create threads (that is, pthread_create()). If you do not, then you are on your own.

The lock taken by printf() is recursive (see flockfile). This means that if the lock is already taken, the implementation checks the owner of the lock against the thread trying to lock it. If they are the same, the locking attempt succeeds.

To distinguish between different threads, you need to set up TLS properly, which you do not do, but pthread_create() does. What I'm guessing happens is that in your case the TLS variable that identifies the thread is the same for both threads, so you end up taking the lock.

TL;DR: please use pthread_create()
