
In a multi-core machine (Linux OS), when will the process scheduler migrate one process to another CPU?

In my program, whose RSS is 65 GB, calling fork takes more than 2 seconds, spent in sys_clone->dup_mm->copy_page_range. While the fork executes, one CPU runs at 100% sys; at the same time, one of my threads cannot get CPU time until the fork finishes. The machine has 16 CPUs, and the other CPUs are idle.

So my question is: one CPU was busy on fork, so why doesn't the scheduler migrate the process waiting on this CPU to another, idle CPU? In general, when and how does the scheduler migrate processes between CPUs?

I searched this site, and the existing threads do not answer my question.

rss is 65G, when call fork, sys_clone->dup_mm->copy_page_range will consume more than 2 seconds

While doing fork (or clone), the VMAs of the existing process must be copied into the VMAs of the new process. The dup_mm function (kernel/fork.c) creates the new mm and does the actual copy. There are no direct calls to copy_page_range, but I think the static function dup_mmap may be inlined into dup_mm, and it is dup_mmap that calls copy_page_range.

In dup_mmap several locks are taken, in both the new mm and the old oldmm:

356         down_write(&oldmm->mmap_sem);

After taking the mmap_sem reader/writer semaphore for writing, there is a loop over all mmaps to copy their meta-information:

381         for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) 

Only after the loop (which is long in your case) is mmap_sem unlocked:

465 out:
468         up_write(&oldmm->mmap_sem);

While mmap_sem (a reader/writer semaphore) is held by a writer, no other reader or writer can do anything with the mmaps in oldmm.
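To make this concrete, here is a small demonstration I wrote (it is not from the question or the original answer): the parent faults in a couple of GB of anonymous memory, a second thread repeatedly does an mmap + pagefault + munmap cycle, and the main thread forks. The 2 GB size, the 10 ms reporting threshold, and the function names are all invented for the sketch; on kernels of this era the mapper thread's latency should spike while fork holds oldmm->mmap_sem for writing:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static double now_ms(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

/* Every iteration needs mmap_sem: mmap/munmap take it for writing,
 * the first touch of the new page takes it for reading. */
static void *mapper(void *arg)
{
        (void)arg;
        for (int i = 0; i < 200; i++) {
                double t0 = now_ms();
                char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                p[0] = 1;
                munmap(p, 4096);
                double dt = now_ms() - t0;
                if (dt > 10.0)
                        printf("mapper stalled %.1f ms (fork in progress?)\n", dt);
                usleep(10000);
        }
        return NULL;
}

int main(void)
{
        /* Fault in 2 GB so copy_page_range has real work to do
         * (the question's process had 65 GB resident). */
        size_t sz = 2UL << 30;
        char *big = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (big == MAP_FAILED)
                return 1;
        memset(big, 1, sz);

        pthread_t t;
        pthread_create(&t, NULL, mapper, NULL);
        sleep(1);

        double t0 = now_ms();
        pid_t pid = fork();        /* holds oldmm->mmap_sem for writing */
        if (pid == 0)
                _exit(0);
        printf("fork took %.1f ms\n", now_ms() - t0);

        waitpid(pid, NULL, 0);
        pthread_join(t, NULL);
        return 0;
}

Compile with gcc -O2 -pthread. The point is that the stall is a lock wait, not a scheduler failure: migrating the blocked thread to an idle CPU would not help, because it is not runnable until mmap_sem is released.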

one thread cannot get cpu time until fork finish. So my question is one cpu was busy on fork, why the scheduler don't migrate the process waiting on this cpu to other idle cpu?

Are you sure that the other thread is ready to run and is not trying to do something with mmaps, such as:

  • mmaping something new or unmapping something no longer needed,
  • growing or shrinking its heap (brk),
  • growing its stack,
  • pagefaulting,
  • or many other activities...?

Actually, the wait-cpu thread is my IO thread, which sends/receives packets to/from clients. From my observation, there are always packets available, but the IO thread cannot receive them.

You should check the stack of your wait-cpu thread (there is even a SysRq for this: SysRq-t dumps the stacks of all tasks), and what kind of I/O it does. mmaping a file is the variant of I/O that fork will block on mmap_sem.

You can also check the "last used CPU" of the wait-cpu thread, e.g. in the top monitoring utility, by enabling the thread view (H key) and adding the "Last used CPU" column to the output (fj in older versions; f, scroll to P, Enter in newer ones). I think it is possible that your wait-cpu thread was already on another CPU, just not allowed (not ready) to run.
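If you would rather have the thread report this itself than sample it in top, glibc's sched_getcpu() returns the CPU the caller is currently running on; this snippet is my addition, not part of the original answer:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Call from the IO thread's loop: logs which CPU it is currently on.
 * sched_getcpu() returns -1 (with errno set) on failure. */
static void log_cpu(const char *tag)
{
        fprintf(stderr, "%s: on CPU %d\n", tag, sched_getcpu());
}

A thread that keeps logging the same CPU number as the one pegged at 100% sys, then goes silent for the duration of the fork, is blocked rather than mis-scheduled.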

If you are using fork only to exec, it can be useful to:

  • either switch to vfork + exec (or just to posix_spawn; see the sketch after this list). vfork will suspend your process (but may not suspend your other threads; it is dangerous) until the new process does exec or exit, but execing may be faster than waiting for 65 GB of mmaps to be copied.
  • or not do the fork from the multithreaded process with several active threads and multi-GB virtual memory. You can create a small helper process (without the multi-GB mmaps), communicate with it using IPC, sockets, or pipes, and ask it to fork and do everything you want.
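A minimal posix_spawn sketch of the first option (my example; /bin/true is just a placeholder command). Recent glibc versions implement posix_spawn with clone(CLONE_VM|CLONE_VFORK), so the parent's 65 GB of page tables are never copied:

#include <spawn.h>
#include <stdio.h>
#include <sys/wait.h>

extern char **environ;

int main(void)
{
        char *argv[] = { "/bin/true", NULL };  /* placeholder command */
        pid_t pid;
        /* posix_spawn returns an error number directly (it does not set errno) */
        int err = posix_spawn(&pid, "/bin/true", NULL, NULL, argv, environ);
        if (err != 0) {
                fprintf(stderr, "posix_spawn: error %d\n", err);
                return 1;
        }
        waitpid(pid, NULL, 0);
        return 0;
}

The same call works from a small helper process in the second option: the helper has a tiny mm, so its fork or spawn is cheap.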

