400 threads in 20 processes outperform 400 threads in 4 processes while performing an I/O-bound task

Question

Experimental Code

Here is the experimental code. It launches a specified number of worker processes, then launches a specified number of worker threads within each process, and has each thread fetch URLs:

import multiprocessing
import sys
import time
import threading
import urllib.request


def main():
    processes = int(sys.argv[1])
    threads = int(sys.argv[2])
    urls = int(sys.argv[3])

    # Start process workers.
    in_q = multiprocessing.Queue()
    process_workers = []
    for _ in range(processes):
        w = multiprocessing.Process(target=process_worker, args=(threads, in_q))
        w.start()
        process_workers.append(w)

    start_time = time.time()

    # Feed work.
    for n in range(urls):
        in_q.put('http://www.example.com/?n={}'.format(n))

    # Send sentinel for each thread worker to quit.
    for _ in range(processes * threads):
        in_q.put(None)

    # Wait for workers to terminate.
    for w in process_workers:
        w.join()

    # Print time consumed and fetch speed.
    total_time = time.time() - start_time
    fetch_speed = urls / total_time
    print('{} x {} workers => {:.3} s, {:.1f} URLs/s'
          .format(processes, threads, total_time, fetch_speed))



def process_worker(threads, in_q):
    # Start thread workers.
    thread_workers = []
    for _ in range(threads):
        w = threading.Thread(target=thread_worker, args=(in_q,))
        w.start()
        thread_workers.append(w)

    # Wait for thread workers to terminate.
    for w in thread_workers:
        w.join()


def thread_worker(in_q):
    # Each thread performs the actual work. In this case, we will assume
    # that the work is to fetch a given URL.
    while True:
        url = in_q.get()
        if url is None:
            break

        with urllib.request.urlopen(url) as u:
            pass # Do nothing
            # print('{} - {} {}'.format(url, u.getcode(), u.reason))


if __name__ == '__main__':
    main()

Here is how I run this program:

python3 foo.py <PROCESSES> <THREADS> <URLS>

For example, python3 foo.py 20 20 10000 creates 20 worker processes with 20 threads in each worker process (a total of 400 worker threads) and fetches 10000 URLs. At the end, the program prints how long it took to fetch the URLs and how many URLs it fetched per second on average.

Note that in all cases I am really hitting a URL of the www.example.com domain, i.e., www.example.com is not merely a placeholder. In other words, I run the above code unmodified.

Environment

I am testing this code on a Linode virtual private server that has 8 GB RAM and 4 CPUs. It is running Debian 9.

$ cat /etc/debian_version 
9.9

$ python3
Python 3.5.3 (default, Sep 27 2018, 17:25:39) 
[GCC 6.3.0 20170516] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7987          67        7834          10          85        7734
Swap:           511           0         511

$ nproc
4

Case 1: 20 Processes x 20 Threads

Here are a few trial runs with 400 worker threads distributed between 20 worker processes (i.e., 20 worker threads in each of the 20 worker processes). In each trial, 10,000 URLs are fetched.

Here are the results:

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.12 s, 1954.6 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.28 s, 1895.5 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.22 s, 1914.2 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.38 s, 1859.8 URLs/s

$ python3 foo.py 20 20 10000
20 x 20 workers => 5.19 s, 1925.2 URLs/s

We can see that about 1900 URLs are fetched per second on average. When I monitor the CPU usage with the top command, I see that each python3 worker process consumes about 10% to 15% CPU.

Case 2: 4 Processes x 100 Threads

Then I reasoned that I have only 4 CPUs. Even if I launch 20 worker processes, at most 4 processes can run at any point in physical time. Further, due to the global interpreter lock (GIL), only one thread in each process (thus at most 4 threads in total) can run at any point in physical time.

Therefore, I thought that if I reduce the number of processes to 4 and increase the number of threads per process to 100, so that the total number of threads still remains 400, the performance should not deteriorate.

But the test results show that 4 processes containing 100 threads each consistently perform worse than 20 processes containing 20 threads each.

$ python3 foo.py 4 100 10000
4 x 100 workers => 9.2 s, 1086.4 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 10.9 s, 916.5 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 7.8 s, 1282.2 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 10.3 s, 972.3 URLs/s

$ python3 foo.py 4 100 10000
4 x 100 workers => 6.37 s, 1570.9 URLs/s

The CPU usage is between 40% and 60% for each python3 worker process.

Case 3: 1 Process x 400 Threads

Just for comparison, I am recording the fact that both case 1 and case 2 outperform the case where all 400 threads are in a single process. This is most certainly due to the global interpreter lock (GIL).

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.5 s, 742.8 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 14.3 s, 697.5 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.1 s, 761.3 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 15.6 s, 640.4 URLs/s

$ python3 foo.py 1 400 10000
1 x 400 workers => 13.1 s, 764.4 URLs/s

The CPU usage is between 120% and 125% for the single python3 worker process.

Case 4: 400 Processes x 1 Thread

Again, just for comparison, here is how the results look when there are 400 processes, each with a single thread.

$ python3 foo.py 400 1 10000
400 x 1 workers => 14.0 s, 715.0 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 6.1 s, 1638.9 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 7.08 s, 1413.1 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 7.23 s, 1382.9 URLs/s

$ python3 foo.py 400 1 10000
400 x 1 workers => 11.3 s, 882.9 URLs/s

The CPU usage is between 1% and 3% for each python3 worker process.

Summary

Picking the median result from each case, we get this summary:

Case 1:  20 x  20 workers => 5.22 s, 1914.2 URLs/s ( 10% to  15% CPU/process)
Case 2:   4 x 100 workers => 9.20 s, 1086.4 URLs/s ( 40% to  60% CPU/process)
Case 3:   1 x 400 workers => 13.5 s,  742.8 URLs/s (120% to 125% CPU/process)
Case 4: 400 x   1 workers => 7.23 s, 1382.9 URLs/s (  1% to   3% CPU/process)
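
The medians in this summary can be checked against the trial timings listed under each case. The following is a minimal sketch, assuming the timings are typed in by hand; the trials dictionary and its labels are illustrative and not part of foo.py.

```python
import statistics

URLS = 10000  # URLs fetched per trial, as in the runs above

# Wall-clock seconds for the five trials of each case, copied from the
# console output shown under each case.
trials = {
    '20 x 20': [5.12, 5.28, 5.22, 5.38, 5.19],
    '4 x 100': [9.2, 10.9, 7.8, 10.3, 6.37],
    '1 x 400': [13.5, 14.3, 13.1, 15.6, 13.1],
    '400 x 1': [14.0, 6.1, 7.08, 7.23, 11.3],
}

# With an odd number of trials, the median is the middle trial itself.
medians = {case: statistics.median(times) for case, times in trials.items()}

for case, seconds in medians.items():
    # Throughput is recomputed from the rounded time, so it can differ
    # slightly from the URLs/s figure printed by the original run.
    print('{} workers => {} s, ~{:.0f} URLs/s'.format(case, seconds, URLS / seconds))
```

Because throughput is just the URL count divided by the elapsed time, the trial with the median time is also the trial with the median throughput, so picking the median by time is enough.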

Question

Why do 20 processes x 20 threads perform better than 4 processes x 100 threads even though I have only 4 CPUs?

Answer 1

Your task is I/O-bound rather than CPU-bound: threads spend most of their time asleep, waiting for network data and the like, rather than using the CPU.

So adding more threads than CPUs works here as long as I/O is still the bottleneck. The effect will only subside once there are so many threads that enough of them are ready at a time to start actively competing for CPU cycles (or when your network bandwidth is exhausted, whichever comes first).
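
To see why more threads than CPUs can help when the work is mostly waiting, here is a minimal sketch that replaces the network request with time.sleep(). The names fake_fetch and timed_run, and the task count and delay, are assumptions made for illustration; this is not the benchmark from the question.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(delay):
    # Stand-in for a network request: time.sleep() releases the GIL
    # while waiting, much like a blocking socket read does.
    time.sleep(delay)
    return delay


def timed_run(workers, tasks=20, delay=0.05):
    # Run `tasks` fake fetches on `workers` threads; return elapsed seconds.
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_fetch, [delay] * tasks))
    return time.perf_counter() - start


serial = timed_run(workers=1)     # the waits happen one after another
threaded = timed_run(workers=20)  # the waits overlap, whatever the CPU count
print('1 thread: {:.2f} s, 20 threads: {:.2f} s'.format(serial, threaded))
```

The threaded run should finish in roughly the time of a single wait, because sleeping threads do not need a CPU. Real fetches also spend some time running Python code under the GIL, which this sketch deliberately leaves out.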

As for why 20 threads per process is faster than 100 threads per process: this is most likely due to CPython's GIL. Python threads in the same process need to wait not only for I/O but for each other, too. When dealing with I/O, the Python machinery:

1. Converts all Python objects involved into C objects (in many cases, this can be done without physically copying the data)
2. Releases the GIL
3. Performs the I/O in C (which involves waiting for it for an arbitrary time)
4. Reacquires the GIL
5. Converts the result to a Python object if applicable

If there are enough threads in the same process, it becomes increasingly likely that another one is active when step 4 is reached, causing an additional random delay.
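
The queueing at step 4 can be illustrated with a toy model. This is a sketch under strong assumptions, not a measurement of CPython: one threading.Lock per simulated "process" stands in for that process's GIL, and time.sleep() stands in both for the I/O wait and for the interpreter work done while holding the lock. The name run_model and all parameter values are hypothetical.

```python
import threading
import time


def run_model(processes, threads_per_process, io=0.01, hold=0.01):
    # One lock per simulated "process" plays the role of that process's GIL.
    gils = [threading.Lock() for _ in range(processes)]

    def worker(gil):
        time.sleep(io)        # step 3: wait for I/O with the "GIL" released
        with gil:             # step 4: reacquire, queueing behind sibling threads
            time.sleep(hold)  # step 5: stand-in for work done under the GIL

    workers = [threading.Thread(target=worker, args=(gil,))
               for gil in gils
               for _ in range(threads_per_process)]
    start = time.perf_counter()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.perf_counter() - start


one_lock = run_model(processes=1, threads_per_process=40)
four_locks = run_model(processes=4, threads_per_process=10)
print('1 x 40: {:.2f} s, 4 x 10: {:.2f} s'.format(one_lock, four_locks))
```

With the same 40 workers, sharing one lock forces all 40 hold periods to run back to back, while four locks let four queues drain in parallel. The model ignores the limit of 4 physical CPUs and the cost of the processes themselves, so it only shows the direction of the effect, not its size.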

Now, when it comes to lots of processes, other factors such as memory swapping come into play (unlike threads, processes running the same code don't share memory). I'm pretty sure that many processes, as opposed to threads, incur other delays from competing for resources, but I can't name them off the top of my head. That's why the performance becomes unstable.

Answered 2019-05-23 10:38:36
