
Multiprocessing vs Threading in Python

I am learning multiprocessing and threading in Python to process and create a large number of files; the workflow is shown in the diagram here: [diagram]

Each output file depends on the analysis of all input files.

Running the program in a single process takes quite a long time, so I tried the following code:

(a) multiprocessing

import time
from multiprocessing import Pool, cpu_count

start = time.time()
process_count = cpu_count()
p = Pool(process_count)
for i in range(process_count):
    # args was (i, w) with w undefined; pass only the worker index, as in (b)
    p.apply_async(my_read_process_and_write_func, args=(i,))

p.close()
p.join()
end = time.time()

(b) threading

import time
import threading
from multiprocessing import cpu_count

start = time.time()
thread_count = cpu_count()
thread_list = []

for i in range(thread_count):
    t = threading.Thread(target=my_read_process_and_write_func, args=(i,))
    thread_list.append(t)

for t in thread_list:
    t.start()

for t in thread_list:
    t.join()

end = time.time()

I am running this code using Python 3.6 on a Windows PC with 8 cores. However, the multiprocessing version takes about the same time as the single-process version, and the threading version takes about 75% of the single-process time.

My questions are:

Is my code correct?

Is there any better way to improve the efficiency? Thanks!

Your processing is I/O bound, not CPU bound. As a result, having multiple processes helps little: each Python worker sits waiting for input or output while the CPU does nothing. Increasing the Pool size beyond the CPU count should improve performance, because more I/O requests can be in flight at once.
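To illustrate the point, here is a minimal sketch. The `time.sleep` call is a stand-in for the asker's actual file I/O, and the worker counts are arbitrary; it only demonstrates that for I/O-bound jobs, a pool larger than the CPU count finishes sooner:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io_task(i):
    """Stand-in for one I/O-bound job: sleep instead of reading files."""
    time.sleep(0.1)
    return i

def run_with_workers(n_workers, n_tasks):
    """Run n_tasks jobs on a pool of n_workers threads; return elapsed seconds."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        list(ex.map(fake_io_task, range(n_tasks)))
    return time.time() - start

if __name__ == "__main__":
    # With I/O-bound jobs, 16 workers finish 16 jobs roughly 4x faster
    # than 4 workers, even on a machine with fewer than 16 cores.
    print(f"4 workers:  {run_with_workers(4, 16):.2f}s")   # ~0.4s
    print(f"16 workers: {run_with_workers(16, 16):.2f}s")  # ~0.1s
```

The same reasoning applies to a `multiprocessing.Pool`: for I/O-bound work the pool size is limited by how much concurrent I/O the disk can serve, not by `cpu_count()`.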

Following Tarik's answer: since my processing is I/O bound, I made several copies of the input files, so that each process reads and processes its own copy. Now my code runs 8 times faster.

Now my processing diagram looks like this: [diagram]. My input files consist of one index file (about 400 MB) and 100 other files (each 330 MB; they can be considered a file pool). To generate one output file, the index file and every file in the pool must be read. (For example, if the first line of the index file is 15, then line 15 of every file in the pool must be read to generate output file 1.) Previously I tried multiprocessing and threading without making copies, and the code was very slow. I then optimized it by copying only the index file for each process, so each process reads its own copy of the index file and then reads the shared file pool to generate its output files. Currently, with 8 CPU cores, multiprocessing with a pool size of 8 takes the least time.
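The per-process index-copy approach described above can be sketched roughly as follows. The helper names, file layout, and tiny stand-in files (a one-line index, a three-file pool) are illustrative only, not the asker's actual code or data:

```python
import os
import shutil
import tempfile
from multiprocessing import Pool

def make_index_copies(index_path, n_workers, dest_dir):
    """Give each worker its own copy of the index file, so concurrent
    reads do not all contend on a single copy."""
    copies = []
    for i in range(n_workers):
        copy_path = os.path.join(dest_dir, f"index_{i}.txt")
        shutil.copy(index_path, copy_path)
        copies.append(copy_path)
    return copies

def process_one(job):
    """Worker: read its own index copy, then pull the indexed line
    from every file in the shared pool into one output file."""
    index_copy, pool_files, out_path = job
    with open(index_copy) as f:
        line_no = int(f.readline())           # e.g. first index entry: 15
    with open(out_path, "w") as out:
        for pf in pool_files:
            with open(pf) as f:
                out.write(f.readlines()[line_no - 1])  # 1-based line number
    return out_path

if __name__ == "__main__":
    tmp = tempfile.mkdtemp()
    index = os.path.join(tmp, "index.txt")
    with open(index, "w") as f:
        f.write("2\n")                        # tiny stand-in for the 400 MB index
    pool_files = []
    for j in range(3):                        # stand-in for the 100-file pool
        path = os.path.join(tmp, f"data_{j}.txt")
        with open(path, "w") as f:
            f.write(f"a{j}\nb{j}\n")
        pool_files.append(path)

    copies = make_index_copies(index, 2, tmp)
    jobs = [(copies[i], pool_files, os.path.join(tmp, f"out_{i}.txt"))
            for i in range(2)]
    with Pool(2) as p:
        outputs = p.map(process_one, jobs)
    with open(outputs[0]) as f:
        print(f.read())                       # line 2 of each pool file
```

Copying the small index per worker while sharing the large file pool is the key design choice: it removes contention on the file every worker reads first, without duplicating 33 GB of pool data.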
