Setting up multithreading in Python 3

Question

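I have a Python script that takes a list of input files and runs a command-line program on each of them. The list contains about 6,000 files on average (it varies by batch). The command-line program takes roughly 1-2 minutes per input file, so processing the whole list ends up taking several days.

I would like to use multithreading to split the list across my processors so that the files are processed in parallel and the whole job finishes faster.

Again, the input is a list of files, and the output is written to the same directory as the input file, but named inputx_output.csv.

I can't seem to get multithreading to work. Below is the current code segment, without multithreading. Every multithreading reference I look at doesn't seem to fit my code well.

This is only a segment; it does not include all the references/inputs needed to run on its own.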

import os
import shutil as sh
import subprocess

Input_File = 'c:/Users/James_Mann1/Desktop/TestBench'

# Best-effort setup: each step may fail (e.g. Cas-Out does not exist yet).
try:
    os.chdir(Input_File)
except OSError:
    pass
try:
    sh.rmtree("Cas-Out")
except OSError:
    pass
try:
    os.mkdir("Cas-Out")
except OSError:
    pass
#print(os.getcwd())
os.chdir("Cas-OffInput")
path = os.getcwd()
Input_Casoff = os.listdir(path)
# Copy the executable next to the input files so it can be called by name.
src = Input_File + "/cas-offinder.exe"
dst = Input_File + "/Cas-OffInput/cas-offinder.exe"
sh.copyfile(src, dst)
# Run cas-offinder on each input file, one after another.
for value in Input_Casoff:
    subprocess.call("cas-offinder " + value + " G0 " + " " + value + "out.txt")
    #os.system("cas-offinder " + value + " G0 " + " " + value + "out.txt")

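I'm not sure where to start. I can't seem to find a good reference for writing this. All I want to do is take the list entries and run the command on them in parallel so that the process finishes faster. The output is generated by the command itself, so nothing needs to be captured.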

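Edit: I was able to solve the original problem; the code is posted below. Now I would like to get back a count that tells me how many files are left.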

import os
import shutil as sh
from multiprocessing import Pool
#%%
Input_File = 'c:/Users/James_Mann1/Desktop/TestBench'

# Best-effort setup: each step may fail (e.g. Cas-Out does not exist yet).
# On Windows, multiprocessing re-imports this module in every worker,
# so this module-level setup also runs again in each worker process.
try:
    os.chdir(Input_File)
except OSError:
    pass
try:
    sh.rmtree("Cas-Out")
except OSError:
    pass
try:
    os.mkdir("Cas-Out")
except OSError:
    pass
#print(os.getcwd())
os.chdir("Cas-OffInput")
path = os.getcwd()
Input_Casoff = os.listdir(path)
src = Input_File + "/cas-offinder.exe"
dst = Input_File + "/Cas-OffInput/cas-offinder.exe"
sh.copyfile(src, dst)

Task_Count = len(Input_Casoff)

def Cas_off(x):
    # Run cas-offinder on one input file; the tool writes its own output file.
    os.system("cas-offinder " + x + " G0 " + " " + x + "out.txt")
    #print(x)
    return x

if __name__ == '__main__':
    # Process the list with 8 worker processes.
    with Pool(8) as p:
        print(p.map(Cas_off, Input_Casoff))

Any suggestions on how to implement a way to get the remaining entries? Something like

x/4000 completed, or processing x/4000.

Thanks!
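
One possible way to get an x/total count is to hand the files to a pool and print a counter each time a task finishes. The following is only a minimal sketch, not a tested replacement for the code above: it uses a thread pool from the standard library (each thread just waits for the external command, so threads are sufficient here), the helper names run_one and run_with_progress are made up for this illustration, and it assumes cas-offinder can be started by name from the current working directory.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_one(name):
    """Run cas-offinder on one input file (same arguments as in the question)."""
    # Assumption: cas-offinder is reachable from the current working directory.
    subprocess.run(["cas-offinder", name, "G0", name + "out.txt"], check=False)
    return name


def run_with_progress(items, worker=run_one, max_workers=8):
    """Run worker on every item in a thread pool, printing x/total as each one finishes."""
    total = len(items)
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(worker, item) for item in items]
        # as_completed yields each future as soon as its task is done.
        for done, future in enumerate(as_completed(futures), start=1):
            finished.append(future.result())
            print(f"{done}/{total} completed")
    return finished
```

With the variables from the question, the call would be run_with_progress(Input_Casoff). If you prefer to keep multiprocessing.Pool, iterating over p.imap_unordered(Cas_off, Input_Casoff) inside the with block and counting the results as they arrive should give the same kind of counter.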

Answer 1

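One approach is to use mpi4py, the Python bindings for MPI (Message Passing Interface), which let you do parallel computing. MPI is a fairly large library that lets you run a program on multiple processors and even communicate between those processors.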

On a Mac, I installed it with brew:

brew install open-mpi

Then you can install mpi4py with pip:

pip3 install mpi4py

Your code should then be structured like this:

from mpi4py import MPI

comm = MPI.COMM_WORLD # MPI communicator
size = comm.Get_size() # Number of processors
rank = comm.Get_rank() # Rank of the processor

# Input and output file names
nb_files = 10
input_files = [f'input_file_{i}.txt' for i in range(nb_files)]
output_files = [f'output_file_{i}.txt' for i in range(nb_files)]

# Loop on all files to process
for i in range(len(input_files)):
    # The current processor only processes files whose index is
    # congruent to its rank modulo the total number of processors;
    # it skips every other file.
    if (i%size==rank):
        # Here processor rank must:
        # - read file input_files[i]
        # - process the data
        # - write file output_files[i]
        print(f'CPU{rank} transforms "{input_files[i]}" into "{output_files[i]}".') # dummy code

You can run the code with the following command (replace 4 with the desired number of CPUs and script.py with the name of your Python program file):

mpiexec -n 4 python3 script.py

Output:

% mpiexec -n 4 python3 script.py
CPU3 transforms "input_file_3.txt" into "output_file_3.txt".
CPU3 transforms "input_file_7.txt" into "output_file_7.txt".
CPU1 transforms "input_file_1.txt" into "output_file_1.txt".
CPU1 transforms "input_file_5.txt" into "output_file_5.txt".
CPU1 transforms "input_file_9.txt" into "output_file_9.txt".
CPU0 transforms "input_file_0.txt" into "output_file_0.txt".
CPU0 transforms "input_file_4.txt" into "output_file_4.txt".
CPU0 transforms "input_file_8.txt" into "output_file_8.txt".
CPU2 transforms "input_file_2.txt" into "output_file_2.txt".
CPU2 transforms "input_file_6.txt" into "output_file_6.txt".
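
The index test i % size == rank in the loop above deals the files out round-robin. As a small illustration in plain Python (no MPI needed, and files_for_rank is a name made up for this sketch), the same selection can be written as a slice, which makes it easy to check which files each rank would get:

```python
def files_for_rank(files, rank, size):
    """Return the files that the process with this rank handles (round-robin)."""
    # Same selection as `if i % size == rank` in the loop above.
    return files[rank::size]


input_files = [f"input_file_{i}.txt" for i in range(10)]

# Show the share of each of 4 hypothetical ranks.
for rank in range(4):
    print(rank, files_for_rank(input_files, rank, 4))
```

Every file lands in exactly one slice, so no two ranks work on the same file.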

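Take care that no two processors handle the same file, to avoid conflicts (all file names should be different). Again, MPI is a very large library, and you could imagine using it to add communication (for example every minute) that reduces, i.e. sums, the number of completed files across all processors.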

Hope this helps.

Solution 1
0 2020-06-26 19:32:04
