Multithreaded file transfer in Python?

Question

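I have a somewhat unusual task at hand, and I'm not sure how best to implement a solution.

I have three workstations connected over 40 Gbps InfiniBand to a NAS running Ubuntu 20.04 LTS. The NAS has a 2 TB NVMe SSD as a write cache and 7 RAID0 units as main storage.

These workstations write raw data to the NAS for later use. Each machine outputs about 6 TB of data files per day, and each file is between 100 and 300 GB. To keep the network from getting too congested, I have them write their output to the NVMe cache first, and I plan to distribute the data files from there, with only one file going to each RAID0 unit at a time, to maximize disk I/O. For example, file1 goes to array0, file2 to array1, file3 to array2, and so on.

I'm now writing a script on the NAS side (preferably as a systemd service, but I can make do with nohup) to monitor the cache and send the files to these RAID arrays.

Here is what I came up with, thanks to this post; it is pretty close to my goal.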

import queue, threading, os, time
import shutil

bfr_drive = '/home/test_folder' # cache
ext = ".dat" # data file extension
array = 0 # simulated array as t0-t6
fileList = [] # list of files to be moved from cache to storage
destPath = '/home/test_folder/t'
fileQueue = queue.Queue()


class ThreadedCopy:
    totalFiles = 0
    copyCount = 0
    array = 0
    lock = threading.Lock()

    def __init__(self):
        for file_name in os.listdir(bfr_drive):
            if file_name.endswith(ext):
                fileList.append(os.path.join(bfr_drive, file_name))
                fileList.sort()

        self.totalFiles = len(fileList)

        print (str(self.totalFiles) + " files to copy.")
        self.threadWorkerCopy(fileList)


    def CopyWorker(self):
        global array
        while True:
            fileName = fileQueue.get()
            shutil.copy(fileName, destPath+str(array))
            array += 1
            if array > 6:
                array = 0
            fileQueue.task_done()

            with self.lock:
                self.copyCount += 1
                
                percent = (self.copyCount * 100) / self.totalFiles
                
                print (str(percent) + " percent copied.")

    def threadWorkerCopy(self, fileNameList):
        # global array
        for i in range(4):
            t = threading.Thread(target=self.CopyWorker)
            t.daemon = True
            t.start()
            # array += 1
            # if array > 6:
                # array = 0
            print ("current array is:" + str(array)) # output prints array0 for 4 times, did not iterate
            
          
        for fileName in fileNameList:
            fileQueue.put(fileName)
        fileQueue.join()

ThreadedCopy()

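The Python script now distributes the files successfully, but only after the number set in for i in range(4). For example, if I set it to 4, the workers use the same path (array0) for the first 4 files, and only then start iterating through the arrays to 1, 2, 3, and so on.

Could someone point out how I should distribute the files? I think I'm heading in the right direction, but I can't understand why the workers are stuck on the same directory at the start.

Edit: In an earlier version of the code, the path iteration was in the spawning method threadWorkerCopy. I have now moved it into the actual worker function, CopyWorker. The problem still persists.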

Answer 1

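The problem is that you don't generate a new array value in the worker threads, only when the threads are created in threadWorkerCopy.
The result depends on the actual timing on your system. Each worker thread uses whatever value array has at the moment it reads it. That read may happen concurrently with, or after, threadWorkerCopy incrementing the value, so you may get the files in different directories or all in the same directory.

To get a new number for every copy operation, the number in array must be incremented in the worker threads. In that case you have to prevent two or more threads from accessing array at the same time. You can do this with another lock.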
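The locking idea can be isolated into a small helper. This is only a minimal sketch, not part of the original code: the class name RoundRobin, its take() method, and the results list are made up for illustration. It reads and advances the counter under a single lock, so no two threads can be handed the same turn.

```python
import threading


class RoundRobin:
    """Hands out 0, 1, ..., size-1, 0, 1, ... safely across threads."""

    def __init__(self, size):
        self._size = size
        self._next = 0
        self._lock = threading.Lock()

    def take(self):
        # Read and advance under one lock so the two steps cannot interleave.
        with self._lock:
            value = self._next
            self._next = (self._next + 1) % self._size
            return value


picker = RoundRobin(7)  # 7 arrays, t0..t6, as in the question
results = []
results_lock = threading.Lock()


def worker():
    # Each worker asks for 7 destinations; a real worker would copy a file here.
    for _ in range(7):
        index = picker.take()
        with results_lock:
            results.append(index)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 28 turns over 7 arrays: every index is handed out exactly 4 times.
print(sorted(results))
```

The order in which the threads receive their numbers still depends on timing; only the guarantee that each turn is handed out once is fixed.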

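For testing, I replaced the directory listing with a hard-coded list of example file names, and replaced the copying with printing the values.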

import queue, threading, os, time
import shutil

bfr_drive = '/home/test_folder' # cache
ext = ".dat" # data file extension
array = 0 # simulated array as t0-t6
fileList = [] # list of files to be moved from cache to storage
destPath = '/home/test_folder/t'
fileQueue = queue.Queue()


class ThreadedCopy:
    totalFiles = 0
    copyCount = 0
    array = 0
    lock = threading.Lock()
    lockArray = threading.Lock()

    def __init__(self):
        # directory listing replaced with hard-coded list for testing
        for file_name in [ 'foo.dat', 'bar.dat', 'baz.dat', 'a.dat', 'b.dat', 'c.dat', 'd.dat', 'e.dat', 'f.dat', 'g.dat' ] :
        #for file_name in os.listdir(bfr_drive):
            if file_name.endswith(ext):
                fileList.append(os.path.join(bfr_drive, file_name))
                fileList.sort()

        self.totalFiles = len(fileList)

        print (str(self.totalFiles) + " files to copy.")
        self.threadWorkerCopy(fileList)


    def CopyWorker(self):
        global array
        while True:
            fileName = fileQueue.get()

            with self.lockArray:
                myArray = array
                array += 1
                if array > 6:
                    array = 0

            # actual copying replaced with output for testing
            print('copying', fileName, destPath+str(myArray))
            #shutil.copy(fileName, destPath+str(myArray))

            with self.lock:
                self.copyCount += 1

                percent = (self.copyCount * 100) / self.totalFiles

                print (str(percent) + " percent copied.")

            # moved to end because otherwise main thread may terminate before the workers
            fileQueue.task_done()

    def threadWorkerCopy(self, fileNameList):
        for i in range(4):
            t = threading.Thread(target=self.CopyWorker)
            t.daemon = True
            t.start()

        for fileName in fileNameList:
            fileQueue.put(fileName)
        fileQueue.join()

ThreadedCopy()

This prints something like the following (it may vary between runs):

10 files to copy.
copying /home/test_folder\a.dat /home/test_folder/t0
10.0 percent copied.
copying /home/test_folder\baz.dat /home/test_folder/t3
20.0 percent copied.
copying /home/test_folder\b.dat /home/test_folder/t1
copying /home/test_folder\c.dat /home/test_folder/t4
copying /home/test_folder\bar.dat /home/test_folder/t2
copying /home/test_folder\d.dat /home/test_folder/t5
30.0 percent copied.
copying /home/test_folder\e.dat /home/test_folder/t6
40.0 percent copied.
copying /home/test_folder\f.dat /home/test_folder/t0
50.0 percent copied.
copying /home/test_folder\foo.dat /home/test_folder/t1
60.0 percent copied.
copying /home/test_folder\g.dat /home/test_folder/t2
70.0 percent copied.
80.0 percent copied.
90.0 percent copied.
100.0 percent copied.

Notes:

I moved the line fileQueue.task_done() to the end of CopyWorker. Otherwise I did not get all the percentage output lines, and sometimes got the error message

Fatal Python error: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads

Maybe you should wait for all worker threads to finish before the main thread ends.
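One common way to wait for the workers is to make them non-daemon threads, send each one a stop marker through the queue, and join() them. The following is a sketch under assumptions, not the original code: the names STOP, file_queue, done and copy_worker are invented here, and the copy itself is replaced by recording the file name.

```python
import queue
import threading

NUM_WORKERS = 4
STOP = None  # sentinel: tells a worker to leave its loop

file_queue = queue.Queue()
done = []
done_lock = threading.Lock()


def copy_worker():
    while True:
        file_name = file_queue.get()
        if file_name is STOP:
            break
        # The real shutil.copy call would go here.
        with done_lock:
            done.append(file_name)


# Non-daemon threads: the interpreter will not shut down underneath them.
workers = [threading.Thread(target=copy_worker) for _ in range(NUM_WORKERS)]
for t in workers:
    t.start()

for name in ["a.dat", "b.dat", "c.dat", "d.dat", "e.dat"]:
    file_queue.put(name)
for _ in workers:
    file_queue.put(STOP)  # one sentinel per worker, queued after all files

for t in workers:
    t.join()  # returns only after that worker has finished all its output

print(len(done), "files processed")
```

Because the main thread blocks in join() until every worker has returned, any printing a worker does after its last copy completes before the program exits.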

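I did not check whether there are other errors in the code.

Edit after the code in the question was changed:

The modified code still has the problem that the worker threads produce some output after fileQueue.task_done(), so the main thread can end before the workers.

The modified code contains a race condition when the worker threads access array, so unexpected behavior is possible.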
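Another way to avoid this race, shown here only as a possible alternative to the extra lock, is to let the main thread decide the destination before queuing each file, so the workers share no counter at all. The names jobs, planned and file_names below are hypothetical, and the copy is again replaced by recording the pair.

```python
import queue
import threading

NUM_ARRAYS = 7                      # t0..t6, as in the question
NUM_WORKERS = 4
dest_path = "/home/test_folder/t"   # same prefix as in the question

# Hypothetical file names standing in for the cache listing.
file_names = [c + ".dat" for c in "abcdefghij"]

# The main thread pairs each file with its destination up front,
# so the workers never touch a shared counter.
jobs = queue.Queue()
for i, name in enumerate(sorted(file_names)):
    jobs.put((name, dest_path + str(i % NUM_ARRAYS)))

planned = []
planned_lock = threading.Lock()


def copy_worker():
    while True:
        try:
            name, dest = jobs.get_nowait()
        except queue.Empty:
            return  # every job was queued before start, so empty means done
        # shutil.copy(name, dest) would go here.
        with planned_lock:
            planned.append((name, dest))


workers = [threading.Thread(target=copy_worker) for _ in range(NUM_WORKERS)]
for t in workers:
    t.start()
for t in workers:
    t.join()

for name, dest in sorted(planned):
    print(name, "->", dest)
```

Note that this fixes only which array each file goes to. It does not by itself guarantee that an array receives just one file at a time: if one copy runs long, a later file assigned to the same array may start while the first is still in progress.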

Solution 1
1 accepted 2021-05-17 18:44:00
