How would I create a multiprocessing.Lock() for each element in a dictionary?

Question

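I am trying to build a multiprocessing-based program that uses a file cache to speed things up. The cache is empty when the program starts and is populated as requests for data come in. There is also a second set of files, which are the unprocessed versions of the files that get loaded into the cache. The multiprocessing code I am using looks like this: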

# file_caches is a multiprocessing.Manager().dict()
# file_cache_lock is a multiprocessing.Lock()

    if file_path in file_caches:
        # We have a cache
        file_cache_lock.acquire()
        cached = file_caches[file_path][:]
        file_cache_lock.release()

        data1 = cached[0]
        data2 = cached[1]
    elif file_path.exists():
        data1 = np.load(file_path)
        data2 = get_data2()

        if file_cache_lock.acquire(False) and (file_path not in file_caches): # Non-blocking acquire
            file_caches[file_path] = (data1, data2)
            file_cache_lock.release()
    else:
        # Load original file
        data1, data2 = read_and_process(original_file_path)

        # save data
        file_path.parent.mkdir(parents=True, exist_ok=True)
        with open(file_path, "wb") as f:
            np.save(f, data1, allow_pickle=False)
        
        if file_cache_lock.acquire(False) and (file_path not in file_caches): # Non-blocking acquire
            file_caches[file_path] = (data1, data2)
            file_cache_lock.release()

However, this can cause a race condition if two (or more) processes try to fetch the same file at slightly different times.

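Suppose process A runs this code and finds that there is no cache entry and that the file to be cached has not been created yet, so it processes the original file and creates the backing file for the cache. Process B arrives after process A has created the file but before it has finished writing. Process B ends up in the elif branch and starts reading incompletely written data. Clearly, this is a problem.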

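So I would like to add an extra field to the tuple in the cache dictionary, a multiprocessing.Lock(), so that I do not block reads and writes of other data while still preventing the race condition. However, it is not that simple, because I get the error: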

Lock objects should only be shared between processes through inheritance

So, is there a way to dynamically create locks and add them to the dictionary like this? Or is there a better way to solve this problem?
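One direction that stays within the "inheritance" constraint from the error message is lock striping: create a fixed pool of locks in the parent before any worker starts, then map each file path to one lock in the pool. The sketch below is only an illustration under stated assumptions, not something taken from the answers that follow. The names `N_STRIPES`, `make_lock_pool`, and `lock_for` are hypothetical, and the pool size of 16 is arbitrary.

```python
import multiprocessing as mp
import zlib

N_STRIPES = 16  # assumed pool size; hypothetical, tune to taste


def make_lock_pool(n=N_STRIPES):
    """Create all locks once, in the parent, before any worker is started."""
    return [mp.Lock() for _ in range(n)]


def lock_for(file_path, lock_pool):
    """Map a file path to one lock from the pre-created pool."""
    # crc32 gives the same value in every process, whereas the built-in
    # hash() of a str can differ between interpreter processes.
    index = zlib.crc32(str(file_path).encode("utf-8")) % len(lock_pool)
    return lock_pool[index]
```

The pool would be handed to each worker when the process is created (for example through the `args` of `multiprocessing.Process`), and a worker would wrap its file access in `with lock_for(file_path, lock_pool):`. Two different paths can land on the same lock, which costs some unnecessary waiting but does not affect correctness.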

Answer 1

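I would use a queue instead of your dictionary. Your processes read their tasks from the same queue. At the start, you fill the queue with all the unprocessed files. These files get processed, and when one is done the active process puts the processed file name back into the same queue. Because the queue is emptied in order, you never hit a race condition on incomplete data.

In pseudocode: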

import multiprocessing as mp

def input_polling(in_queue):
    # Polls the input queue and stops when "STOP" is sent;
    # in_queue.get blocks until an element becomes available.
    for a in iter(in_queue.get, 'STOP'):
        if a == unprocessed:
            process(a)
            in_queue.put(a_processed)
        if a == processed:
            process(a)

def main(args):
    in_queue = mp.Queue()
    for n in range(4):
        worker = mp.Process(target=input_polling, args=(in_queue,))
        worker.start()
    for element in list_unprocessed_files:
        in_queue.put(element)

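After your processes are created and started, they sit idle until something is put into the queue. You can stop the processes later by putting "STOP" into the queue (one "STOP" per process, since each process consumes a single sentinel before it exits).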
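The sentinel pattern above can be made concrete with a small runnable sketch. It is a simplification, not the answer's full design: it uses a separate output queue instead of feeding results back into the input queue, and `item.upper()` is only a stand-in for real file processing. The names `drain` and `run_workers` are hypothetical.

```python
import multiprocessing as mp

STOP = "STOP"  # sentinel value, as in the pseudocode above


def drain(in_queue, out_queue):
    """Consume items until the STOP sentinel arrives."""
    # iter(callable, sentinel) keeps calling in_queue.get() and ends the
    # loop as soon as the returned value equals STOP.
    for item in iter(in_queue.get, STOP):
        out_queue.put(item.upper())  # stand-in for real processing


def run_workers(items, n_workers=2):
    """Start workers, feed them items, then shut them down cleanly."""
    in_queue, out_queue = mp.Queue(), mp.Queue()
    workers = [
        mp.Process(target=drain, args=(in_queue, out_queue))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for item in items:
        in_queue.put(item)
    for _ in workers:
        in_queue.put(STOP)  # one sentinel per worker
    # Collect the results before joining so no worker blocks on a full queue.
    results = sorted(out_queue.get() for _ in items)
    for w in workers:
        w.join()
    return results
```

`run_workers` should be called from under an `if __name__ == "__main__":` guard, because platforms that start child processes by spawning re-import the main module.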

Answer 2

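My first observation is that you do not need to acquire the lock to test whether the file path is in the cache and, if it is, to fetch the value (see my second code version).

But the simplest (not necessarily the best) way to avoid the race condition is to perform all of the cache logic only after acquiring the lock, as follows (though the second version offers a better option):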

def worker(file_path, original_file_path, file_caches, file_cache_lock):
    with file_cache_lock:
        if file_path in file_caches:
            # Found in cache!
            data1, data2 = file_caches[file_path]
        elif file_path.exists():
            data1 = np.load(file_path)
            data2 = get_data2()
            file_caches[file_path] = (data1, data2)
        else:
            # Load original file
            data1, data2 = read_and_process(original_file_path)
            # save data
            file_path.parent.mkdir(parents=True, exist_ok=True)
            with open(file_path, "wb") as f:
                np.save(f, data1, allow_pickle=False)
            file_caches[file_path] = (data1, data2)
            # No explicit release(): the enclosing with block releases the lock.
    ... # rest of code that uses data1 and data2 omitted

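You might be concerned that, if the data is not in the cache, you will hold the lock while reading and possibly writing several files. So if you would rather risk some unnecessary file I/O than block other processes that may be trying to acquire the lock, the following code keeps locking of the cache to a minimum. Ultimately, the only times locking is required are when a process writes the file to file_path or when a process loads the file from file_path, so that a partially created file is never read: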

def worker(file_path, original_file_path, file_caches, file_cache_lock):
    if file_path in file_caches:
        # Found in cache!
        data1, data2 = file_caches[file_path]
    elif file_path.exists():
        # Now we must acquire the lock in case the file is being written:
        with file_cache_lock:
            # Check one more time to see if loading is still
            # necessary:
            if file_path in file_caches:
                # Another process has created the cache entry:
                data1, data2 = file_caches[file_path]
            else:    
                data1 = np.load(file_path)
                data2 = get_data2()
                file_caches[file_path] = (data1, data2)
    else:
        # Load original file
        data1, data2 = read_and_process(original_file_path)
        # Did someone else create the cache entry in the meanwhile?
        # Now we must acquire the lock:
        with file_cache_lock:
            # Check one more time to see if write is still necessary:
            if file_path not in file_caches:
                # It's okay to update the cache now:
                file_caches[file_path] = (data1, data2)        
                file_path.parent.mkdir(parents=True, exist_ok=True)        
                with open(file_path, "wb") as f:
                    np.save(f, data1, allow_pickle=False)
    ... # rest of code that uses data1 and data2 omitted
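A complementary technique, not part of the answer above, is to make the file write itself atomic, so that a reader never sees a partially written file even without holding a lock. The sketch below writes to a temporary file in the same directory and then renames it over the target with `os.replace`. The helper name `atomic_write_bytes` is hypothetical, and raw bytes are used here in place of `np.save` to keep the example free of third-party imports.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(file_path: Path, payload: bytes) -> None:
    """Write payload so readers see either the old state or the full file."""
    file_path.parent.mkdir(parents=True, exist_ok=True)
    # The temporary file lives in the target directory so that the final
    # rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=file_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())
        # The rename is atomic on POSIX systems; file_path appears only
        # once the data has been written completely.
        os.replace(tmp_name, file_path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

With NumPy, the same idea would presumably apply by calling `np.save(f, data1, allow_pickle=False)` on the temporary file object instead of `f.write(payload)`. With this in place, the `file_path.exists()` check only becomes true for complete files, although two processes may still do the processing work redundantly.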

Answer 1
0 2022-12-16 07:22:38

Answer 2
0 2022-12-18 14:51:55