Python multiprocessing for loops / other iterables

Question

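I am trying to find a way to use the multiprocessing package to cut down the time it takes to run some code I have.

Essentially, I have a matching calculation done with multiple nested for loops, and I want to make the most of the 12-core processor I have available. I have found some documentation and answers on for loops and multiprocessing, but for some reason it just isn't clicking for me. Anyway...

I have two large DataFrames that I have converted to lists of lists so that I can iterate over them more easily.

They both follow the same format but hold different values. For example, the DF/lists look like this:

TT and CT: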

| user_id | hour1_avg | hour2_avg | ... | hour24_avg | hour1_stdev | ... | hour24_stdev |
|---------|-----------|-----------|-----|------------|-------------|-----|--------------|
| 12345   |   1.34    |   2.14    | ... |    3.24    |    .942     | ... |     .834     |
| 54321   |   2.14    |   3.10    | ... |    6.26    |    .826     | ... |     .018     |

They are then converted to lists of lists using .values.tolist().

TTL and CTL:

[[12345, 1.34, 2.14,...3.24,.942,....834],[54321, 2.14, 3.10,...6.26, .826,....018], [etc]]
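
As a small illustration of that conversion, here is a sketch on a made-up two-row frame. The shortened column set (one hour instead of 24) and the variable names `tt` and `ttl` are assumptions for illustration only. Note that `.values` yields a single float array here, so the integer ids come back as floats.

```python
import pandas as pd

# Hypothetical miniature of TT: one hour column pair instead of 24.
tt = pd.DataFrame(
    {
        "user_id": [12345, 54321],
        "hour1_avg": [1.34, 2.14],
        "hour1_stdev": [0.942, 0.826],
    }
)

# .values returns a NumPy array; .tolist() turns it into a list of lists.
# Mixed int/float columns are upcast to float64, so user_id becomes a float.
ttl = tt.values.tolist()
print(ttl)  # [[12345.0, 1.34, 0.942], [54321.0, 2.14, 0.826]]
```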

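My code iterates over both lists of lists, computes a value for each hour, and, if all 24 hours satisfy the condition in the if statement, appends the pair to the pairs list. Pairs that do not meet the condition can be discarded.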

pairs = [] #output for for loops

start_time = time.time()
for idx, a in enumerate(ttl): # iterate through primary list of list
    if idx % 12 != 0: #used to separate for 12 processors (0-11 to split processes manually)
        continue
    for b in ctl: # iterate through second list of list 
        i = 0
        tval_avg = [] # used to calculate average between computed variables in the loop
        for c in range(1,31): # iterate through hour avg and stdev 
            i += 1
            tval = np.absolute((a[c] - b[c])/np.sqrt((a[c+24]**2/31)+(b[c+24]**2/31))) 
            if math.isnan(tval) or tval > 2.04:
                break
            else:
                tval_avg.append(tval)
                if i == 24:  # checks to make sure each hour matches criteria to before being returned
                    pairs.append([a[0], b[0], a[2], a[3], np.mean(tval_avg)])
    if idx % 10 == 0 :
        print(idx) # check progress of loop
print("--- %s seconds ---" % (time.time() - start_time)) # show total time at the end
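
For reference, the per-hour test in the loop above can be written as a formula. The symbols are my own notation, not the poster's: \bar{a}_h and \bar{b}_h stand for the hour-h averages of the two rows, and s_{a,h}, s_{b,h} for the matching standard deviations stored 24 columns further right. The constant 31 is copied from the code; it looks like a per-user sample size, but the post does not say so. The formula assumes the intended 24 hours, in line with the `i == 24` check.

```latex
t_h = \frac{\lvert \bar{a}_h - \bar{b}_h \rvert}
           {\sqrt{\dfrac{s_{a,h}^{2}}{31} + \dfrac{s_{b,h}^{2}}{31}}},
\qquad h = 1, \dots, 24
```

A pair (a, b) is kept only if every t_h is a number (not NaN) and t_h \le 2.04 for all 24 hours; the value stored with the pair is the mean of the 24 t_h.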

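This approach works if I manually open 12 kernels in Spyder, assign 0-11 to the if idx % statement, and run them all (which lets me use more processors). My goal is to run everything in one kernel, using multiprocessing to hand out 12 (or however many is efficient) different "jobs", one per processor, and collect the results into a single DataFrame. Is that possible with this kind of code? If so, what changes would I need to make?

Sorry if this is complicated. I'm happy to explain further if needed.

I have searched around SO for something similar to my specific problem but haven't found anything. I am also having trouble understanding multiprocessing and how to apply it to this particular scenario, so any help is much appreciated!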

Answer 1

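This runs in under 1.5 minutes on my notebook with the large DFs. However, the non-multiprocessing variant is not much slower.
Edit: That is actually only true when the threshold is so high that no (or very few) pairs are found. If you get many pairs, the IPC overhead is large and the non-multiprocessing variant is considerably faster. At least for me.

I verified the results by changing the filter from >2.04 to >20, which suits the uniform sample I generated better.
Both of our algorithms appear to produce the same list of pairs (once I fixed the range and removed the idx % 12 part).

By the way, I used tqdm to visualize progress; it is a very handy library.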

import math

import pandas as pd
import numpy as np
import tqdm
import multiprocessing

avg_cols = [f"hour{i}_avg" for i in range(1, 25)]
stdev_cols = [f"hour{i}_stdev" for i in range(1, 25)]
columns = ["userid"] + avg_cols + stdev_cols
np.random.seed(23)
# threshold = 2.04
# rands_tt = np.random.rand(3000, 49)
# rands_ct = np.random.rand(112000, 49)
threshold = 20
rands_tt = np.random.rand(2, 49)
rands_ct = np.random.rand(10, 49)

multipliers = np.repeat([1000000, 5, 2], [1, 24, 24])[None, :]

TT = pd.DataFrame(data=rands_tt * multipliers, columns=columns)
CT = pd.DataFrame(data=rands_ct * multipliers, columns=columns)

pairs = []

tt_complete = TT.loc[:, columns].to_numpy()
ct_complete = CT.loc[:, columns].to_numpy()

avg = slice(1, 25)
stdev = slice(25, 49)
# do the **2/31 calculations only once
tt_complete[:, stdev] **= 2
tt_complete[:, stdev] /= 31

ct_complete[:, stdev] **= 2
ct_complete[:, stdev] /= 31


def find_pairs(tt_row):
    tvals = np.absolute(
        (tt_row[None, avg] - ct_complete[:, avg]) / np.sqrt(tt_row[None, stdev] + ct_complete[:, stdev])
    )

    # a NaN in a row propagates to that row's max, and NaN <= threshold evaluates to False
    valid_tval_idxs = np.where(tvals.max(axis=1) <= threshold)[0]
    mean_tvals = tvals.mean(axis=1)

    return [[tt_row[0], ct_complete[i, 0], tt_row[2], tt_row[3], mean_tvals[i]] for i in valid_tval_idxs]


# for tt_row in tqdm.tqdm(tt_complete):
#     pairs.extend(find_pairs(tt_row))


with multiprocessing.Pool(6) as pool:
    pairlist_iterable = pool.imap_unordered(find_pairs, tt_complete, chunksize=200)
    for pairlist in tqdm.tqdm(pairlist_iterable, total=len(tt_complete)):
        pairs.extend(pairlist)


ttl = TT.to_numpy().tolist()
ctl = CT.to_numpy().tolist()

pairs2 = []  # output for for loops

for idx, a in enumerate(ttl):  # iterate through primary list of list

    for b in ctl:  # iterate through second list of list
        i = 0
        tval_avg = []  # used to calculate average between computed variables in the loop
        for c in range(1, 25):  # iterate through hour avg and stdev
            i += 1
            tval = np.absolute((a[c] - b[c]) / np.sqrt((a[c + 24] ** 2 / 31) + (b[c + 24] ** 2 / 31)))
            if math.isnan(tval) or tval > threshold:
                break
            else:
                tval_avg.append(tval)
                if i == 24:  # checks to make sure each hour matches criteria to before being returned
                    pairs2.append([a[0], b[0], a[2], a[3], np.mean(tval_avg)])

print(pairs)   
print(pairs2)
print(pairs == pairs2)

The output is:

100%|██████████| 2/2 [00:00<00:00, 2150.93it/s]
[[517297.88384658925, 878265.8552092713, 3.8272987969845347, 1.4119792198355636, 6.95265573421445]]
[[517297.88384658925, 878265.8552092713, 3.8272987969845347, 1.4119792198355636, 6.95265573421445]]
True
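
The filter in find_pairs relies on NaN handling: a NaN anywhere in a row becomes that row's maximum, and comparing NaN with the threshold yields False, so the row is dropped without an explicit isnan check. A minimal sketch of that behaviour, using a made-up 3x3 array in place of the real 24-column tvals:

```python
import numpy as np

threshold = 2.04

# Hypothetical t-values: one row per candidate match, one column per hour.
tvals = np.array(
    [
        [0.5, 1.0, 1.5],     # every hour within the threshold
        [0.5, np.nan, 1.5],  # one hour is NaN
        [0.5, 3.0, 1.5],     # one hour exceeds the threshold
    ]
)

row_max = tvals.max(axis=1)                   # NaN propagates: row 1 has max NaN
keep = np.where(row_max <= threshold)[0]      # NaN <= threshold is False
print(keep)  # [0]
```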

Answer 2

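Your outer loop is over ttl. Move the code in that loop's body into a helper function that takes a as input and returns (tval_avg, pairs).

Then use map to call that helper repeatedly.

The returned tuples will be serialized and sent back to the parent process. You will need to combine the results from the individual workers to get the same result your original code computes.

Alternatively, you might prefer to serialize each helper's results to a uniquely named file.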
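
The helper-plus-map idea could be sketched roughly as follows. This is one possible reading, not the answerer's own code: the names `match_row` and `find_all_pairs` are hypothetical, the helper returns only the pairs (each pair already carries its mean t-value) rather than a (tval_avg, pairs) tuple, and the inner loop is limited to the 24 hour columns. A zero denominator is treated as NaN, which rejects the pair just as the original loop would.

```python
import math
from functools import partial
from multiprocessing import Pool

N_HOURS = 24      # hour columns per row, as in the question
N_SAMPLES = 31    # constant copied from the question's formula
THRESHOLD = 2.04  # cut-off used in the question


def match_row(a, ctl):
    """Return all pairs for one row `a` of ttl against every row of ctl."""
    pairs = []
    for b in ctl:
        tvals = []
        for c in range(1, N_HOURS + 1):
            denom = math.sqrt(
                a[c + N_HOURS] ** 2 / N_SAMPLES + b[c + N_HOURS] ** 2 / N_SAMPLES
            )
            tval = abs(a[c] - b[c]) / denom if denom else math.nan
            if math.isnan(tval) or tval > THRESHOLD:
                break  # one failing hour rejects the pair
            tvals.append(tval)
        else:
            # The loop ended without break, so all 24 hours passed.
            pairs.append([a[0], b[0], a[2], a[3], sum(tvals) / N_HOURS])
    return pairs


def find_all_pairs(ttl, ctl, processes=4):
    """Map match_row over ttl in worker processes and flatten the results."""
    with Pool(processes) as pool:
        per_row = pool.map(partial(match_row, ctl=ctl), ttl)
    return [pair for row_pairs in per_row for pair in row_pairs]
```

Calling `find_all_pairs(ttl, ctl, processes=12)` from under an `if __name__ == "__main__":` guard would replace the manual `idx % 12` split. Because `partial` ships ctl to the workers along with the tasks, a very large ctl may be cheaper to load once per worker, for example through the Pool's `initializer` argument.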

2 answers

Answer 1
1, accepted, 2019-08-17 23:28:53

Answer 2
0, 2019-08-17 19:21:29
