How can I use multithreading or parallel processing to reduce run time?

Question

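I have nested loops: for each row of the outer loop, I check the sum of its values against every row of the inner loop. I get the result I want, but it takes several hours. Is there any way to reduce the run time?

I am iterating over all rows with df.iterrows(). df1 has 1 million rows and df2 has 1,000 rows.

Getting the run time down to 5-10 minutes or less would help a lot, because the same job has to be repeated every day.

This is what the data frames look like: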

df1......
       col1      col2  NEWVALUE
0  0.727900  0.007912       NaN
1  0.249418  0.087288       NaN
2  0.592969  0.443518       NaN
3  0.832903  0.101647       NaN
4  0.129666  0.321423       NaN
df2...
       col1      col2  OLDVALUE
0  0.176620  0.857886        43
1  0.758241  0.086826       609
2  0.855264  0.959226       388
3  0.929884  0.349760       137
4  0.693689  0.375171         0

Here is the code:

list_values = []
for idx, xitems in df1.iterrows():
    savVal = -1
    i = 99
    for idy, yitems in df2.iterrows():
        value = xitems['col1'] + xitems['col2'] + yitems['col1'] + yitems['col2']
        #it only runs for the first time to store the value into savVal
        if savVal == -1:
            savVal = value

        else:
            if value <= 1 and value < savVal:
                savVal = value
                i = idy
                break
    if i == 99:
        #df1.iat[idx , 'NEWVALUE'] = "LESSTHAN"
        #in case above code throws error then alternative is list
        list_values.append("LESSTHAN")
    else:
        #df1.iat[idx, 'NEWVALUE'] = df2.loc[i, 'OLDVALUE']
        list_values.append(df2.loc[i, 'OLDVALUE'])

Answer 1

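As mentioned in the comments, you should try to avoid iterrows and think of this as a matrix problem instead. My first step is to compute the sum of "col1" and "col2" for each data frame separately: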

df1["sum_col"] = df1["col1"] + df1["col2"]
df2["sum_col"] = df2["col1"] + df2["col2"]

These can then be added together with a bit of numpy magic (broadcasting) to get every possible pairwise sum:

import numpy as np

all_values = (df1["sum_col"].values[np.newaxis].T +
              df2["sum_col"].values[np.newaxis])

all_values now has shape (1000000, 1000) and holds every possible sum of the two columns.
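To make the broadcasting step concrete, here is a minimal sketch on toy data. The arrays x and y are hypothetical stand-ins for the two sum_col columns, not the asker's data:

```python
import numpy as np

# Hypothetical toy sums standing in for df1["sum_col"] and df2["sum_col"].
x = np.array([0.1, 0.2, 0.3])   # 3 "df1" rows
y = np.array([1.0, 2.0])        # 2 "df2" rows

# x[np.newaxis].T has shape (3, 1); y[np.newaxis] has shape (1, 2).
# Broadcasting expands both to (3, 2), so pairwise[i, j] == x[i] + y[j].
pairwise = x[np.newaxis].T + y[np.newaxis]

print(pairwise.shape)
```

The same result can be written as x[:, None] + y[None, :], which some readers find easier to follow.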

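Now, the next part is where I'm not quite sure what you're trying to do, so please correct me if I'm wrong. It looks to me like you set savVal to the value from the first row of df2 (?) on each iteration of the outer loop. In that case it should have shape 1000000, so we can do: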

sav_val = all_values[:, 0]

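Then we want to find the first (?) value in the inner loop that is less than or equal to 1 and also less than sav_val. Let's check each of these conditions separately: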

less_than_one = np.less_equal(all_values, 1)

and

less_than_sav_val = np.less(all_values.T, sav_val).T

.T is the transpose, which lets the comparison broadcast to the right shape.
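If the double transpose looks opaque, the following sketch checks it against an equivalent form that adds a trailing axis to sav_val. The 3x2 matrix is made-up data for illustration only:

```python
import numpy as np

# Hypothetical 3x2 matrix of sums; column 0 plays the role of sav_val.
all_values = np.array([[0.9, 0.5],
                       [0.4, 0.7],
                       [1.5, 1.2]])
sav_val = all_values[:, 0]          # shape (3,)

# Form used in the answer: transpose, compare against (3,), transpose back.
via_transpose = np.less(all_values.T, sav_val).T

# Equivalent form: give sav_val a trailing axis so it broadcasts across columns.
via_newaxis = all_values < sav_val[:, np.newaxis]

print(np.array_equal(via_transpose, via_newaxis))
```

Both forms compare every entry of a row with that row's first entry, so column 0 is always False.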

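We can combine the two conditions and use argmax to find the first True value in each row (see this question). If a row has no True value, argmax returns the first entry of that row (index 0).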

passes_condition = less_than_one & less_than_sav_val
result = df2['OLDVALUE'].values.take(passes_condition.argmax(axis=1))

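OK, nearly there. result has shape 1000000. We can now replace the entries for which no value was both <= 1 and less than the first iteration's value. For now, we set them to -999.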

result[~passes_condition.any(axis=1)] = -999
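The argmax and any steps can be seen on a tiny example. This is only a sketch: the boolean matrix is invented, and old_values is a stand-in for df2['OLDVALUE'].values:

```python
import numpy as np

# Hypothetical condition matrix: one row per df1 row, one column per df2 row.
passes_condition = np.array([[False, True,  True],    # first True at column 1
                             [False, False, False],   # no True at all
                             [False, False, True]])   # first True at column 2
old_values = np.array([43, 609, 388])  # stand-in for df2['OLDVALUE'].values

first_true = passes_condition.argmax(axis=1)   # an all-False row yields 0
result = old_values.take(first_true)           # take returns a new array
result[~passes_condition.any(axis=1)] = -999   # mark rows with no match

print(first_true, result)
```

The all-False row is the reason for the extra any mask: without it, that row would silently pick up the OLDVALUE at index 0.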

result still has shape 1000000.

Putting it all together:

def rajat_func(df1, df2):
    list_values = []
    for idx, xitems in df1.iterrows():
        savVal = -1
        i = 99
        for idy, yitems in df2.iterrows():
            value = xitems['col1'] + xitems['col2'] + yitems['col1'] + yitems['col2']
            #it only runs for the first time to store the value into savVal
            if savVal == -1:
                savVal = value
            else:
                if value <= 1 and value < savVal:
                    savVal = value
                    i = idy
                    break
        if i == 99:
            #df1.iat[idx , 'NEWVALUE'] = "LESSTHAN"
            #in case above code throws error then alternative is list
            list_values.append(-999)
        else:
            #df1.iat[idx, 'NEWVALUE'] = df2.loc[i, 'OLDVALUE']
            list_values.append(df2.loc[i, 'OLDVALUE'])
    return list_values

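import numpy as np

def new_func(df1, df2):
    # Row sums of each frame as plain numpy arrays
    x = (df1["col1"] + df1["col2"]).values
    y = (df2["col1"] + df2["col2"]).values
    # Broadcast (n, 1) + (1, m) to get all pairwise sums, shape (n, m)
    all_values = (x[np.newaxis].T + y[np.newaxis])
    # The sum against the first row of df2 plays the role of savVal
    sav_val = all_values[:, 0]
    less_than_one = np.less_equal(all_values, 1)
    less_than_sav_val = np.less(all_values.T, sav_val).T
    passes_condition = less_than_one & less_than_sav_val
    # argmax gives the first True per row (0 if there is none)
    result = df2['OLDVALUE'].values.take(passes_condition.argmax(axis=1))
    # Rows with no match get the sentinel -999
    result[~passes_condition.any(axis=1)] = -999
    return result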

Testing with a df1 of 1,000 rows and a df2 of 100 rows:

all(new_func(df1, df2) == rajat_func(df1, df2))

evaluates to True.

%timeit rajat_func(df1, df2)

gives

5.07 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit new_func(df1, df2)

gives

601 µs ± 17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So quite a big improvement! Running %time on new_func with a df1 of 1,000,000 rows and a df2 of 1,000 rows gives

CPU times: user 4.9 s, sys: 3.05 s, total: 7.96 s
Wall time: 7.99 s
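One caveat at this size: a float64 array of shape (1000000, 1000) takes about 8 GB on its own, before the boolean masks. If memory is tight, one possible variation is to process df1 in chunks. The sketch below is an assumption on my part, not something benchmarked here; new_func_chunked and chunk_size are hypothetical names, and the frames in the usage lines are random data:

```python
import numpy as np
import pandas as pd

def new_func_chunked(df1, df2, chunk_size=100_000):
    """Same logic as new_func, evaluated chunk_size rows of df1 at a time."""
    x = (df1["col1"] + df1["col2"]).values
    y = (df2["col1"] + df2["col2"]).values
    old_values = df2["OLDVALUE"].values
    pieces = []
    for start in range(0, len(x), chunk_size):
        # Intermediate arrays are at most (chunk_size, len(df2)).
        all_values = x[start:start + chunk_size, np.newaxis] + y[np.newaxis, :]
        sav_val = all_values[:, :1]            # first column, kept 2-D to broadcast
        passes = (all_values <= 1) & (all_values < sav_val)
        result = old_values.take(passes.argmax(axis=1))
        result[~passes.any(axis=1)] = -999     # no match in this row
        pieces.append(result)
    if not pieces:                             # empty df1: return an empty result
        return old_values[:0].copy()
    return np.concatenate(pieces)

# Usage with small hypothetical frames (random data, not the asker's).
rng = np.random.default_rng(0)
df1 = pd.DataFrame({"col1": rng.random(250) * 0.5, "col2": rng.random(250) * 0.5})
df2 = pd.DataFrame({"col1": rng.random(40) * 0.5,
                    "col2": rng.random(40) * 0.5,
                    "OLDVALUE": rng.integers(0, 1000, 40)})
out = new_func_chunked(df1, df2, chunk_size=64)
print(out[:10])
```

Since each chunk is independent, the chunks could also be handed to separate worker processes, though whether that pays off would need measuring on the real data.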

Does this solve your problem, or have I completely misunderstood what you're trying to do?

Solution 1
1 Accepted 2019-08-28 13:14:15