How can I improve the performance of a Python loop?

Question

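I have a DataFrame with almost 14 million rows. I am working with financial option data and, ideally, I need an interest rate (the so-called risk-free rate) for each option according to its time to maturity. According to the literature I am following, one way to do this is to take US Treasury rates and, for each option, find the Treasury rate whose maturity is closest to the option's time to maturity (in absolute terms). To achieve this I wrote a loop that fills a DataFrame with those differences. My code is far from elegant and a bit messy, because there are combinations of dates and maturities for which no rate exists; hence the conditions inside the loop. Once the loop is done, I can look up which maturity has the smallest absolute difference and pick the rate for that maturity. The script takes so long to run that I added tqdm to get some feedback on what is happening.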

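I tried running the code. It would take days to complete, and it slows down as the iterations increase (I know this from tqdm). At first I added rows to the differences DataFrame with DataFrame.loc. Because I thought that was why the code slowed down over time, I switched to DataFrame.append. The code is still slow and still gets slower over time.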

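I searched for a way to improve performance and found this question: How to speed up python loop. Someone suggested Cython, but honestly I still consider myself a Python beginner, and judging from the examples it does not look like an easy thing to do. Is that my best option? If it takes a lot of time to learn, I could also do what others in the literature do and simply use the 3-month rate for all options. But I would prefer not to go there. Maybe there are other (simple) answers to my problem; please let me know. I have included a reproducible code sample (with only 2 rows of data, though):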

from tqdm import tqdm
import pandas as pd


# Treasury maturities, in years
treasury_maturities = [1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30]

# Useful lists
treasury_maturities1 = [3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30]
treasury_maturities2 = [1/12]
treasury_maturities3 = [6/12, 1, 2, 3, 5, 7, 10, 20, 30]
treasury_maturities4 = [1, 2, 3, 5, 7, 10, 20, 30]
treasury_maturities5 = [1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20]

# Dataframe that will contain the difference between the time to maturity of option and the different maturities
differences = pd.DataFrame(columns = treasury_maturities)


# Options Dataframe sample
options_list = [[pd.to_datetime("2004-01-02"), pd.to_datetime("2004-01-17"), 800.0, "c",    309.1, 311.1, 1108.49, 1108.49, 0.0410958904109589, 310.1], [pd.to_datetime("2004-01-02"), pd.to_datetime("2004-01-17"), 800.0, "p", 0.0, 0.05, 1108.49, 1108.49, 0.0410958904109589, 0.025]]

options = pd.DataFrame(options_list, columns = ['QuoteDate', 'expiration', 'strike', 'OptionType', 'bid_eod', 'ask_eod', 'underlying_bid_eod', 'underlying_ask_eod', 'Time_to_Maturity', 'Option_Average_Price'])


# Loop
for index, row in tqdm(options.iterrows()):
    if pd.to_datetime("2004-01-02") <= row.QuoteDate <= pd.to_datetime("2018-10-15"):
        if pd.to_datetime("2004-01-02") <= row.QuoteDate <= pd.to_datetime("2006-02-08") and row.Time_to_Maturity > 25:
            list_s = ([abs(maturity - row.Time_to_Maturity) for maturity in 
              treasury_maturities5])
            list_s = [list_s + [40]] # 40 is an arbitrary number bigger than 30
            differences = differences.append(pd.DataFrame(list_s, 
                        columns = treasury_maturities), ignore_index = True) 
        elif (pd.to_datetime("2008-12-10") or pd.to_datetime("2008-12-18") or pd.to_datetime("2008-12-24")) == row.QuoteDate and 1.5/12 <= row.Time_to_Maturity <= 3.5/12:
            list_s = [0, 40, 40]
            list_s = [list_s + [abs(maturity - row.Time_to_Maturity) for 
                                   maturity in treasury_maturities3]]
            differences = differences.append(pd.DataFrame(list_s, 
                        columns = treasury_maturities), ignore_index = True)
        elif (pd.to_datetime("2008-12-10") or pd.to_datetime("2008-12-18") or pd.to_datetime("2008-12-24")) == row.QuoteDate and 3.5/12 < row.Time_to_Maturity <= 4.5/12:    
            list_s = ([abs(maturity - row.Time_to_Maturity) for maturity in 
                           treasury_maturities2])
            list_s = list_s + [40, 40, 0]
            list_s = [list_s + [abs(maturity - row.Time_to_Maturity) for 
                                   maturity in treasury_maturities4]]
            differences = differences.append(pd.DataFrame(list_s, 
                        columns = treasury_maturities), ignore_index = True)
        else:
            if 1.5/12 <= row.Time_to_Maturity <= 2/12:
                list_s = [0, 40]
                list_s = [list_s + [abs(maturity - row.Time_to_Maturity) for maturity in 
              treasury_maturities1]]
                differences = differences.append(pd.DataFrame(list_s, 
                        columns = treasury_maturities), ignore_index = True)
            elif 2/12 < row.Time_to_Maturity <= 2.5/12:
                list_s = ([abs(maturity - row.Time_to_Maturity) for maturity in 
              treasury_maturities2])
                list_s = list_s + [40, 0]
                list_s = [list_s + [abs(maturity - row.Time_to_Maturity) for maturity in 
              treasury_maturities3]]
                differences = differences.append(pd.DataFrame(list_s, 
                        columns = treasury_maturities), ignore_index = True)
            else:
                list_s = [[abs(maturity - row.Time_to_Maturity) for maturity in 
              treasury_maturities]]
                differences = differences.append(pd.DataFrame(list_s, 
                        columns = treasury_maturities), ignore_index = True)
    else:        
        list_s = [[abs(maturity - row.Time_to_Maturity) for maturity in 
              treasury_maturities]]
        differences = differences.append(pd.DataFrame(list_s, 
                        columns = treasury_maturities), ignore_index = True)

Answer 1

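Short answer

Loops and if statements are both computationally expensive operations, so look for ways to reduce how often they run.

Loop optimization: the best way to speed up a loop is to move as much computation as possible out of the loop.

DRY: Don't Repeat Yourself. You have several redundant if conditions; review the nested if conditions and follow the DRY principle.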
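As a rough illustration of the two points above, the sketch below collects each row in a plain Python list and builds the DataFrame once after the loop, instead of calling DataFrame.append on every iteration (each such call returns a new DataFrame and copies everything collected so far). The values in `times_to_maturity` are made up for the example, and the special date cases from the question are left out.

```python
import pandas as pd

# Treasury maturities in years, copied from the question
treasury_maturities = [1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30]

# Hypothetical times to maturity (in years), standing in for the options data
times_to_maturity = [0.0411, 0.3, 1.7]

rows = []  # appending to a list does not copy the rows collected so far
for ttm in times_to_maturity:
    rows.append([abs(maturity - ttm) for maturity in treasury_maturities])

# Build the DataFrame once, after the loop
differences = pd.DataFrame(rows, columns=treasury_maturities)

# Column label (maturity) with the smallest absolute difference in each row
closest = differences.idxmin(axis=1)
print(closest)
```

This still loops in Python, so it is only a first step; it removes the repeated DataFrame construction but not the per-row work.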

Use pandas and numpy

One of the main benefits of libraries such as pandas and numpy is that they are designed for efficiency in mathematical operations on arrays (see Why are numpy arrays so fast?). This means you usually do not have to use loops at all. Instead of creating a new DataFrame inside the loop, create a new column for each value you are computing.
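To make the "no loop at all" idea concrete, here is a minimal sketch that uses numpy broadcasting to compute every absolute difference in one step and then picks the closest maturity per row. The four sample values are invented, and the special cases from the question (dates with missing rates) are ignored here.

```python
import numpy as np
import pandas as pd

# Treasury maturities in years, copied from the question
maturities = np.array([1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30])

# Hypothetical sample of options (only the column needed here)
options = pd.DataFrame({"Time_to_Maturity": [0.0411, 0.3, 1.7, 26.0]})

# Shape (n_rows, 1) minus shape (n_maturities,) broadcasts to (n_rows, n_maturities)
ttm = options["Time_to_Maturity"].to_numpy()
diffs = np.abs(ttm[:, None] - maturities)

# Position of the smallest difference in each row, mapped back to a maturity
options["nearest_maturity"] = maturities[diffs.argmin(axis=1)]
print(options)
```

For 14 million rows and 12 maturities the intermediate float64 array takes roughly 1.3 GB, so it may be worth processing the data in slices if memory is tight.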

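To handle the different logic for different dates and so on, filter the rows and then apply the logic: use masks/filters to select only the rows that need a given operation, instead of using if statements (see the pandas filtering tutorial).

Code example

This code is not a replica of your logic, but an example of how it could be implemented. It is not perfect, but it should provide some major efficiency improvements.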
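A small sketch of how a mask can replace one of the if branches: the date window below and the assumption that the 30-year rate is unavailable inside it are read off the first branch of the question's loop, and `np.inf` is used in place of the arbitrary value 40. The sample rows are made up.

```python
import numpy as np
import pandas as pd

# Treasury maturities in years, copied from the question
maturities = np.array([1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30])

# Hypothetical sample of options
options = pd.DataFrame({
    "QuoteDate": pd.to_datetime(["2004-01-02", "2005-06-01", "2010-01-04"]),
    "Time_to_Maturity": [26.0, 0.3, 26.0],
})

diffs = np.abs(options["Time_to_Maturity"].to_numpy()[:, None] - maturities)

# Boolean mask: rows quoted while the 30-year rate is assumed to be unavailable
no_30y = options["QuoteDate"].between(
    pd.Timestamp("2004-01-02"), pd.Timestamp("2006-02-08")
).to_numpy()

# Rule out the 30-year column (last one) for the masked rows only
diffs[no_30y, -1] = np.inf

options["nearest_maturity"] = maturities[diffs.argmin(axis=1)]
print(options)
```

Each further special case would get its own mask and its own column assignment, so the rules stay separate from the arithmetic.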

import pandas as pd
import numpy as np

# Maturity periods, months and years
month_periods = np.array([1, 2, 3, 6, ], dtype=np.float64)
year_periods = np.array([1, 2, 3, 4, 5, 7, 10, 20, 30, ], dtype=np.float64)

# Create column names for maturities
maturity_cols = [f"month_{m:02.0f}" for m in month_periods] + [f"year_{y:02.0f}" for y in year_periods]

# Normalise months  & concatenate into single array
month_periods = month_periods / 12
maturities = np.concatenate((month_periods, year_periods))

# Create some dummy data
n_records = 1_000_000  # Assumed sample size; the original snippet does not define n_records
np.random.seed(seed=42)  # Seed the pseudo-random number generator
date_range = pd.date_range(start="2004-01-01", end="2021-01-30", freq='D')  # Dates to sample from
dates = np.random.choice(date_range, size=n_records, replace=True)
maturity_times = np.random.random(size=n_records)
options = pd.DataFrame({'QuoteDate': dates, 'Time_to_Maturity': maturity_times})

# Create date masks
after = options['QuoteDate'] >= pd.to_datetime("2008-01-01")
before = options['QuoteDate'] <= pd.to_datetime("2015-01-01")

# Combine date masks / create flipped version
between = after & before
outside = np.logical_not(between)

# Select data with masks
df_outside = options[outside].copy()
df_between = options[between].copy()

# Smaller dataframes
df_a = df_between[df_between['Time_to_Maturity'] > 25].copy()
df_b = df_between[df_between['Time_to_Maturity'] <= 3.5 / 12].copy()
df_c = df_between[df_between['Time_to_Maturity'] <= 4.5 / 12].copy()
df_d = df_between[
    (df_between['Time_to_Maturity'] >= 2 / 12) & (df_between['Time_to_Maturity'] <= 4.5 / 12)].copy()

# For each maturity period, add difference column using different formula
for i, col in enumerate(maturity_cols):
    # Add a line here for each subset / chunk of data which requires a different formula.
    # Each subset uses its own 'Time_to_Maturity' column so that the row indices align;
    # using df_outside here would produce NaN for every row of the other subsets.
    df_a[col] = ((maturities[i] - df_a['Time_to_Maturity']) + 40).abs()
    df_b[col] = ((maturities[i] - df_b['Time_to_Maturity']) / 2).abs()
    df_c[col] = (maturities[i] - df_c['Time_to_Maturity'] + 1).abs()
    df_d[col] = (maturities[i] - df_d['Time_to_Maturity'] * 0.8).abs()
    df_outside[col] = (maturities[i] - df_outside['Time_to_Maturity']).abs()

# Concatenate dataframes back to one dataset
frames = [df_outside, df_a, df_b, df_c, df_d, ]
output = pd.concat(frames).dropna(how='any')

output.head()

Average execution time by number of records

Even millions of records are processed quickly (memory permitting).

| Records | Old time (s) | New time (s) | Improvement |
|-|-|-|-|
| 10 | 0.0105 | 0.0244 | -132.38% |
| 100 | 0.1078 | 0.0249 | 76.90% |
| 1,000 (1k) | 1.03 | 0.0249 | 97.58% |
| 10,000 (10k) | 15.629 | 0.0322 | 99.79% |
| 100,000 (100k) | 182.014 | 0.065 | 99.96% |
| 1,000,000 (1m) | ? | 0.4014 | ? |
| 10,000,000 (10m) | ? | 4.7488 | ? |
| 14,000,000 (14m) | ? | 6.0172 | ? |
| 100,000,000 (100m) | ? | 83.286 | ? |

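Further optimization

After optimizing and profiling the basic code, you could also look into multithreading, parallelizing the code, or using a different language. Also, 14 million records will take up a lot of RAM, potentially more than a typical workstation can handle. To get around this limitation, you can read the file itself in chunks and perform the calculations one chunk at a time: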

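import pandas as pd

result_frames = []
for chunk in pd.read_csv("voters.csv", chunksize=10000):
    # Do things here
    result = chunk
    result_frames.append(result)

# Combine the per-chunk results into one DataFrame
output = pd.concat(result_frames, ignore_index=True)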
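To show what the "Do things here" step might look like, the following sketch runs a nearest-maturity calculation on each chunk and concatenates the results. It reads from an in-memory CSV so that it is self-contained; the file contents, the chunk size of 2, and the helper `nearest_maturity` are all invented for the example.

```python
import io

import numpy as np
import pandas as pd

# Treasury maturities in years, copied from the question
maturities = np.array([1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30])

# Hypothetical stand-in for a large CSV file on disk
csv_data = io.StringIO("Time_to_Maturity\n0.0411\n0.3\n1.7\n26.0\n7.9\n")


def nearest_maturity(chunk):
    """Return the chunk with the closest Treasury maturity added as a column."""
    diffs = np.abs(chunk["Time_to_Maturity"].to_numpy()[:, None] - maturities)
    return chunk.assign(nearest_maturity=maturities[diffs.argmin(axis=1)])


# Only one chunk (plus its result) is processed at a time
results = [nearest_maturity(chunk) for chunk in pd.read_csv(csv_data, chunksize=2)]
output = pd.concat(results, ignore_index=True)
print(output)
```

If even the combined result is too large, each chunk's result could be written to disk instead of being kept in a list.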

Google search terms: multiprocessing / threading / Dask / PySpark

Answer 2

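For your problem, "divide and conquer" can guide you to a solution. I suggest splitting your code into pieces and analyzing each part, because I see some redundancy like this: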

(pd.to_datetime("2008-12-10") or pd.to_datetime("2008-12-18") or pd.to_datetime("2008-12-24"))

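It seems the conversion from string to datetime is done on every row. You should use a profiler or a more specific tool (such as perf_tool [*]) to analyze your code. It helps by placing sentinels in the code and reporting all intermediate times, call counts, and methods.

[*] I am the main developer.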
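One possible way to avoid the repeated conversion is to parse the dates once, before any loop, and test membership against them. The sketch below also shows a side effect of the quoted expression: `a or b or c` evaluates to `a` whenever `a` is truthy, so only the first date is ever compared. The names `special_index`, `special_set`, and the sample quote dates are made up for the example.

```python
import pandas as pd

# Parsed once, outside any loop
special_index = pd.to_datetime(["2008-12-10", "2008-12-18", "2008-12-24"])
special_set = set(special_index)

quote_date = pd.Timestamp("2008-12-18")

# The pattern from the question: the `or` chain collapses to the first Timestamp
first_only = (
    pd.Timestamp("2008-12-10")
    or pd.Timestamp("2008-12-18")
    or pd.Timestamp("2008-12-24")
) == quote_date

# Membership test against the precomputed set
is_special = quote_date in special_set

# Vectorized equivalent for a whole column of quote dates
quote_dates = pd.Series(pd.to_datetime(["2008-12-10", "2008-12-18", "2009-01-05"]))
special_mask = quote_dates.isin(special_index)

print(first_only, is_special, special_mask.tolist())
```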

Answer 3

As others have already pointed out, profile your code to find the slowest parts.
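For a first look at where the time goes, the standard library's cProfile module is one option. The sketch below profiles a stand-in function, `slow_differences`, which is hypothetical and only mimics the row-by-row work of the loop in the question.

```python
import cProfile
import io
import pstats

# Treasury maturities in years, copied from the question
treasury_maturities = [1/12, 2/12, 3/12, 6/12, 1, 2, 3, 5, 7, 10, 20, 30]


def slow_differences(times_to_maturity):
    """Row-by-row differences, standing in for the loop body to be profiled."""
    return [
        [abs(maturity - ttm) for maturity in treasury_maturities]
        for ttm in times_to_maturity
    ]


profiler = cProfile.Profile()
profiler.enable()
result = slow_differences([i / 1000 for i in range(10_000)])
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Running the real loop on a few thousand rows under the profiler is usually enough to see which calls dominate.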

Some possible speed-ups:

Consider using generators instead of lists wherever possible. Also, list.extend may be faster than list concatenation.

list_s = [abs(maturity - row.Time_to_Maturity) for maturity in
          treasury_maturities2]

can become

list_s = (abs(maturity - row.Time_to_Maturity) for maturity in 
                           treasury_maturities2)

and

list_s = list_s + [foo, bar, baz]

can become

list_s.extend([foo, bar, baz])  # extend() mutates the list in place and returns None
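Two details are easy to trip over when combining these suggestions: `list.extend` returns None, so its result should not be assigned back, and a generator has no `extend` method and cannot be concatenated with `+`. A minimal sketch, with invented sample values, using `itertools.chain` for the generator case:

```python
from itertools import chain

maturities = [1/12, 2/12]  # shortened list, for illustration only
ttm = 0.1                  # hypothetical time to maturity in years

# List version: extend() mutates the list in place and returns None
diffs = [abs(m - ttm) for m in maturities]
returned = diffs.extend([40, 40])

# Generator version: chain the lazy values with the extra items, then
# materialize once at the end
lazy = (abs(m - ttm) for m in maturities)
combined = list(chain(lazy, [40, 40]))

print(returned, diffs == combined)
```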

3 answers

Answer 1  2  2021-04-10 12:53:35

Answer 2  1  2021-03-22 16:06:40

Answer 3  0  2021-04-10 12:03:47
