Improving Pandas iteration performance

Question

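I have the following code, which takes the historical prices of a single asset, calculates a forecast, and works out how you would have fared if you had actually invested your money according to that forecast. In financial terms, it is a backtest.

The main problem is that it is very slow, and I am not sure what the right strategy for improving it is. I need to run it thousands of times, so I need an order-of-magnitude speed-up.

Where should I start looking?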

import numpy as np
import pandas as pd

class accountCurve():
    def __init__(self, forecasts, prices):

        self.curve = pd.DataFrame(columns=['Capital','Holding','Cash','Trade', 'Position'], dtype=float)
        forecasts.dropna(inplace=True)
        self.curve['Forecast'] = forecasts
        self.curve['Price'] = prices
        # Seed the first row: all capital in cash, no position yet.
        self.curve.loc[self.curve.index[0],['Capital', 'Holding', 'Cash', 'Trade', 'Position']] = [10000, 0, 10000, 0, 0]

        # Series.iteritems() was removed in pandas 2.0; items() is the equivalent.
        for date, forecast in forecasts.items():
            x=self.curve.loc[date]
            # Note: this shifts the whole frame on every iteration.
            previous = self.curve.shift(1).loc[date]
            if previous.isnull()['Cash']==False:
                x['Cash'] = previous['Cash'] - previous['Trade'] * x['Price']
                x['Position'] = previous['Position'] + previous['Trade']

            x['Holding'] = x['Position'] * x['Price']
            x['Capital'] = x['Cash'] + x['Holding']
            x['Trade'] = np.fix(x['Capital']/x['Price'] * x['Forecast']/20) - x['Position']

Edit:

The requested data sets:

Prices:

import quandl
corn = quandl.get('CHRIS/CME_C2')
prices = corn['Open']

Forecasts:

def ewmac(d):
    columns = pd.Series([2, 4, 8, 16, 32, 64])
    g = lambda x: d.ewm(span = x, min_periods = x*4).mean() - d.ewm(span = x*4, min_periods=x*4).mean()
    f = columns.apply(g).transpose()
    f = f*10/f.abs().mean()
    f.columns = columns
    return f.clip(-20,20)
forecasts=ewmac(prices)

Answer 1

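I suggest using a numpy array instead of the DataFrame inside the for loop. That usually gives a significant speed-up.

So the code could look like this: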

class accountCurve():
    def __init__(self, forecasts, prices):
        self.curve = pd.DataFrame(columns=['Capital','Holding','Cash','Trade', 'Position'], dtype=float)
        # forecasts.dropna(inplace=True)
        self.curve['Forecast'] = forecasts.dropna()
        self.curve['Price'] = prices
        # helper np.array:
        self.arr = np.array(self.curve)
        self.arr[0,:5] = [10000, 0, 10000, 0, 0]

        for i in range(1, self.arr.shape[0]):
            this = self.arr[i]
            prev = self.arr[i-1]
            cash = prev[2] - prev[3] * this[6]
            position = ...
            holding = ...
            capital = ...
            trade = ...
            this[:5] = [capital, holding, cash, trade, position]

        # back to data frame:
        self.curve[['Capital','Holding','Cash','Trade', 'Position']] = self.arr[:,:5]
        # or maybe this would be faster:
        # self.curve[:] = self.arr
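
The lines elided above can be filled in from the formulas in the question. The following is only a sketch under that assumption, not the answerer's own code: the function name account_curve and the start_capital parameter are made up for illustration, and it uses one plain array per column instead of a single 2-D array. Unlike the loop above, it also computes Trade for the first row, as the question's loop does.

```python
import numpy as np
import pandas as pd


def account_curve(forecast, prices, start_capital=10000.0):
    """Backtest one forecast column (hypothetical helper, not from the post)."""
    forecast = forecast.dropna()
    # Align prices to the forecast dates, as self.curve['Price'] = prices does.
    price = prices.reindex(forecast.index).to_numpy(dtype=float)
    fc = forecast.to_numpy(dtype=float)
    n = len(fc)

    capital = np.empty(n)
    holding = np.empty(n)
    cash = np.empty(n)
    trade = np.empty(n)
    position = np.empty(n)

    for i in range(n):
        if i == 0:
            # First row: everything in cash, no position.
            cash[i] = start_capital
            position[i] = 0.0
        else:
            # Yesterday's trade is paid for at today's price, as in the question.
            cash[i] = cash[i - 1] - trade[i - 1] * price[i]
            position[i] = position[i - 1] + trade[i - 1]
        holding[i] = position[i] * price[i]
        capital[i] = cash[i] + holding[i]
        trade[i] = np.fix(capital[i] / price[i] * fc[i] / 20) - position[i]

    return pd.DataFrame(
        {'Capital': capital, 'Holding': holding, 'Cash': cash,
         'Trade': trade, 'Position': position,
         'Forecast': fc, 'Price': price},
        index=forecast.index,
    )


# Tiny made-up example: two days, forecast 10, price rising from 100 to 110.
dates = pd.date_range("2016-01-04", periods=2)
curve = account_curve(pd.Series([10.0, 10.0], index=dates),
                      pd.Series([100.0, 110.0], index=dates))
print(curve)
```

The loop is still sequential, because each row depends on the previous one, but it only touches numpy scalars and never indexes into the DataFrame.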

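I don't quite understand if previous.isnull()['Cash']==False: . It looks as if previous['Cash'] is never null except for the first row, but you have already set the first row earlier.

Also, you might consider running forecasts.dropna(inplace=True) outside the class. If it is originally a DataFrame, you would run it once instead of repeating it for each column. (Do I understand correctly that you feed single columns of forecasts into the class?)

The next step I would suggest is to use a line profiler to see where your code spends most of its time, and then try to optimize those bottlenecks. If you use IPython, you can try running %prun or %lprun. For example: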
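
A minimal sketch of that idea, assuming the forecasts are held in a DataFrame with one column per span (the small table below is made up). One caveat: dropna on a DataFrame drops every date where any column is NaN, which is not the same as dropping NaN column by column, so whether it suits the backtest depends on how the warm-up periods should be treated.

```python
import numpy as np
import pandas as pd

# Hypothetical forecast table: one column per span, NaN during warm-up.
forecasts = pd.DataFrame(
    {2: [np.nan, 1.0, 2.0, 3.0], 4: [np.nan, np.nan, 5.0, 6.0]},
    index=pd.date_range("2016-01-01", periods=4),
)

# Row-wise dropna, done once: keeps only dates where every column has a value.
clean = forecasts.dropna()

# Each column can then be handed to the class without a further dropna call.
columns = {span: clean[span] for span in clean.columns}
print(clean)
```

Doing this outside the class also avoids the inplace=True call modifying the caller's object as a side effect of constructing an accountCurve.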

%lprun -f accountCurve.__init__  A = accountCurve(...)

This will produce statistics for each line of your __init__.
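
Note that %lprun comes from the separate line_profiler package and has to be loaded first with %load_ext line_profiler, whereas %prun is built into IPython. Outside IPython, function-level timings similar to what %prun reports can be collected with the standard library's cProfile. The sketch below assumes a stand-in function, slow_sum, in place of the real backtest.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical stand-in for the code being profiled."""
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(200_000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
report = buffer.getvalue()
print(report)
```

cProfile reports time per function rather than per line, so it is best for finding which calls dominate; a line profiler is then useful for looking inside the one function that matters.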

Solution 1
1 Accepted 2016-05-05 19:58:03