Speeding up a pandas operation

Question

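I'm running an operation on a pandas DataFrame with many rows, and it has become too slow. I'd like to know whether there is a way to optimize it. Suppose the DataFrame holds the following data: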

         date        X
2019/5/1 10:00:00    1
2019/5/1 11:00:00    3
2019/5/1 12:00:00    5 
2019/5/1 13:00:00    2
2019/5/1 14:00:00    4 
2019/5/2 11:00:00    3
2019/5/2 12:00:00    2
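For anyone who wants to reproduce this, the sample above can be built roughly as follows. This is only a sketch: the column names `date` and `X` are taken from the table header, and the explicit `format` string is an assumption about how the timestamps are written.

```python
import pandas as pd

# Sample rows copied from the table above.
df = pd.DataFrame({
    "date": pd.to_datetime(
        [
            "2019/5/1 10:00:00", "2019/5/1 11:00:00", "2019/5/1 12:00:00",
            "2019/5/1 13:00:00", "2019/5/1 14:00:00",
            "2019/5/2 11:00:00", "2019/5/2 12:00:00",
        ],
        format="%Y/%m/%d %H:%M:%S",  # assumed layout of the timestamps
    ),
    "X": [1, 3, 5, 2, 4, 3, 2],
})
print(df)
```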

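For a given row i, my code checks whether the value of X on row i-1 is greater than the value of X on row i+1, as long as the rows are from the same day. It creates a new column called offset whose value is -1 where that condition is true and 0 otherwise, and it also updates the date on those rows, moving it back by 1 hour. The code: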

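import datetime as dt
from itertools import islice

df['offset'] = 0
day = df.date[0].day  # day of the first row
# skip the first and last rows, which lack a neighbour on one side
for index, row in islice(df.iterrows(), 1, len(df.index)-1):
    if row.date.day == day:
        if df.X[index-1] > df.X[index+1] or row.date.hour == 23:
            df.loc[index, 'offset'] = -1
            df.loc[index, 'date'] = row.date - dt.timedelta(hours=1)
    else:
        day = row.date.day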

The desired output would look like this:

       date          X    offset
2019/5/1 10:00:00    1     0
2019/5/1 11:00:00    3     0
2019/5/1 11:00:00    5     -1
2019/5/1 12:00:00    2     -1
2019/5/1 14:00:00    4     0      <--- Note that the row after this one is from a new day, so no comparison is made
2019/5/2 11:00:00    3     0
2019/5/2 11:00:00    2     -1

*Note the differences in the times.

For a file with about 15K rows and 4 columns, this operation takes about 10 minutes. How can I speed it up?

Thanks

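Edit: I forgot to mention that the rows must be from the same day; otherwise no comparison can be made. Also, if the row is the last one in the file or the last one of the day (23:00:00), the offset is always -1, because there is nothing after it to compare with.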

Answer 1

Here's one way to do it:

import pandas as pd

# date column to datetime format
df.date = pd.to_datetime(df.date)
# compare with shifted version, 2 samples away
s = df.X.gt(df.X.shift(-2)).shift().fillna(False)
# turn series of booleans to 0s and -1s
df['offset'] = s.mul(-1)
# last sample in offset to -1
df.loc[df.shape[0]-1, 'offset'] -= 1
# subtract 1h using the same offset column
df.date += pd.to_timedelta(df.offset, unit='h')

       date            X    offset
0 2019-05-01 10:00:00  1       0
1 2019-05-01 11:00:00  3       0
2 2019-05-01 11:00:00  5      -1
3 2019-05-01 12:00:00  2      -1
4 2019-05-01 14:00:00  3       0
5 2019-05-02 11:00:00  5       0
6 2019-05-02 11:00:00  4      -1
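The snippet above compares neighbours without looking at the calendar day. If the same-day rule from the question's edit matters, one possible variation is to do the shifts per day with `groupby`. This is only a sketch under some assumptions: `add_offset` is a hypothetical helper name, the frame is non-empty, has a unique index, and is already sorted by `date`, and the 23:00 and last-row rules are read from the question's edit rather than from this answer.

```python
import pandas as pd

def add_offset(df):
    """Hypothetical helper: same-day neighbour comparison without a Python loop."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    # Group by calendar day so the shifts never reach into another day.
    x_by_day = out.groupby(out["date"].dt.normalize())["X"]
    prev_x = x_by_day.shift(1)    # NaN on the first row of each day
    next_x = x_by_day.shift(-1)   # NaN on the last row of each day
    mask = prev_x > next_x        # comparisons against NaN are False
    mask |= out["date"].dt.hour == 23   # assumed rule: 23:00 rows always get -1
    mask.iloc[-1] = True                # assumed rule: so does the last row of the file
    out["offset"] = -mask.astype(int)
    out["date"] += pd.to_timedelta(out["offset"], unit="h")
    return out

# Sample data from the question.
sample = pd.DataFrame({
    "date": [
        "2019-05-01 10:00", "2019-05-01 11:00", "2019-05-01 12:00",
        "2019-05-01 13:00", "2019-05-01 14:00",
        "2019-05-02 11:00", "2019-05-02 12:00",
    ],
    "X": [1, 3, 5, 2, 4, 3, 2],
})
result = add_offset(sample)
print(result)
```

On the question's sample this should leave the 14:00 row of the first day and the first row of the second day at offset 0, since each lacks a same-day neighbour on one side.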

Answer 2

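We build a mask for the rows where the X value of the previous row is greater than the X value of the next row.
We create the offset column conditionally: where the mask is True we fill in -1, else 0.
We do the same for the date column: where the mask is True, we subtract 1 hour.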

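import numpy as np
import pandas as pd

# True where the previous row's X is greater than the next row's X
m = df['X'].shift() > df['X'].shift(-1)

df['offset'] = np.where(m, -1, 0)
df['date'] = np.where(m, df['date'] - pd.Timedelta(1, 'hour'), df['date'])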

                 date  X  offset
0 2019-05-01 10:00:00  1       0
1 2019-05-01 11:00:00  3       0
2 2019-05-01 11:00:00  5      -1
3 2019-05-01 12:00:00  2      -1
4 2019-05-01 14:00:00  3       0
5 2019-05-02 11:00:00  5       0
6 2019-05-02 12:00:00  4       0

Note that the last row is unchanged, because there is no row below it to compare with.
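To see why the first and last rows never match, it may help to look at what the two shifts produce on a small Series. This is just an illustration with made-up values (the first five X values from the question), not part of the solution itself.

```python
import pandas as pd

x = pd.Series([1, 3, 5, 2, 4])
prev_x = x.shift()     # value from the row above; NaN on the first row
next_x = x.shift(-1)   # value from the row below; NaN on the last row
m = prev_x > next_x    # any comparison involving NaN evaluates to False
print(m.tolist())
```

Because `shift` pads with NaN and comparisons against NaN are False, the mask is False at both ends without any special-casing.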

Answer 1
2, accepted, 2019-07-04 13:13:44

Answer 2
1, 2019-07-04 13:13:52
