Python Pandas drop_duplicates with a rolling window

Question

I have a pandas DataFrame (about 500,000 rows) with a datetime index and 3 columns (a, b, c):

                           a       b        c
2016-03-30 09:59:36.619    0       55       0
2016-03-30 09:59:41.979    0       20       0
2016-03-30 09:59:41.986    0       1        0
2016-03-30 09:59:45.853    0       1        3
2016-03-30 09:59:51.265    0       20       9
2016-03-30 10:00:03.273    0       55       26
2016-03-30 10:00:05.658    0       55       28
2016-03-30 10:00:17.416    0       156      0
2016-03-30 10:00:17.928    0       122      1073
2016-03-30 10:00:21.933    0       122      0
2016-03-30 10:00:31.937    0       122      10
2016-03-30 10:00:40.941    0       122      0
2016-03-30 10:00:51.147    10      2        0
2016-03-30 10:01:27.060    0       156      0

I want to search within a 10-minute rolling window and drop duplicates from one of the columns (column b), to get something like this:

                           a       b        c
2016-03-30 09:59:36.619    0       55       0
2016-03-30 09:59:41.979    0       20       0
2016-03-30 09:59:41.986    0       1        0
2016-03-30 09:59:51.265    0       20       9
2016-03-30 10:00:03.273    0       55       26
2016-03-30 10:00:17.416    0       156      0
2016-03-30 10:00:17.928    0       122      1073
2016-03-30 10:00:51.147    10      2        0
2016-03-30 10:01:27.060    0       156      0

I tried using drop_duplicates together with rolling_apply, but the two functions don't work well together, i.e.:

pd.rolling_apply(df, '10T', lambda x:x.drop_duplicates(subset='b'))

raises an error, because the function must return a single value, not a DataFrame. So this is what I have so far:

import datetime as dt
import numpy
import pandas as pd

windows = []
for ind in range(len(df)):
    # Window covers the 10 minutes after the current row's timestamp
    t0 = df.index[ind]
    t1 = df.index[ind]+dt.timedelta(minutes=10)

    windows.append(df[numpy.logical_and(t0<df.index,\
    df.index<=t1)].drop_duplicates(subset='b'))

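Here I end up with a list of 10-minute DataFrames with duplicates removed, but there are many overlapping values as the window rolls on to the next 10-minute segment. To keep only the unique values, I tried something like: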

new_df = []
for ind in range(len(windows)-1):
    new_df.append(pd.unique(pd.concat([pd.Series(windows[ind].index),\
    pd.Series(windows[ind+1].index)])))

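But this doesn't work, and it is already getting messy. Does anyone have a clever idea for solving this as efficiently as possible?

Thanks in advance.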

Answer 1

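I hope this is useful. I rolled a function that checks whether the last value in a 10-minute window is a duplicate of an earlier element in that window. The result can be used with boolean indexing.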

import pandas as pd

# Simple example
dates = pd.date_range('2017-01-01', periods = 5, freq = '4min')
col1 = [1, 2, 1, 3, 2]
df = pd.DataFrame({'col1':col1}, index = dates)

# Make function that checks if last element is a duplicate
# (with raw=True below, `a` is a NumPy array of the window's values)
def last_is_duplicate(a):
    if len(a) > 1:
        return a[-1] in a[:len(a)-1]
    else: 
        return False    

# Roll over 10 minute window to find duplicates of recent elements
# ('10min' is the same offset as '10T'; raw=True passes arrays, not Series)
dup = df.col1.rolling('10min').apply(last_is_duplicate, raw=True).astype('bool')

# Keep only those rows for which col1 is not a recent duplicate
df[~dup]
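
As a rough sketch, here is how the same check might be applied to column b of the sample data from the question. The frame is rebuilt by hand, the names sample and deduped are only illustrative, and raw=True plus the float return value are assumptions made so that it runs on newer pandas versions:

```python
import pandas as pd

# Sample data copied from the question
times = pd.to_datetime([
    "2016-03-30 09:59:36.619", "2016-03-30 09:59:41.979",
    "2016-03-30 09:59:41.986", "2016-03-30 09:59:45.853",
    "2016-03-30 09:59:51.265", "2016-03-30 10:00:03.273",
    "2016-03-30 10:00:05.658", "2016-03-30 10:00:17.416",
    "2016-03-30 10:00:17.928", "2016-03-30 10:00:21.933",
    "2016-03-30 10:00:31.937", "2016-03-30 10:00:40.941",
    "2016-03-30 10:00:51.147", "2016-03-30 10:01:27.060",
])
sample = pd.DataFrame(
    {
        "a": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 0],
        "b": [55, 20, 1, 1, 20, 55, 55, 156, 122, 122, 122, 122, 2, 156],
        "c": [0, 0, 0, 3, 9, 26, 28, 0, 1073, 0, 10, 0, 0, 0],
    },
    index=times,
)


def last_is_duplicate(window):
    # `window` is a NumPy array (raw=True); its last element is the current row.
    # Return a float because rolling results are stored as floating point.
    return float(window[-1] in window[:-1])


# Flag rows whose b value already appeared within the preceding 10 minutes
dup = sample["b"].rolling("10min").apply(last_is_duplicate, raw=True).astype(bool)

# Keep only the rows that are not flagged
deduped = sample[~dup]
print(deduped)
```

All of the sample rows fall within the same ten minutes, so here the mask keeps only the first occurrence of each b value. On longer data, a value would be kept again once no earlier occurrence lies within the preceding ten minutes.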

Answer 1
0 2018-03-23 19:56:15
