Remove "overlapping" dates from a pandas DataFrame

Question

I have a pandas DataFrame that looks like the following:

ID  date       close
1   09/15/07   123.45
2   06/01/08   130.13
3   10/25/08   132.01
4   05/13/09   118.34
5   11/07/09   145.99
6   11/15/09   146.73
7   07/03/11   171.10

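I want to remove any rows that overlap.

An overlapping row is defined as any row within X days of another row. For example, if X = 365, the result should be: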

ID  date       close
1   09/15/07   123.45
3   10/25/08   132.01
5   11/07/09   145.99
7   07/03/11   171.10

If X = 50, the result should be:

ID  date       close
1   09/15/07   123.45
2   06/01/08   130.13
3   10/25/08   132.01
4   05/13/09   118.34
5   11/07/09   145.99
7   07/03/11   171.10

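I've looked at a few questions here but haven't found the right approach. For example, "Pandas check for overlapping dates in multiple rows" and "Fastest way to eliminate specific dates from pandas dataframe" are similar, but they don't quite get me what I need.

I currently have the following ugly code. It works for small values of X, but when X gets larger (e.g., X = 365), it removes all dates except the original date.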

filter_dates = []
for index, row in df.iterrows():
    if observation_time == 'D':
        for i in range(1, observation_period):
            filter_dates.append((index.date() + timedelta(days=i)))
df = df[~df.index.isin(filter_dates)]

Any help or pointers would be appreciated!

Clarification:

The solution needs to look at every row, not just the first row.

Answer 1

You can add a new column to filter the results:

df['filter'] = df['date'] - df['date'][0]
df['filter'] = df['filter'].apply(lambda x: x.days)

Then filter by 365 using this:

df[df['filter']%365==0]
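
As a minimal sketch of the first step, the day-offset column can also be computed in one vectorized expression with the `.dt` accessor instead of `apply`. The DataFrame is rebuilt here from the question's sample data; the column name `filter` follows the answer, and everything else is an assumption made for illustration.

```python
import pandas as pd

# Sample data from the question (dates rewritten in ISO format).
df = pd.DataFrame({
    "date": pd.to_datetime(["2007-09-15", "2008-06-01", "2008-10-25",
                            "2009-05-13", "2009-11-07", "2009-11-15",
                            "2011-07-03"]),
    "close": [123.45, 130.13, 132.01, 118.34, 145.99, 146.73, 171.10],
})

# Whole days elapsed since the first row, without a Python-level lambda.
df["filter"] = (df["date"] - df["date"].iloc[0]).dt.days
print(df["filter"].tolist())
```

The offsets are measured from the first row only, so they describe each row's distance from the start of the data rather than from its predecessor.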

Answer 2

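I found yet another solution (check the edit history if you want to see the old ones). This is the best solution I've come up with. It still keeps the first sequential record, but it can be tweaked to keep the record that comes first chronologically (provided at the end).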

days = 365  # X, the minimum gap in days (example value)
target = df.iloc[0]  # Get the first item in the dataframe
day_diff = abs(target.date - df.date)  # Get the differences of all the other dates from the first item
day_diff = day_diff.reset_index().sort_values(['date', 'index'])  # Reset the index and then sort by date and original index so we can maintain the order of the dates
day_diff.columns = ['old_index', 'date']  # Rename old index column because of name clash
good_ids = day_diff.groupby(day_diff.date.dt.days // days).first().old_index.values  # Group the dates by range and then get the first item from each group
df.iloc[good_ids]
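
To make the bucketing idea concrete, here is a rough sketch on the question's sample data. It is not the code above: it swaps `groupby(...).first()` for `Series.duplicated`, which keeps the first row of each bucket in row order. The function name `first_per_bucket` is hypothetical, and the input is assumed to be sorted by date.

```python
import pandas as pd

def first_per_bucket(df, days, date_col="date"):
    """Keep the first row of each `days`-wide bucket, counted from the first row."""
    offsets = (df[date_col] - df[date_col].iloc[0]).abs().dt.days
    buckets = offsets // days          # e.g. 0..364 -> 0, 365..729 -> 1, ...
    return df.loc[~buckets.duplicated()]  # first occurrence of each bucket

df = pd.DataFrame({
    "date": pd.to_datetime(["2007-09-15", "2008-06-01", "2008-10-25",
                            "2009-05-13", "2009-11-07", "2009-11-15",
                            "2011-07-03"]),
    "close": [123.45, 130.13, 132.01, 118.34, 145.99, 146.73, 171.10],
})

print(first_per_bucket(df, 365).index.tolist())
print(first_per_bucket(df, 50).index.tolist())
```

Because buckets are fixed windows measured from the first row, two kept rows that fall on either side of a bucket boundary can still be fewer than `days` apart; on the sample data the result happens to match the expected output for both X = 365 and X = 50.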

I ran some tests again, this time comparing it with QuickBeam's method. The DataFrames used were a randomly ordered one with 600,000 rows and one ordered by date with 73,000 rows:

My method:

DataFrame             days           time
600k/random            2             1 loop, best of 3: 5.03 s per loop
ordered                2             1 loop, best of 3: 564 ms per loop
600k/random            50            1 loop, best of 3: 5.17 s per loop
ordered                50            1 loop, best of 3: 583 ms per loop
600k/random            365           1 loop, best of 3: 5.16 s per loop
ordered                365           1 loop, best of 3: 577 ms per loop

QuickBeam's method:

DataFrame             days           time
600k/random            2             1 loop, best of 3: 52.8 s per loop
ordered                2             1 loop, best of 3: 4.89 s per loop
600k/random            50            1 loop, best of 3: 53 s per loop
ordered                50            1 loop, best of 3: 4.53 s per loop
600k/random            365           1 loop, best of 3: 53.7 s per loop
ordered                365           1 loop, best of 3: 4.49 s per loop

So, yeah, maybe I'm a little competitive...

The exact functions used for testing:

def my_filter(df, days):
    target = df.iloc[0]
    day_diff = abs(target.date - df.date)
    day_diff = day_diff.reset_index().sort_values(['date', 'index'])
    day_diff.columns = ['old_index', 'date']
    good_ids = day_diff.groupby(day_diff.date.dt.days // days).first().old_index.values
    return df.iloc[good_ids]

def quickbeam_filter(df, days):
    filter_ids = [0]
    last_day = df.loc[0, "date"]
    for index, row in df[1:].iterrows():
        if (row["date"] - last_day).days > days:
            filter_ids.append(index)
            last_day = row["date"]
    return df.loc[filter_ids,:]

If you want to get all the dates that start within a certain range, which makes more sense to me, you can use this version:

def my_filter(df, days):
    target = df.iloc[0]
    day_diff = abs(target.date - df.date)
    day_diff = day_diff.sort_values('date')
    good_ids = day_diff.groupby(day_diff.date.dt.days // days).first().index.values
    return df.iloc[good_ids]

Answer 3

My approach is to first compute a distance matrix:

distM = np.array([[np.timedelta64(abs(x-y),'D').astype(int) for y in df.date] for x in df.date])

For your example, it looks like this:

[[   0  260  406  606  784  792 1387]
 [ 260    0  146  346  524  532 1127]
 [ 406  146    0  200  378  386  981]
 [ 606  346  200    0  178  186  781]
 [ 784  524  378  178    0    8  603]
 [ 792  532  386  186    8    0  595]
 [1387 1127  981  781  603  595    0]]
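
The nested list comprehension can also be written with NumPy broadcasting. The following is only a sketch under the assumption that the dates are available as a `datetime64` array; it rebuilds the sample dates from the question and is not part of the original answer.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2007-09-15", "2008-06-01", "2008-10-25",
                        "2009-05-13", "2009-11-07", "2009-11-15",
                        "2011-07-03"]).to_numpy()

# dates[:, None] - dates[None, :] broadcasts to every pairwise difference.
diffs = np.abs(dates[:, None] - dates[None, :])

# Convert the timedeltas to whole days, then to plain integers.
distM = diffs.astype("timedelta64[D]").astype(int)
print(distM)
```

This still builds the full n-by-n matrix, so memory grows quadratically with the number of rows.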

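Since we are iterating downward, we only care about the distances in the upper triangle, so we modify the array by keeping the upper triangle and setting every entry of at most 365 to a large number M; in this case I will use 10,000: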

distM[np.triu(distM) <= 365] = 10000

Then take the argmin across the new distance matrix to decide which rows of the dataframe to keep.

remove = np.unique(np.argmin(distM,axis=1))
df = df.iloc[remove,:]

All together...

distM = np.array([[np.timedelta64(abs(x-y),'D').astype(int) for y in df.date] for x in df.date])

distM[np.triu(distM)<= 365] = 10000

remove = np.unique(np.argmin(distM,axis=1))

df = df.iloc[remove,:]

Answer 4

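I just used an elementary approach (essentially a tweaked version of the OP's approach), with no fancy numpy or pandas ops, but with linear instead of quadratic complexity (when compared with the distance matrix approach).
However (like Cory Madden), I assume that the data is sorted by the date column. I hope it is correct:

Dataframe -> I'm using the pandas index here: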

import pandas as pd
df = pd.DataFrame({'date': ["2007-09-15","2008-06-01","2008-10-25",
                            "2009-05-13","2009-11-07", "2009-11-15", "2011-07-03"],
                   'close':[123.45, 130.13, 132.01, 118.34, 
                            145.99, 146.73, 171.10]})
df["date"]=pd.to_datetime(df["date"])

The following code block can easily be wrapped in a function and computes the correct dataframe indexes for X = 365:

X = 365
filter_ids = [0]
last_day = df.loc[0, "date"]
for index, row in df[1:].iterrows():
    if (row["date"] - last_day).days > X:
        filter_ids.append(index)
        last_day = row["date"]

Result:

print(df.loc[filter_ids,:])
    close       date
0  123.45 2007-09-15
2  132.01 2008-10-25
4  145.99 2009-11-07
6  171.10 2011-07-03

Note that the indexes are shifted by one relative to the question's IDs because indexing starts from zero.
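
Wrapped in a function, the loop above might look like the following sketch. The name `keep_spaced_rows` and its parameters are hypothetical, and like the answer it assumes the frame is already sorted by date.

```python
import pandas as pd

def keep_spaced_rows(df, min_gap_days, date_col="date"):
    """Return rows dated more than min_gap_days after the previously kept row.

    Assumes df is sorted by date_col; each row is visited exactly once.
    """
    kept_labels = []
    last_day = None
    for label, day in df[date_col].items():
        if last_day is None or (day - last_day).days > min_gap_days:
            kept_labels.append(label)
            last_day = day  # later rows are measured against this one
    return df.loc[kept_labels]

df = pd.DataFrame({
    "date": pd.to_datetime(["2007-09-15", "2008-06-01", "2008-10-25",
                            "2009-05-13", "2009-11-07", "2009-11-15",
                            "2011-07-03"]),
    "close": [123.45, 130.13, 132.01, 118.34, 145.99, 146.73, 171.10],
})

print(keep_spaced_rows(df, 365).index.tolist())
print(keep_spaced_rows(df, 50).index.tolist())
```

Iterating over the single date column with `Series.items()` avoids building a full row object per iteration, which is the main overhead of `iterrows()`.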

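I just wanted to comment on linear versus quadratic complexity. My solution has linear time complexity, seeing each row of the dataframe only once. Cory Madden's solution has quadratic complexity: in each iteration, every row of the dataframe is accessed. However, if X (the day difference) is large, we might discard a large part of the dataset and perform only a few iterations.

To this end, one might want to consider the following worst-case dataset for X=2: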

df = pd.DataFrame({'date':pd.date_range(start='01.01.1900', end='01.01.2100', freq='D')})

On my machine, the following code yields:

%%timeit
X = 2
filter_ids = [0]
last_day = df.loc[0, "date"]
for index, row in df[1:].iterrows():
    if (row["date"] -last_day).days > X:
        filter_ids.append(index)
        last_day = row["date"]
1 loop, best of 3: 7.06 s per loop

and

day_diffs = abs(df.iloc[0].date - df.date).dt.days
i = 0
days = 2
idx = day_diffs.index[i]
good_ids = {idx}
while True:
    try:
        current_row = day_diffs[idx] 
        day_diffs = day_diffs.iloc[1:]
        records_not_overlapping = (day_diffs - current_row) > days         
        idx = records_not_overlapping[records_not_overlapping == True].index[0] 
        good_ids.add(idx)
    except IndexError:
        break
1 loop, best of 3: 3min 16s per loop

Answer 5

For those looking for the answer that worked for me, here it is (based on @Quickbeam2k1's answer):

X = 50 #or whatever value you want
remove_ids = []
last_day = df.loc[0, "date"]
for index, row in df[1:].iterrows():
    if np.busday_count(last_day, df.loc[index, "date"]) < X: 
        remove_ids.append(index)
    else:
        last_day = df.loc[index, "date"]
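
For reference, a self-contained sketch of the same business-day variant on the question's sample data follows. Two things here are assumptions rather than part of the answer: the dates are converted explicitly to day-precision `datetime64[D]` before calling `np.busday_count` (whether the snippet above needs this depends on library versions the answer does not state), and the collected rows are dropped with `DataFrame.drop`, a step the answer leaves implicit.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2007-09-15", "2008-06-01", "2008-10-25",
                            "2009-05-13", "2009-11-07", "2009-11-15",
                            "2011-07-03"]),
    "close": [123.45, 130.13, 132.01, 118.34, 145.99, 146.73, 171.10],
})

X = 50  # minimum gap, counted in business days (Mon-Fri, no holidays)

# Day-precision copies of the dates, the form np.busday_count works with.
days = df["date"].to_numpy().astype("datetime64[D]")

remove_ids = []
last_day = days[0]
for index, day in zip(df.index[1:], days[1:]):
    if np.busday_count(last_day, day) < X:
        remove_ids.append(index)   # too close to the last kept row
    else:
        last_day = day             # this row is kept

result = df.drop(index=remove_ids)
print(result)
```

With the default weekmask, `np.busday_count` counts Monday through Friday and includes the start date but not the end date, so X = 50 here means roughly ten calendar weeks.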

5 answers

Answer 1
3 2017-08-10 15:13:54

Answer 2
2 2017-08-10 14:31:46

Answer 3
1 2017-08-10 16:34:05

Answer 4
0 accepted 2017-08-10 18:01:35

Answer 5
0 2017-08-13 14:35:19
