
Pandas: select DF rows based on another DF

I have two dataframes (they can get very long, each with hundreds or thousands of rows). One of them, called df1, contains a timeseries in 10-minute intervals. For example:

date          value
2016-11-24 00:00:00    1759.199951
2016-11-24 00:10:00     992.400024
2016-11-24 00:20:00    1404.800049
2016-11-24 00:30:00      45.799999
2016-11-24 00:40:00      24.299999
2016-11-24 00:50:00     159.899994
2016-11-24 01:00:00      82.499999
2016-11-24 01:10:00      37.400003
2016-11-24 01:20:00     159.899994
....

And the other one, df2, contains datetime intervals:

start_date             end_date
0    2016-11-23 23:55:32  2016-11-24 00:14:03
1    2016-11-24 01:03:18  2016-11-24 01:07:12
2    2016-11-24 01:11:32  2016-11-24 02:00:00 
...

I need to select all the rows in df1 that "fall" into one of the intervals in df2.

With these examples, the resulting dataframe should be:

date          value
2016-11-24 00:00:00    1759.199951   # Fits in row 0 of df2
2016-11-24 00:10:00     992.400024   # Fits in row 0 of df2
2016-11-24 01:00:00      82.499999   # Fits in row 1 of df2
2016-11-24 01:10:00      37.400003   # Fits in row 2 of df2
2016-11-24 01:20:00     159.899994   # Fits in row 2 of df2
....
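For reference, here is a minimal snippet (not part of the original question) that rebuilds the two example frames above, assuming the date columns are meant to be datetime64:

import pandas as pd

df1 = pd.DataFrame({
    'date': pd.date_range('2016-11-24 00:00:00', freq='10T', periods=9),
    'value': [1759.199951, 992.400024, 1404.800049, 45.799999, 24.299999,
              159.899994, 82.499999, 37.400003, 159.899994],
})

df2 = pd.DataFrame({
    'start_date': pd.to_datetime(['2016-11-23 23:55:32', '2016-11-24 01:03:18', '2016-11-24 01:11:32']),
    'end_date':   pd.to_datetime(['2016-11-24 00:14:03', '2016-11-24 01:07:12', '2016-11-24 02:00:00']),
})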

Using np.searchsorted

Here's a variation based on np.searchsorted that seems to be an order of magnitude faster than using intervaltree or merge, assuming my larger sample data is correct.

# Ensure the df2 is sorted (skip if it's already known to be).
df2 = df2.sort_values(by=['start_date', 'end_date'])

# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)

# Perform the searchsorted and get the corresponding df2 values for both endpoints of df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)

# Build the conditions that indicate an overlap (any True condition indicates an overlap).
cond = [
    df1['date'].values <= s1['end_date'].values,
    df1['date_end'].values <= s2['end_date'].values,
    s1.index.values != s2.index.values
    ]

# Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
df1 = df1[np.any(cond, axis=0)].drop('date_end', axis=1)

This may need to be modified if the intervals in df2 are nested or overlapping; I haven't fully thought it through in that scenario, but it may still work.
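One possible hedge (not part of the original answer) if df2 can contain nested or overlapping intervals is to coalesce them into disjoint intervals first, which restores the assumption the searchsorted logic relies on. A rough sketch:

def merge_intervals(df2):
    # Coalesce overlapping or nested [start_date, end_date] rows into disjoint intervals.
    df2 = df2.sort_values('start_date').reset_index(drop=True)
    merged = []
    for start, end in zip(df2['start_date'], df2['end_date']):
        if merged and start <= merged[-1][1]:
            # Overlaps (or is nested in) the previous interval: extend its end if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return pd.DataFrame(merged, columns=['start_date', 'end_date'])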

Using an Interval Tree

Not a pure Pandas solution, but you may want to consider building an Interval Tree from df2 and querying it against the intervals from df1 to find the ones that overlap.

The intervaltree package on PyPI seems to have good performance and easy-to-use syntax.

from intervaltree import IntervalTree

# Build the Interval Tree from df2.
tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])

# Build the 10 minutes spans from df1.
dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)

# Query the Interval Tree to filter df1.
df1 = df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]

I converted the dates to their integer equivalents for performance reasons. I suspect the intervaltree package wasn't built with pd.Timestamp in mind, so there are probably some intermediate conversion steps that slow things down a bit.

Also, note that intervals in the intervaltree package do not include the end point, although the start point is included. That's why I have the + [0, 1] when creating tree; I'm padding the end point by a nanosecond to make sure the real end point is actually included. It's also the reason why I can simply add pd.offsets.Minute(10) to get the interval end when querying the tree, instead of adding only 9m 59s.
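A tiny check of that end-point behavior (this assumes the intervaltree package's documented semantics: stored intervals are closed on the left and open on the right):

from intervaltree import IntervalTree

unpadded = IntervalTree.from_tuples([(0, 10)])
padded = IntervalTree.from_tuples([(0, 10 + 1)])   # end padded by one unit, like + [0, 1] above

print(unpadded.overlaps(10, 20))  # False: the stored [0, 10) excludes its own end point
print(padded.overlaps(10, 20))    # True: the padding makes the real end point 10 count as inside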

The resulting output from either method:

                 date        value
0 2016-11-24 00:00:00  1759.199951
1 2016-11-24 00:10:00   992.400024
6 2016-11-24 01:00:00    82.499999
7 2016-11-24 01:10:00    37.400003
8 2016-11-24 01:20:00   159.899994

Timings

Using the following setup to produce larger sample data:

# Sample df1.
n1 = 55000
df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10T', periods=n1), 'value': np.random.random(n1)})

# Sample df2.
n2 = 500
df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18H22T', periods=n2)})

# Randomly shift the start and end dates of the df2 intervals.
shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
df2['start_date'] += shift_start
df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2

Which produces the following for df1 and df2:

df1
                  date     value
0     2016-11-24 00:00:00  0.444939
1     2016-11-24 00:10:00  0.407554
2     2016-11-24 00:20:00  0.460148
3     2016-11-24 00:30:00  0.465239
4     2016-11-24 00:40:00  0.462691
...
54995 2017-12-10 21:50:00  0.754123
54996 2017-12-10 22:00:00  0.401820
54997 2017-12-10 22:10:00  0.146284
54998 2017-12-10 22:20:00  0.394759
54999 2017-12-10 22:30:00  0.907233

df2
              start_date            end_date
0   2016-11-24 00:00:19 2016-11-24 00:41:24
1   2016-11-24 18:22:44 2016-11-24 18:36:44
2   2016-11-25 12:44:44 2016-11-25 13:03:13
3   2016-11-26 07:07:05 2016-11-26 07:49:29
4   2016-11-27 01:29:31 2016-11-27 01:34:32
...
495 2017-12-07 21:36:04 2017-12-07 22:14:29
496 2017-12-08 15:58:14 2017-12-08 16:10:35
497 2017-12-09 10:20:21 2017-12-09 10:26:40
498 2017-12-10 04:42:41 2017-12-10 05:22:47
499 2017-12-10 23:04:42 2017-12-10 23:44:53

And timed using the following functions:

def root_searchsorted(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)

    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)

    # Build the conditions that indicate an overlap (any True condition indicates an overlap).
    cond = [
        df1['date'].values <= s1['end_date'].values,
        df1['date_end'].values <= s2['end_date'].values,
        s1.index.values != s2.index.values
        ]

    # Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
    return df1[np.any(cond, axis=0)].drop('date_end', axis=1)

def root_intervaltree(df1, df2):
    # Build the Interval Tree.
    tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])

    # Build the 10 minutes spans from df1.
    dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)

    # Query the Interval Tree to filter the DataFrame.
    return df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]

def ptrj(df1, df2):
    # The smallest amount of time - handy when using open intervals:
    epsilon = pd.Timedelta(1, 'ns')

    # Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
    sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
    edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)

    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values

    return df1[mask]

def parfait(df1, df2):
    df1['key'] = 1
    df2['key'] = 1
    df2['row'] = df2.index.values

    # CROSS JOIN
    df3 = pd.merge(df1, df2, on=['key'])

    # DF FILTERING
    return df3[df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True) | df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True)].set_index('date')[['value', 'row']]

def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)

    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)

    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals.
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s1.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)

    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)

I got the following timings:

%timeit root_searchsorted(df1.copy(), df2.copy())
100 loops best of 3: 9.55 ms per loop

%timeit root_searchsorted_modified(df1.copy(), df2.copy())
100 loops best of 3: 13.5 ms per loop

%timeit ptrj(df1.copy(), df2.copy())
100 loops best of 3: 18.5 ms per loop

%timeit root_intervaltree(df1.copy(), df2.copy())
1 loop best of 3: 4.02 s per loop

%timeit parfait(df1.copy(), df2.copy())
1 loop best of 3: 8.96 s per loop

This solution (which I believe works) uses pandas.Series.asof. Under the hood it's some version of searchsorted, and its speed ends up comparable with @root's function.

I assume that all date columns are in the pandas datetime format, sorted, and that the df2 intervals do not overlap.

The code is pretty short but somewhat intricate (explanation below).

# The smallest amount of time - handy when using open intervals: 
epsilon = pd.Timedelta(1, 'ns')
# Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)

# The main function (see explanation below):
def get_it(df1):
    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values
    return df1[mask]

The advantage of this approach is twofold: sdate and edate are evaluated only once, and the main function can process df1 in chunks if df1 is very large.
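For example, a hypothetical way to process a very large df1 chunk by chunk with the same get_it function (chunk_size is an arbitrary choice):

import numpy as np

chunk_size = 100000
chunks = (chunk for _, chunk in df1.groupby(np.arange(len(df1)) // chunk_size))
result = pd.concat(get_it(chunk) for chunk in chunks)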

Explanation

pandas.Series.asof returns the last valid row for a given index. It can take an array as input and is quite fast.
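A quick toy illustration (not from the question's data) of the asof behavior the argument below relies on:

import pandas as pd

s = pd.Series(range(3), index=pd.to_datetime(['2016-11-24 00:00', '2016-11-24 01:00', '2016-11-24 02:00']))
print(s.asof(pd.Timestamp('2016-11-24 01:30')))   # 1 -> value of the last row whose index <= the lookup key
print(s.asof(pd.Timestamp('2016-11-23 23:00')))   # nan -> no row at or before this timestamp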

For the sake of this explanation, let s[j] = sdate.index[j] be the j-th date in sdate and let x be some arbitrary date (timestamp). We always have s[sdate.asof(x)] <= x (this is exactly how asof works), and it's not hard to show that:

  1. j <= sdate.asof(x) if and only if s[j] <= x
  2. sdate.asof(x) < j if and only if x < s[j]

The same holds for edate. Unfortunately, we can't have the same kind of inequality (weak or strict) in both 1 and 2.

Two intervals [a, b) and [x, y] intersect iff x < b and a <= y. (We may think of a and b as coming from sdate.index and edate.index; the interval [a, b) is chosen closed-open because of properties 1 and 2.) In our case, x is a date from df1, y = x + 10min - epsilon, a = s[j], b = e[j] (note that epsilon has been added to edate), where j is some number.

So, finally, the condition equivalent to "[a, b) and [x, y] intersect" is "sdate.asof(x) < j and j <= edate.asof(y) for some number j". And that roughly boils down to l < r in the function get_it (modulo some technicalities).

This isn't straightforward, but you can do the following:

First grab the relevant date columns from the two dataframes and concatenate them together so that one column holds all of the dates and the other two columns represent the indexes from df2. (Note that df2 gets a multiindex after stacking.)

dfm = pd.concat((df1['date'],df2.stack().reset_index())).sort_values(0)

print(dfm)

                    0  level_0     level_1
0 2016-11-23 23:55:32      0.0  start_date
0 2016-11-24 00:00:00      NaN         NaN
1 2016-11-24 00:10:00      NaN         NaN
1 2016-11-24 00:14:03      0.0    end_date
2 2016-11-24 00:20:00      NaN         NaN
3 2016-11-24 00:30:00      NaN         NaN
4 2016-11-24 00:40:00      NaN         NaN
5 2016-11-24 00:50:00      NaN         NaN
6 2016-11-24 01:00:00      NaN         NaN
2 2016-11-24 01:03:18      1.0  start_date
3 2016-11-24 01:07:12      1.0    end_date
7 2016-11-24 01:10:00      NaN         NaN
4 2016-11-24 01:11:32      2.0  start_date
8 2016-11-24 01:20:00      NaN         NaN
5 2016-11-24 02:00:00      2.0    end_date

You can see that the values from df1 have NaN in the right two columns, and since we have sorted the dates, these rows fall in between the start_date and end_date rows (from df2).

In order to indicate that the rows from df1 fall between the rows from df2, we can interpolate the level_0 column, which gives us:

dfm['level_0'] = dfm['level_0'].interpolate()

                    0   level_0     level_1
0 2016-11-23 23:55:32  0.000000  start_date
0 2016-11-24 00:00:00  0.000000         NaN
1 2016-11-24 00:10:00  0.000000         NaN
1 2016-11-24 00:14:03  0.000000    end_date
2 2016-11-24 00:20:00  0.166667         NaN
3 2016-11-24 00:30:00  0.333333         NaN
4 2016-11-24 00:40:00  0.500000         NaN
5 2016-11-24 00:50:00  0.666667         NaN
6 2016-11-24 01:00:00  0.833333         NaN
2 2016-11-24 01:03:18  1.000000  start_date
3 2016-11-24 01:07:12  1.000000    end_date
7 2016-11-24 01:10:00  1.500000         NaN
4 2016-11-24 01:11:32  2.000000  start_date
8 2016-11-24 01:20:00  2.000000         NaN
5 2016-11-24 02:00:00  2.000000    end_date

Notice that the level_0 column now contains integers (mathematically, not in data type) for the rows that fall between a start date and an end date (this assumes that an end date will not overlap the following start date).

Now we can just filter out the rows that were originally in df1:

df_falls = dfm[(dfm['level_0'] == dfm['level_0'].astype(int)) & (dfm['level_1'].isnull())][[0,'level_0']]
df_falls.columns = ['date', 'falls_index']

And merge with the original dataframe:

df_final = pd.merge(df1, right=df_falls, on='date', how='outer')

Which gives:

print(df_final)

                 date        value  falls_index
0 2016-11-24 00:00:00  1759.199951          0.0
1 2016-11-24 00:10:00   992.400024          0.0
2 2016-11-24 00:20:00  1404.800049          NaN
3 2016-11-24 00:30:00    45.799999          NaN
4 2016-11-24 00:40:00    24.299999          NaN
5 2016-11-24 00:50:00   159.899994          NaN
6 2016-11-24 01:00:00    82.499999          NaN
7 2016-11-24 01:10:00    37.400003          NaN
8 2016-11-24 01:20:00   159.899994          2.0

Which is the same as the original dataframe, with the additional column falls_index representing the index of the row in df2 that the row falls into.
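If you only want the rows of df1 that actually fall inside some interval (matching the output of the other answers), a possible final step is to drop the non-matching rows:

df1_inside = df_final.dropna(subset=['falls_index'])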

Consider a cross join merge that returns the cartesian product between both sets (all possible row pairings, M x N). You can cross join using an all-1's key column in merge's on argument. Then run a filter on the large returned set using pd.Series.between(). Specifically, between() keeps the rows where the start date falls within the date-to-date+9:59 range, or where the date falls within the start and end times.

However, prior to the merge, create a df1['date'] column equal to the date index so that it can be a retained column after the merge and used for date filtering. Additionally, create a df2['row'] column to be used as a row indicator at the end. For the demo, the following recreates the posted df1 and df2 dataframes:

from io import StringIO
import pandas as pd
import datetime as dt

data1 = '''
date                     value
"2016-11-24 00:00:00"    1759.199951
"2016-11-24 00:10:00"     992.400024
"2016-11-24 00:20:00"    1404.800049
"2016-11-24 00:30:00"      45.799999
"2016-11-24 00:40:00"      24.299999
"2016-11-24 00:50:00"     159.899994
"2016-11-24 01:00:00"      82.499999
"2016-11-24 01:10:00"      37.400003
"2016-11-24 01:20:00"     159.899994
'''    
df1 = pd.read_table(StringIO(data1), sep='\s+', parse_dates=[0], index_col=0)
df1['key'] = 1
df1['date'] = df1.index.values

data2 = '''
start_date  end_date
"2016-11-23 23:55:32"  "2016-11-24 00:14:03"
"2016-11-24 01:03:18"  "2016-11-24 01:07:12"
"2016-11-24 01:11:32"  "2016-11-24 02:00:00"
'''    
df2 = pd.read_table(StringIO(data2), sep='\s+', parse_dates=[0,1])
df2['key'] = 1
df2['row'] = df2.index.values

# CROSS JOIN
df3 = pd.merge(df1, df2, on=['key'])

# DF FILTERING
df3 = df3[(df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True)) |
          (df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True))].set_index('date')[['value', 'row']]

print(df3)
#                            value  row
# date                                 
# 2016-11-24 00:00:00  1759.199951    0
# 2016-11-24 00:10:00   992.400024    0
# 2016-11-24 01:00:00    82.499999    1
# 2016-11-24 01:10:00    37.400003    2
# 2016-11-24 01:20:00   159.899994    2

I tried to modify @root's code to use the experimental query pandas method. It should be faster than the original implementation for very large dataframes. For small dataframes it will definitely be slower.

def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)

    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)

    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals.
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s1.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)

    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)
