Pandas: select DF rows based on another DF
I have two dataframes (quite long, each with hundreds or thousands of rows). One of them, df1, contains a time series with data in 10-minute intervals. For example:

date                   value
2016-11-24 00:00:00    1759.199951
2016-11-24 00:10:00     992.400024
2016-11-24 00:20:00    1404.800049
2016-11-24 00:30:00      45.799999
2016-11-24 00:40:00      24.299999
2016-11-24 00:50:00     159.899994
2016-11-24 01:00:00      82.499999
2016-11-24 01:10:00      37.400003
2016-11-24 01:20:00     159.899994
....
While the other one, df2, contains datetime intervals:

   start_date           end_date
0  2016-11-23 23:55:32  2016-11-24 00:14:03
1  2016-11-24 01:03:18  2016-11-24 01:07:12
2  2016-11-24 01:11:32  2016-11-24 02:00:00
...
I need to select all rows in df1 that "fall" into an interval in df2. With these examples, the resulting dataframe should be:

date                   value
2016-11-24 00:00:00    1759.199951   # Fits in row 0 of df2
2016-11-24 00:10:00     992.400024   # Fits in row 0 of df2
2016-11-24 01:00:00      82.499999   # Fits in row 1 of df2
2016-11-24 01:10:00      37.400003   # Fits in row 2 of df2
2016-11-24 01:20:00     159.899994   # Fits in row 2 of df2
....
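For reference, the selection rule can be made concrete with a brute-force sketch over the sample data above (the helper name `overlaps_any` is made up for illustration; this is fine for small frames but far too slow for thousands of rows, which is what the answers below address):

```python
import pandas as pd

df1 = pd.DataFrame({
    'date': pd.to_datetime(['2016-11-24 00:00:00', '2016-11-24 00:10:00',
                            '2016-11-24 00:20:00', '2016-11-24 01:00:00',
                            '2016-11-24 01:10:00', '2016-11-24 01:20:00']),
    'value': [1759.199951, 992.400024, 1404.800049, 82.499999, 37.400003, 159.899994],
})
df2 = pd.DataFrame({
    'start_date': pd.to_datetime(['2016-11-23 23:55:32', '2016-11-24 01:03:18',
                                  '2016-11-24 01:11:32']),
    'end_date':   pd.to_datetime(['2016-11-24 00:14:03', '2016-11-24 01:07:12',
                                  '2016-11-24 02:00:00']),
})

span = pd.Timedelta(minutes=10)

def overlaps_any(start):
    # A 10-minute bar [start, start + 10min) overlaps [s, e]
    # iff s < start + 10min and start <= e.
    end = start + span
    return ((df2['start_date'] < end) & (start <= df2['end_date'])).any()

result = df1[df1['date'].apply(overlaps_any)]
```

This reproduces the expected output above: the 00:20 row is the only one that touches no interval.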
np.searchsorted: This is a variant based on np.searchsorted that seems to be an order of magnitude faster than using intervaltree or merge, assuming my larger sample data is correct.
# Ensure the df2 is sorted (skip if it's already known to be).
df2 = df2.sort_values(by=['start_date', 'end_date'])
# Add the end of the time interval to df1.
df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
# Perform the searchsorted and get the corresponding df2 values for both endpoints of df1.
s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
# Build the conditions that indicate an overlap (any True condition indicates an overlap).
cond = [
df1['date'].values <= s1['end_date'].values,
df1['date_end'].values <= s2['end_date'].values,
s1.index.values != s2.index.values
]
# Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
df1 = df1[np.any(cond, axis=0)].drop('date_end', axis=1)
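As a small illustration (toy timestamps, not from the answer) of why `side='right'` minus one picks the candidate interval:

```python
import numpy as np
import pandas as pd

starts = pd.to_datetime(['2016-11-23 23:55:32',
                         '2016-11-24 01:03:18',
                         '2016-11-24 01:11:32'])

# With side='right', searchsorted returns the number of starts <= each query,
# so subtracting 1 yields the index of the last interval beginning at or
# before the query -- the only interval that could contain it.
queries = pd.to_datetime(['2016-11-24 00:00:00', '2016-11-24 01:05:00'])
idx = np.searchsorted(starts, queries, side='right') - 1
# idx -> array([0, 1])
```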
This may need to be modified if the intervals in df2 are nested or overlap; I haven't fully thought through that case, but it may still work.
Not a pure Pandas solution, but you may want to consider building an Interval Tree from df2, and querying it against the intervals in df1 to find the ones that overlap. The intervaltree package on PyPI seems to have good performance and easy-to-use syntax.
from intervaltree import IntervalTree
# Build the Interval Tree from df2.
tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
# Build the 10 minutes spans from df1.
dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
# Query the Interval Tree to filter df1.
df1 = df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]
I converted the dates to their integer equivalents for performance reasons. I doubt the intervaltree package was built with pd.Timestamp in mind, so there are probably some intermediate conversion steps that slow things down.

Also note that intervals in the intervaltree package include the start point but not the end point. That's why I have the + [0, 1] when creating tree: I'm padding the end point by a nanosecond to make sure the true end point is actually included. It's also why it's fine to add pd.offsets.Minute(10) to get the interval end when querying the tree, instead of adding only 9m 59s.
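The effect of the one-unit padding can be seen with a tiny pure-Python model of a half-open interval (illustrative only, not the intervaltree API itself):

```python
def half_open_contains(begin, end, point):
    # Models the intervaltree convention: the start is included, the end is not.
    return begin <= point < end

raw = half_open_contains(0, 100, 100)         # end point itself is excluded
padded = half_open_contains(0, 100 + 1, 100)  # padded by 1 unit, it is included
```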
The resulting output for both methods:
date value
0 2016-11-24 00:00:00 1759.199951
1 2016-11-24 00:10:00 992.400024
6 2016-11-24 01:00:00 82.499999
7 2016-11-24 01:10:00 37.400003
8 2016-11-24 01:20:00 159.899994
The larger sample data was generated with:
# Sample df1.
n1 = 55000
df1 = pd.DataFrame({'date': pd.date_range('2016-11-24', freq='10T', periods=n1), 'value': np.random.random(n1)})
# Sample df2.
n2 = 500
df2 = pd.DataFrame({'start_date': pd.date_range('2016-11-24', freq='18H22T', periods=n2)})
# Randomly shift the start and end dates of the df2 intervals.
shift_start = pd.Series(np.random.randint(30, size=n2)).cumsum().apply(lambda s: pd.DateOffset(seconds=s))
shift_end1 = pd.Series(np.random.randint(30, size=n2)).apply(lambda s: pd.DateOffset(seconds=s))
shift_end2 = pd.Series(np.random.randint(5, 45, size=n2)).apply(lambda m: pd.DateOffset(minutes=m))
df2['start_date'] += shift_start
df2['end_date'] = df2['start_date'] + shift_end1 + shift_end2
This produces the following for df1 and df2:
df1
date value
0 2016-11-24 00:00:00 0.444939
1 2016-11-24 00:10:00 0.407554
2 2016-11-24 00:20:00 0.460148
3 2016-11-24 00:30:00 0.465239
4 2016-11-24 00:40:00 0.462691
...
54995 2017-12-10 21:50:00 0.754123
54996 2017-12-10 22:00:00 0.401820
54997 2017-12-10 22:10:00 0.146284
54998 2017-12-10 22:20:00 0.394759
54999 2017-12-10 22:30:00 0.907233
df2
start_date end_date
0 2016-11-24 00:00:19 2016-11-24 00:41:24
1 2016-11-24 18:22:44 2016-11-24 18:36:44
2 2016-11-25 12:44:44 2016-11-25 13:03:13
3 2016-11-26 07:07:05 2016-11-26 07:49:29
4 2016-11-27 01:29:31 2016-11-27 01:34:32
...
495 2017-12-07 21:36:04 2017-12-07 22:14:29
496 2017-12-08 15:58:14 2017-12-08 16:10:35
497 2017-12-09 10:20:21 2017-12-09 10:26:40
498 2017-12-10 04:42:41 2017-12-10 05:22:47
499 2017-12-10 23:04:42 2017-12-10 23:44:53
and was timed with the following functions:
def root_searchsorted(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # Build the conditions that indicate an overlap (any True condition indicates an overlap).
    cond = [
        df1['date'].values <= s1['end_date'].values,
        df1['date_end'].values <= s2['end_date'].values,
        s1.index.values != s2.index.values
        ]
    # Filter df1 to only the overlapping intervals, and drop the extra 'date_end' column.
    return df1[np.any(cond, axis=0)].drop('date_end', axis=1)

def root_intervaltree(df1, df2):
    # Build the Interval Tree.
    tree = IntervalTree.from_tuples(df2.astype('int64').values + [0, 1])
    # Build the 10 minutes spans from df1.
    dt_pairs = pd.concat([df1['date'], df1['date'] + pd.offsets.Minute(10)], axis=1)
    # Query the Interval Tree to filter the DataFrame.
    return df1[[tree.overlaps(*p) for p in dt_pairs.astype('int64').values]]

def ptrj(df1, df2):
    # The smallest amount of time - handy when using open intervals:
    epsilon = pd.Timedelta(1, 'ns')
    # Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
    sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
    edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values
    return df1[mask]

def parfait(df1, df2):
    df1['key'] = 1
    df2['key'] = 1
    df2['row'] = df2.index.values
    # CROSS JOIN
    df3 = pd.merge(df1, df2, on=['key'])
    # DF FILTERING
    return df3[df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True) | df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True)].set_index('date')[['value', 'row']]

def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals.
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s2.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)
    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)
I get the following timings:
%timeit root_searchsorted(df1.copy(), df2.copy())
100 loops best of 3: 9.55 ms per loop
%timeit root_searchsorted_modified(df1.copy(), df2.copy())
100 loops best of 3: 13.5 ms per loop
%timeit ptrj(df1.copy(), df2.copy())
100 loops best of 3: 18.5 ms per loop
%timeit root_intervaltree(df1.copy(), df2.copy())
1 loop best of 3: 4.02 s per loop
%timeit parfait(df1.copy(), df2.copy())
1 loop best of 3: 8.96 s per loop
This solution (I believe it works) uses pandas.Series.asof. Under the hood, it's some version of searchsorted - but for some reason it runs four times faster than @root's comparable function.

I assume that all date columns are in the pandas datetime format, sorted, and that the df2 intervals are non-overlapping.

The code is pretty short but somewhat intricate (explanation below).
# The smallest amount of time - handy when using open intervals:
epsilon = pd.Timedelta(1, 'ns')
# Lookup series (`asof` works best with series) for `start_date` and `end_date` from `df2`:
sdate = pd.Series(data=range(df2.shape[0]), index=df2.start_date)
edate = pd.Series(data=range(df2.shape[0]), index=df2.end_date + epsilon)
# The main function (see explanation below):
def get_it(df1):
    # (filling NaN's with -1)
    l = edate.asof(df1.date).fillna(-1)
    r = sdate.asof(df1.date + (pd.Timedelta(10, 'm') - epsilon)).fillna(-1)
    # (taking `values` here to skip indexes, which are different)
    mask = l.values < r.values
    return df1[mask]
The advantage of this approach is twofold: sdate and edate are evaluated only once, and the main function can take df1 in chunks if df1 is very large.
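The chunking idea can be sketched like this (the `apply_in_chunks` helper is hypothetical, and a toy filter stands in for get_it; any function that filters a frame chunk works the same way):

```python
import pandas as pd

def apply_in_chunks(df, func, chunk_size):
    # Filter each row-chunk independently, then stitch the results back together.
    parts = [func(df.iloc[i:i + chunk_size]) for i in range(0, len(df), chunk_size)]
    return pd.concat(parts)

# Toy stand-in for get_it: keep rows with even values.
df = pd.DataFrame({'value': range(10)})
out = apply_in_chunks(df, lambda d: d[d['value'] % 2 == 0], chunk_size=4)
```

Because the mask for each row depends only on that row's dates (via the precomputed sdate/edate lookups), chunked and whole-frame processing give the same result.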
Explanation
pandas.Series.asof returns the last valid row for a given index. It can take an array as input and is quite fast.
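A quick toy illustration of how asof behaves:

```python
import pandas as pd

s = pd.Series([0, 1, 2], index=pd.to_datetime(['2016-11-24 00:00:00',
                                               '2016-11-24 01:00:00',
                                               '2016-11-24 02:00:00']))

# asof returns the value at the last index position at or before the query;
# a query earlier than the first index yields NaN.
hit = s.asof(pd.Timestamp('2016-11-24 01:30:00'))    # value at 01:00 -> 1
early = s.asof(pd.Timestamp('2016-11-23 00:00:00'))  # before the index -> NaN
```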
For the sake of this explanation, let s[j] = sdate.index[j] be the jth date in sdate and let x be some arbitrary date (timestamp). There is always s[sdate.asof(x)] <= x (this is exactly how asof works), and it's not hard to show that:

1. j <= sdate.asof(x) if and only if s[j] <= x
2. sdate.asof(x) < j if and only if x < s[j]

Similarly for edate. Unfortunately, we can't have the same kind of inequality (weak or strict) in both 1. and 2.
Two intervals [a, b) and [x, y] intersect iff x < b and a <= y. (We may think of a and b as coming from sdate.index and edate.index - the interval [a, b) is chosen closed-open because of properties 1. and 2.) In our case, x is a date from df1, y = x + 10min - epsilon, a = s[j], and b = e[j] (note that epsilon has been added to edate), where j is some number.

So, finally, the condition equivalent to "[a, b) and [x, y] intersect" is "sdate.asof(x) < j and j <= edate.asof(y) for some number j". And it roughly boils down to l < r inside the function get_it (modulo some technicalities).
This isn't trivial, but you can do the following:

First take the relevant date columns from the two dataframes and concatenate them together so that one column is all the dates and the other two columns represent the indexes from df2. (Note that df2 gets a multi-index after stacking.)
dfm = pd.concat((df1['date'],df2.stack().reset_index())).sort_values(0)
print(dfm)
0 level_0 level_1
0 2016-11-23 23:55:32 0.0 start_date
0 2016-11-24 00:00:00 NaN NaN
1 2016-11-24 00:10:00 NaN NaN
1 2016-11-24 00:14:03 0.0 end_date
2 2016-11-24 00:20:00 NaN NaN
3 2016-11-24 00:30:00 NaN NaN
4 2016-11-24 00:40:00 NaN NaN
5 2016-11-24 00:50:00 NaN NaN
6 2016-11-24 01:00:00 NaN NaN
2 2016-11-24 01:03:18 1.0 start_date
3 2016-11-24 01:07:12 1.0 end_date
7 2016-11-24 01:10:00 NaN NaN
4 2016-11-24 01:11:32 2.0 start_date
8 2016-11-24 01:20:00 NaN NaN
5 2016-11-24 02:00:00 2.0 end_date
You can see that the values from df1 have NaN in the right two columns and, since we have sorted the dates, these rows fall in between the start_date and end_date rows (from df2).
To indicate that the rows from df1 fall between the rows from df2, we can interpolate the level_0 column, which gives us:
dfm['level_0'] = dfm['level_0'].interpolate()
0 level_0 level_1
0 2016-11-23 23:55:32 0.000000 start_date
0 2016-11-24 00:00:00 0.000000 NaN
1 2016-11-24 00:10:00 0.000000 NaN
1 2016-11-24 00:14:03 0.000000 end_date
2 2016-11-24 00:20:00 0.166667 NaN
3 2016-11-24 00:30:00 0.333333 NaN
4 2016-11-24 00:40:00 0.500000 NaN
5 2016-11-24 00:50:00 0.666667 NaN
6 2016-11-24 01:00:00 0.833333 NaN
2 2016-11-24 01:03:18 1.000000 start_date
3 2016-11-24 01:07:12 1.000000 end_date
7 2016-11-24 01:10:00 1.500000 NaN
4 2016-11-24 01:11:32 2.000000 start_date
8 2016-11-24 01:20:00 2.000000 NaN
5 2016-11-24 02:00:00 2.000000 end_date
Notice that the level_0 column now contains integers (mathematically, not by data type) for the rows that fall between a start date and an end date (this assumes that an end date will not overlap the following start date).
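The whole-number test used in the filter below can be seen on a toy series (hypothetical values mimicking the interpolated column):

```python
import pandas as pd

s = pd.Series([0.0, 0.166667, 1.0, 1.5, 2.0])

# A row sits between a matching start_date/end_date pair exactly when its
# interpolated index is a whole number; astype(int) truncates, so the
# comparison is True only for exact integers.
is_whole = s == s.astype(int)
```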
Now we can filter out the rows that were originally in df1:
df_falls = dfm[(dfm['level_0'] == dfm['level_0'].astype(int)) & (dfm['level_1'].isnull())][[0,'level_0']]
df_falls.columns = ['date', 'falls_index']
and merge back with the original dataframe:
df_final = pd.merge(df1, right=df_falls, on='date', how='outer')
which gives:
print(df_final)
date value falls_index
0 2016-11-24 00:00:00 1759.199951 0.0
1 2016-11-24 00:10:00 992.400024 0.0
2 2016-11-24 00:20:00 1404.800049 NaN
3 2016-11-24 00:30:00 45.799999 NaN
4 2016-11-24 00:40:00 24.299999 NaN
5 2016-11-24 00:50:00 159.899994 NaN
6 2016-11-24 01:00:00 82.499999 NaN
7 2016-11-24 01:10:00 37.400003 NaN
8 2016-11-24 01:20:00 159.899994 2.0
This is the same as the original dataframe with the extra column falls_index, representing the index of the row in df2 that each row falls into.
Consider a cross join merge that returns the cartesian product between both sets (all possible row pairings, M x N). You can cross join using an all-1 key column in merge's on argument. Then, run a filter on the large returned set using pd.Series.between(). Specifically, between() keeps the rows where the start date falls within the 9:59 range of date, or where date falls within the start and end times.

However, prior to the merge, create a df1['date'] column equal to the date index so that it can be a retained column after the merge and used for date filtering. Additionally, create a df2['row'] column to be used as a row indicator at the end. For the demo, the posted df1 and df2 dataframes are recreated below:
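The all-ones key trick can be seen on toy frames (made-up data, just to show the cartesian product):

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2, 3]})
right = pd.DataFrame({'b': ['x', 'y']})

# Merging on a constant key pairs every left row with every right row,
# producing the M x N cartesian product (here 3 x 2 = 6 rows).
left['key'] = 1
right['key'] = 1
cross = pd.merge(left, right, on='key').drop(columns='key')
```

The result is then cut back down with a row-wise filter, which is why this approach is memory-hungry on large inputs.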
from io import StringIO
import pandas as pd
import datetime as dt
data1 = '''
date value
"2016-11-24 00:00:00" 1759.199951
"2016-11-24 00:10:00" 992.400024
"2016-11-24 00:20:00" 1404.800049
"2016-11-24 00:30:00" 45.799999
"2016-11-24 00:40:00" 24.299999
"2016-11-24 00:50:00" 159.899994
"2016-11-24 01:00:00" 82.499999
"2016-11-24 01:10:00" 37.400003
"2016-11-24 01:20:00" 159.899994
'''
df1 = pd.read_table(StringIO(data1), sep='\s+', parse_dates=[0], index_col=0)
df1['key'] = 1
df1['date'] = df1.index.values
data2 = '''
start_date end_date
"2016-11-23 23:55:32" "2016-11-24 00:14:03"
"2016-11-24 01:03:18" "2016-11-24 01:07:12"
"2016-11-24 01:11:32" "2016-11-24 02:00:00"
'''
df2 = pd.read_table(StringIO(data2), sep='\s+', parse_dates=[0,1])
df2['key'] = 1
df2['row'] = df2.index.values
# CROSS JOIN
df3 = pd.merge(df1, df2, on=['key'])
# DF FILTERING
df3 = df3[(df3['start_date'].between(df3['date'], df3['date'] + dt.timedelta(minutes=9, seconds=59), inclusive=True)) |
          (df3['date'].between(df3['start_date'], df3['end_date'], inclusive=True))].set_index('date')[['value', 'row']]
print(df3)
# value row
# date
# 2016-11-24 00:00:00 1759.199951 0
# 2016-11-24 00:10:00 992.400024 0
# 2016-11-24 01:00:00 82.499999 1
# 2016-11-24 01:10:00 37.400003 2
# 2016-11-24 01:20:00 159.899994 2
I tried to modify @root's code using the experimental query pandas method. For very large dataframes it should be faster than the original implementation; for small dataframes it will definitely be slower.
def root_searchsorted_modified(df1, df2):
    # Add the end of the time interval to df1.
    df1['date_end'] = df1['date'] + pd.DateOffset(minutes=9, seconds=59)
    # Get the insertion indexes for the endpoints of the intervals from df1.
    s1 = df2.reindex(np.searchsorted(df2['start_date'], df1['date'], side='right')-1)
    s2 = df2.reindex(np.searchsorted(df2['start_date'], df1['date_end'], side='right')-1)
    # ---- further is the MODIFIED code ----
    # Filter df1 to only overlapping intervals.
    df1.query('(date <= @s1.end_date.values) |\
               (date_end <= @s2.end_date.values) |\
               (@s1.index.values != @s2.index.values)', inplace=True)
    # Drop the extra 'date_end' column.
    return df1.drop('date_end', axis=1)
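How query references Python variables with @ can be seen on a small frame (toy data; comparisons against an equal-length array are applied element-wise):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [5, 15, 25]})
limit = 20
bounds = np.array([10, 10, 30])

# `@name` inside a query string pulls `name` from the surrounding Python scope.
scalar_hits = df.query('value < @limit')    # rows where value < 20
array_hits = df.query('value <= @bounds')   # element-wise against the array
```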