pandas: Remove all rows within time interval of another series's time index (i.e. time range exclusion)
Suppose I have two dataframes:
#df1
time
2016-09-12 13:00:00.017 1.0
2016-09-12 13:00:03.233 1.0
2016-09-12 13:00:10.256 1.0
2016-09-12 13:00:19.605 1.0
#df2
time
2016-09-12 13:00:00.017 1.0
2016-09-12 13:00:00.233 0.0
2016-09-12 13:00:01.016 1.0
2016-09-12 13:00:01.505 0.0
2016-09-12 13:00:06.017 1.0
2016-09-12 13:00:07.233 0.0
2016-09-12 13:00:08.256 1.0
2016-09-12 13:00:19.705 0.0
I want to remove all rows in df2 that are within +1 second of the time indices in df1, yielding:
#result
time
2016-09-12 13:00:01.505 0.0
2016-09-12 13:00:06.017 1.0
2016-09-12 13:00:07.233 0.0
2016-09-12 13:00:08.256 1.0
What's the most efficient way to do this? I don't see anything useful for time range exclusions in the API.
You can use pd.merge_asof, which was added in pandas 0.19.0 and accepts a tolerance argument that restricts matches to keys within the specified time interval.
import pandas as pd

# Assuming time is set as the index of both DataFrames
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
# Keep only the df2 rows that found no df1 match within 1 second
df2.loc[pd.merge_asof(df2, df1, on='time', tolerance=pd.Timedelta('1s')).isnull().any(axis=1)]
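For anyone who wants to try the call above end to end, here is a minimal sketch that rebuilds the question's data and applies the same merge_asof filter. The column name "value" is an assumption (the frames in the question show no column label), and the copies left and right are used instead of resetting the index in place.

```python
import pandas as pd

# Sample data from the question; the column name "value" is assumed.
df1 = pd.DataFrame(
    {"value": [1.0, 1.0, 1.0, 1.0]},
    index=pd.DatetimeIndex([
        "2016-09-12 13:00:00.017",
        "2016-09-12 13:00:03.233",
        "2016-09-12 13:00:10.256",
        "2016-09-12 13:00:19.605",
    ], name="time"),
)
df2 = pd.DataFrame(
    {"value": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]},
    index=pd.DatetimeIndex([
        "2016-09-12 13:00:00.017",
        "2016-09-12 13:00:00.233",
        "2016-09-12 13:00:01.016",
        "2016-09-12 13:00:01.505",
        "2016-09-12 13:00:06.017",
        "2016-09-12 13:00:07.233",
        "2016-09-12 13:00:08.256",
        "2016-09-12 13:00:19.705",
    ], name="time"),
)

# merge_asof needs the key as a regular, sorted column.
left = df2.reset_index()
right = df1.reset_index()

# Backward match: for each df2 row, the last df1 row at most 1 s earlier.
merged = pd.merge_asof(left, right, on="time", tolerance=pd.Timedelta("1s"))

# The overlapping column gets the default suffixes: value_x (df2), value_y (df1).
# A missing value_y means no df1 timestamp was found inside the window.
unmatched = merged["value_y"].isnull()
result = left.loc[unmatched.to_numpy()].set_index("time")
print(result)
```

With this data the four rows at 13:00:01.505, 13:00:06.017, 13:00:07.233 and 13:00:08.256 remain, which agrees with the result requested in the question.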
Note that matching is carried out in the backward direction by default, which means that for each row of the left DataFrame (df2), the selected row is the last one in the right DataFrame (df1) whose "on" key (here "time") is less than or equal to the left's key. Hence, the tolerance parameter extends only in this direction (backward), resulting in a one-sided (-) range of matching.
To allow both forward and backward lookups, starting with 0.20.0 you can pass the direction='nearest' argument in the function call. The tolerance then extends both ways, resulting in a +/- range of matching.
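To make the difference between the two directions concrete, here is a small sketch with one hypothetical row on each side (the column names "value" and "flag" are made up for illustration). The left timestamp falls 0.1 s before the right one, so only the two-sided lookup finds a match.

```python
import pandas as pd

# Hypothetical single-row frames: the left key is 0.1 s BEFORE the right key.
left = pd.DataFrame({
    "time": pd.to_datetime(["2016-09-12 13:00:19.505"]),
    "value": [0.0],
})
right = pd.DataFrame({
    "time": pd.to_datetime(["2016-09-12 13:00:19.605"]),
    "flag": [1.0],
})
tol = pd.Timedelta("1s")

# Default direction='backward': only right keys <= left key are considered.
backward = pd.merge_asof(left, right, on="time", tolerance=tol)

# direction='nearest': the closest right key on either side, within tol.
nearest = pd.merge_asof(left, right, on="time", tolerance=tol, direction="nearest")

print(backward["flag"].isnull().iloc[0])  # True  (no match, row would be kept)
print(nearest["flag"].isnull().iloc[0])   # False (match, row would be dropped)
```

For the question as asked (rows up to +1 second after a df1 timestamp), the default backward direction already matches the requirement. Using 'nearest' would additionally drop rows up to 1 second before a df1 timestamp.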
Similar idea as @Nickil Maveli, but using reindex to build a Boolean indexer:
df2 = df2[df1.reindex(df2.index, method='nearest', tolerance=pd.Timedelta('1s')).isnull()]
The resulting output:
time
2016-09-12 13:00:01.505 0.0
2016-09-12 13:00:06.017 1.0
2016-09-12 13:00:07.233 0.0
2016-09-12 13:00:08.256 1.0
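A self-contained sketch of the reindex approach, assuming both objects are Series indexed by time (the names s1 and s2 are introduced here for illustration). With method='nearest' the tolerance is applied on both sides of each timestamp.

```python
import pandas as pd

# Sample data from the question, modelled as Series (an assumption).
s1 = pd.Series(1.0, index=pd.DatetimeIndex([
    "2016-09-12 13:00:00.017",
    "2016-09-12 13:00:03.233",
    "2016-09-12 13:00:10.256",
    "2016-09-12 13:00:19.605",
], name="time"))
s2 = pd.Series(
    [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    index=pd.DatetimeIndex([
        "2016-09-12 13:00:00.017",
        "2016-09-12 13:00:00.233",
        "2016-09-12 13:00:01.016",
        "2016-09-12 13:00:01.505",
        "2016-09-12 13:00:06.017",
        "2016-09-12 13:00:07.233",
        "2016-09-12 13:00:08.256",
        "2016-09-12 13:00:19.705",
    ], name="time"),
)

# Look up s1 at every s2 timestamp; entries farther than 1 s from any
# s1 timestamp come back as NaN.
aligned = s1.reindex(s2.index, method="nearest", tolerance=pd.Timedelta("1s"))

# NaN in `aligned` means "no s1 timestamp nearby", so those rows are kept.
kept = s2[aligned.isnull()]
print(kept)
```

If df2 were a DataFrame rather than a Series, the mask would need to be a one-dimensional Boolean Series (for example, built from a single column of the reindexed frame) for row filtering to work as intended.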
One way to do it would be to look up via time indexing (assuming both time columns are indices):
td = pd.to_timedelta(1, unit='s')
# True where df1 has at least one timestamp in [t - 1s, t]
mask = df2.apply(lambda row: df1[row.name - td:row.name].size > 0, axis=1)
df2[~mask]
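Since the row-wise apply runs a Python-level slice per row, a vectorized variant of the same window test may be worth considering for large inputs. The sketch below is not from the answer above: it swaps in numpy.searchsorted on the sorted timestamps, and the names idx1, idx2, drop and kept are hypothetical.

```python
import numpy as np
import pandas as pd

# Timestamps from the question; both indexes must be sorted.
idx1 = pd.DatetimeIndex([
    "2016-09-12 13:00:00.017",
    "2016-09-12 13:00:03.233",
    "2016-09-12 13:00:10.256",
    "2016-09-12 13:00:19.605",
])
idx2 = pd.DatetimeIndex([
    "2016-09-12 13:00:00.017",
    "2016-09-12 13:00:00.233",
    "2016-09-12 13:00:01.016",
    "2016-09-12 13:00:01.505",
    "2016-09-12 13:00:06.017",
    "2016-09-12 13:00:07.233",
    "2016-09-12 13:00:08.256",
    "2016-09-12 13:00:19.705",
])
td = pd.Timedelta("1s")

t1 = idx1.values
# hi: number of idx1 timestamps <= t; lo: number of idx1 timestamps < t - 1s.
hi = np.searchsorted(t1, idx2.values, side="right")
lo = np.searchsorted(t1, (idx2 - td).values, side="left")

# hi > lo means at least one idx1 timestamp lies in [t - 1s, t].
drop = hi > lo
kept = idx2[~drop]
print(kept)
```

The Boolean array drop can be used to filter the original object positionally, for example df2[~drop], under the assumption that df2 is indexed by idx2.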