Subset df using timestamps - Python
I'm hoping to subset a df using specific timestamps plus an additional period of time. In the example below, df contains the specific timestamps that I want to use to subset df2. Essentially, I take each timestamp in df and determine the time one minute before it. Each of these one-minute periods is then used to create an individual df, and these are concatenated together to create the final df.
However, this is inefficient even for a single period, and becomes more so when dealing with many timestamps.
import pandas as pd
df = pd.DataFrame({
'Time' : ['2020-08-02 10:01:12.5','2020-08-02 11:01:12.5','2020-08-02 12:31:00.0','2020-08-02 12:41:22.6'],
'ID' : ['X','Y','B','X'],
})
# 1 min before timestamp
'2020-08-02 10:00:12.5'
# first timestamp
'2020-08-02 10:01:12.5'
# 1 min before timestamp
'2020-08-02 11:00:02.1'
# second timestamp
'2020-08-02 11:01:02.1'
df2 = pd.DataFrame({
'Time' : ['2020-08-02 10:00:00.1','2020-08-02 10:00:00.2','2020-08-02 10:00:00.3','2020-08-02 10:00:00.4'],
'ID' : ['','','',''],
})
d1 = df2[(df2['Time'] > '2020-08-02 10:00:12.5') & (df2['Time'] <= '2020-08-02 10:01:12.5')]
d2 = df2[(df2['Time'] > '2020-08-02 11:00:02.1') & (df2['Time'] <= '2020-08-02 11:01:02.1')]
df_out = pd.concat([d1,d2])#...include all separate periods of time
Intended output:
Time ID
2020-08-02 10:00:12.5
2020-08-02 10:00:12.6
...
2020-08-02 10:01:12.5 X
2020-08-02 11:00:02.1
2020-08-02 11:00:02.2
...
2020-08-02 11:01:02.1 Y
The merge_asof method in pandas does just that.
Let me use slightly different timestamps from the original post to make this easier to illustrate. For this example, I'll set the df timestamps to 10:01, 10:03 and 10:06.
Let's add a 1MinBefore column to df holding the timestamp one minute before Time (we'll use it later to merge the dataframes):
df = pd.DataFrame({
'Time' : ['2020-08-02 10:01:00','2020-08-02 10:03:00','2020-08-02 10:06:00'],
'ID' : ['X','Y','Z'],
})
df['Time'] = pd.to_datetime(df['Time'])
df['1MinBefore'] = df['Time'] - pd.Timedelta('1min')
So our df is:
Time ID 1MinBefore
0 2020-08-02 10:01:00 X 2020-08-02 10:00:00
1 2020-08-02 10:03:00 Y 2020-08-02 10:02:00
2 2020-08-02 10:06:00 Z 2020-08-02 10:05:00
For df2, let's use the range between 10:00 and 10:07 at 30-second intervals:
df2 = pd.DataFrame({
'Time' : pd.date_range(
start='2020-08-02 10:00:00',
end='2020-08-02 10:07:00',
freq='30s'),
'ID' : '',
})
And now the key step, merging these dataframes with merge_asof:
pd.merge_asof(df2[['Time']], df[['ID', '1MinBefore']],
              left_on='Time', right_on='1MinBefore',
              tolerance=pd.Timedelta('1min'))
Output:
Time ID 1MinBefore
0 2020-08-02 10:00:00 X 2020-08-02 10:00:00
1 2020-08-02 10:00:30 X 2020-08-02 10:00:00
2 2020-08-02 10:01:00 X 2020-08-02 10:00:00
3 2020-08-02 10:01:30 NaN NaT
4 2020-08-02 10:02:00 Y 2020-08-02 10:02:00
5 2020-08-02 10:02:30 Y 2020-08-02 10:02:00
6 2020-08-02 10:03:00 Y 2020-08-02 10:02:00
7 2020-08-02 10:03:30 NaN NaT
8 2020-08-02 10:04:00 NaN NaT
9 2020-08-02 10:04:30 NaN NaT
10 2020-08-02 10:05:00 Z 2020-08-02 10:05:00
11 2020-08-02 10:05:30 Z 2020-08-02 10:05:00
12 2020-08-02 10:06:00 Z 2020-08-02 10:05:00
13 2020-08-02 10:06:30 NaN NaT
14 2020-08-02 10:07:00 NaN NaT
The tolerance parameter of 1 minute tells merge_asof to disregard any match in df whose 1MinBefore value is more than 1 minute older than the Time of the df2 row.
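To make the effect of tolerance concrete, here is a small sketch that runs the same merge three ways. The frame names (rows, starts) and the sample times are made up for illustration and are not from the original post. The last variant uses allow_exact_matches=False, which should also exclude rows that sit exactly on a window start, closer to the `>` comparison used in the question.

```python
import pandas as pd

# Hypothetical sample data: a 30-second grid of rows to label ...
grid = pd.date_range('2020-08-02 10:00:00', '2020-08-02 10:03:00', freq='30s')
rows = pd.DataFrame({'Time': grid})
# ... and two window starts (10:00 and 10:02) taken from the same grid,
# so both key columns share one dtype.
starts = pd.DataFrame({'start': grid[[0, 4]], 'ID': ['X', 'Y']})

# No tolerance: each row takes the most recent start at or before it,
# no matter how old that start is.
no_tol = pd.merge_asof(rows, starts, left_on='Time', right_on='start')

# With tolerance: a match more than 1 minute old is discarded (NaN / NaT).
with_tol = pd.merge_asof(rows, starts, left_on='Time', right_on='start',
                         tolerance=pd.Timedelta('1min'))

# Excluding exact matches: a row equal to a window start is not matched
# to that start.
strict = pd.merge_asof(rows, starts, left_on='Time', right_on='start',
                       tolerance=pd.Timedelta('1min'),
                       allow_exact_matches=False)

print(with_tol)
```

Only the 10:01:30 row differs between no_tol and with_tol here, because it is the only row more than one minute past the latest start.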
Now we can of course drop the 1MinBefore column and use fillna on the ID column to make it look exactly like the intended output in the original post.
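A minimal sketch of that cleanup, assuming the same sample values as above. The names merged, all_rows and in_window are introduced here for illustration only. The second option, which keeps only the rows inside a window, is an assumption about what the final subset should contain rather than something stated in the post.

```python
import pandas as pd

# Rebuild the example: a 30-second grid for df2 and window starts at
# 10:00, 10:02 and 10:05 (picked from the same grid so the key dtypes match).
grid = pd.date_range('2020-08-02 10:00:00', '2020-08-02 10:07:00', freq='30s')
df2 = pd.DataFrame({'Time': grid})
df = pd.DataFrame({'1MinBefore': grid[[0, 4, 10]], 'ID': ['X', 'Y', 'Z']})

merged = pd.merge_asof(df2, df, left_on='Time', right_on='1MinBefore',
                       tolerance=pd.Timedelta('1min'))

# Option 1: keep every row of df2 and replace unmatched IDs with ''.
all_rows = merged.drop(columns='1MinBefore').fillna({'ID': ''})

# Option 2 (assumed goal): keep only rows that fall inside some window.
in_window = (merged.dropna(subset=['ID'])
                   .drop(columns='1MinBefore')
                   .reset_index(drop=True))

print(in_window)
```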