
Subset df using timestamps - Python

I'm hoping to subset a df using specific timestamps plus an additional period of time. In the code below, df contains the specific timestamps that I want to use to subset df2. Essentially, I take each timestamp in df and determine the time one minute earlier. Each of these periods is then used to create an individual df, and these are concatenated to create the final df.

However, this is inefficient by itself, and becomes even more so when dealing with many timestamps.

import pandas as pd

df = pd.DataFrame({   
        'Time' : ['2020-08-02 10:01:12.5','2020-08-02 11:01:12.5','2020-08-02 12:31:00.0','2020-08-02 12:41:22.6'],             
        'ID' : ['X','Y','B','X'],                 
    })

# 1 min before timestamp
'2020-08-02 10:00:12.5' 
# first timestamp
'2020-08-02 10:01:12.5' 

# 1 min before timestamp
'2020-08-02 11:00:02.1' 
# second timestamp
'2020-08-02 11:01:02.1' 

df2 = pd.DataFrame({   
        'Time' : ['2020-08-02 10:00:00.1','2020-08-02 10:00:00.2','2020-08-02 10:00:00.3','2020-08-02 10:00:00.4'],             
        'ID' : ['','','',''],                 
    })

d1 = df2[(df2['Time'] > '2020-08-02 10:00:12.5') & (df2['Time'] <= '2020-08-02 10:01:12.5')]
d2 = df2[(df2['Time'] > '2020-08-02 11:00:02.1') & (df2['Time'] <= '2020-08-02 11:01:02.1')]

df_out = pd.concat([d1, d2])  # ...include all separate periods of time

Intended Output:

                    Time ID
   2020-08-02 10:00:12.5  
   2020-08-02 10:00:12.6  
...
   2020-08-02 10:01:12.5  X
   2020-08-02 11:00:02.1
   2020-08-02 11:00:02.2
...
   2020-08-02 11:01:02.1  Y

pandas has a merge_asof function that does just that.

Let me use slightly different timestamps from the original post to make this a bit easier to illustrate. For the purpose of this example, I'll set the df timestamps to 10:01, 10:03 and 10:06.

Let's add a 1MinBefore column to df holding the timestamp one minute before Time (we'll use it later to merge the dataframes):

df = pd.DataFrame({   
    'Time' : ['2020-08-02 10:01:00','2020-08-02 10:03:00','2020-08-02 10:06:00'],
    'ID' : ['X','Y','Z'],                 
})
df['Time'] = pd.to_datetime(df['Time'])
df['1MinBefore'] = df['Time'] - pd.Timedelta('1min')

So our df is:

                 Time ID          1MinBefore
0 2020-08-02 10:01:00  X 2020-08-02 10:00:00
1 2020-08-02 10:03:00  Y 2020-08-02 10:02:00
2 2020-08-02 10:06:00  Z 2020-08-02 10:05:00

For df2, let's use the range between 10:00 and 10:07 at 30-second intervals:

df2 = pd.DataFrame({   
    'Time' : pd.date_range(
        start='2020-08-02 10:00:00',
        end='2020-08-02 10:07:00',
        freq='30s'),
    'ID' : '',
})

And now the key step, merging these dataframes with merge_asof:

pd.merge_asof(df2[['Time']], df[['ID', '1MinBefore']],
              left_on='Time', right_on='1MinBefore',
              tolerance=pd.Timedelta('1min'))

Output:

                  Time   ID          1MinBefore
0  2020-08-02 10:00:00    X 2020-08-02 10:00:00
1  2020-08-02 10:00:30    X 2020-08-02 10:00:00
2  2020-08-02 10:01:00    X 2020-08-02 10:00:00
3  2020-08-02 10:01:30  NaN                 NaT
4  2020-08-02 10:02:00    Y 2020-08-02 10:02:00
5  2020-08-02 10:02:30    Y 2020-08-02 10:02:00
6  2020-08-02 10:03:00    Y 2020-08-02 10:02:00
7  2020-08-02 10:03:30  NaN                 NaT
8  2020-08-02 10:04:00  NaN                 NaT
9  2020-08-02 10:04:30  NaN                 NaT
10 2020-08-02 10:05:00    Z 2020-08-02 10:05:00
11 2020-08-02 10:05:30    Z 2020-08-02 10:05:00
12 2020-08-02 10:06:00    Z 2020-08-02 10:05:00
13 2020-08-02 10:06:30  NaN                 NaT
14 2020-08-02 10:07:00  NaN                 NaT

The tolerance parameter of 1 minute tells merge_asof to disregard values in df that are more than 1 minute 'older' than the timestamp in df2.
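
To see what tolerance changes, here is a minimal sketch comparing the same merge with and without it. The toy frames starts and obs and the WindowStart column are made up for this illustration and are not from the original post.

```python
import pandas as pd

# Hypothetical toy data: a single window start and three observations.
starts = pd.DataFrame({
    'WindowStart': pd.to_datetime(['2020-08-02 10:00:00']),
    'ID': ['X'],
})
obs = pd.DataFrame({
    'Time': pd.to_datetime([
        '2020-08-02 10:00:30',   # 30 s after the window start
        '2020-08-02 10:01:00',   # exactly 1 min after
        '2020-08-02 10:01:30',   # 90 s after
    ]),
})

# With a tolerance, a match further back than 1 minute is discarded (ID is NaN).
with_tol = pd.merge_asof(obs, starts, left_on='Time', right_on='WindowStart',
                         tolerance=pd.Timedelta('1min'))

# Without a tolerance, every row takes the last earlier window start,
# no matter how far back it lies.
without_tol = pd.merge_asof(obs, starts, left_on='Time', right_on='WindowStart')

print(with_tol)
print(without_tol)
```

Only the third observation loses its match when the tolerance is set, which is what produces the NaN rows in the output above.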

Now we can of course drop the 1MinBefore column and use fillna on the ID column to make it look exactly like the Intended Output in the original post.
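
If the goal is the subset itself (only the rows of df2 that fall inside one of the windows), one possible final step is to drop the unmatched rows instead of filling them. This is a sketch under that assumption, reusing the example data from this answer; the name merged and the explicit casts to datetime64[ns] are my additions.

```python
import pandas as pd

# Same example data as above. Both key columns are cast to one dtype,
# since merge_asof expects matching key dtypes.
df = pd.DataFrame({
    'Time': pd.to_datetime(['2020-08-02 10:01:00',
                            '2020-08-02 10:03:00',
                            '2020-08-02 10:06:00']).astype('datetime64[ns]'),
    'ID': ['X', 'Y', 'Z'],
})
df['1MinBefore'] = df['Time'] - pd.Timedelta('1min')

df2 = pd.DataFrame({
    'Time': pd.date_range(start='2020-08-02 10:00:00',
                          end='2020-08-02 10:07:00',
                          freq='30s').astype('datetime64[ns]'),
})

merged = pd.merge_asof(df2, df[['ID', '1MinBefore']],
                       left_on='Time', right_on='1MinBefore',
                       tolerance=pd.Timedelta('1min'))

# Rows with no ID fell outside every 1-minute window, so drop them,
# then remove the helper column.
df_out = (merged.dropna(subset=['ID'])
                .drop(columns='1MinBefore')
                .reset_index(drop=True))

print(df_out)
```

Note that this keeps the row at the window start as well (for example 10:00:00), whereas the filter in the question uses a strict > on the lower bound. Passing allow_exact_matches=False to merge_asof is one way to exclude it.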

