
Compare and match range of timestamps in two different pandas dataframes

How can I compare and match the beginning and end of two ranges of timestamps in two different dataframes, when the frequency of the timestamps varies and it is not known which range starts earlier and finishes later? The unmatched beginning and end should then be discarded, so that the two ranges are the same. It is easy to do manually in a txt file, but how can it be done in Python with pandas dataframes?

Sample first dataframe:

                         0                          1
0      2022-10-30 14:11:57
1      2022-10-30 14:11:57
2      2022-10-30 14:11:57
3      2022-10-30 14:11:58
4      2022-10-30 14:11:59
                   ...                        ...
149801 2022-10-30 15:22:11
149802 2022-10-30 15:22:11
149803 2022-10-30 15:22:11
149804 2022-10-30 15:22:11
149805 2022-10-30 15:22:11

[149806 rows x 2 columns]

Sample second dataframe:

                        0                          1
0     2022-10-30 14:11:59
1     2022-10-30 14:11:59
2     2022-10-30 14:12:00
3     2022-10-30 14:12:00
4     2022-10-30 14:12:00
                  ...                        ...
21065 2022-10-30 15:22:11
21066 2022-10-30 15:22:11
21067 2022-10-30 15:22:12
21068 2022-10-30 15:22:13
21069 2022-10-30 15:22:13

Column 1 is filled with data.

Comparing two timestamps in a specific row would look like this:

if first_df[0].iloc[0] == second_df[0].iloc[0]:
    print('hit')
else:
    print('miss')

How can this be done over the full range, so that the unmatched beginning and end are discarded while everything in between is preserved?

Sample match of those two ranges. First dataframe:

                         0                          1
4      2022-10-30 14:11:59
                   ...                        ...
149801 2022-10-30 15:22:11
149802 2022-10-30 15:22:11
149803 2022-10-30 15:22:11
149804 2022-10-30 15:22:11
149805 2022-10-30 15:22:11

Second dataframe:

                        0                          1
0     2022-10-30 14:11:59
1     2022-10-30 14:11:59
2     2022-10-30 14:12:00
3     2022-10-30 14:12:00
4     2022-10-30 14:12:00
                  ...                        ...
21065 2022-10-30 15:22:11
21066 2022-10-30 15:22:11

Edit:

Consider this code (note that the frequency of timestamps differs between the two dataframes):

import pandas as pd
from datetime import datetime

df1 = pd.DataFrame({'val_1' : [10,11,12,13,14,15]}, 
                   index = [pd.DatetimeIndex([datetime.strptime(s, '%Y-%m-%d %H:%M:%S')])[0] 
                            for s in ['2022-11-12 09:03:59',
                                      '2022-11-12 09:03:59',
                                      '2022-11-12 09:03:59',
                                      '2022-11-12 09:04:00',
                                      '2022-11-12 09:04:01',
                                      '2022-11-12 09:04:02' 
                                      ] ])

df2 = pd.DataFrame({'val_2': [11,22,33,44]},
                   index = [pd.DatetimeIndex([datetime.strptime(s, '%Y-%m-%d %H:%M:%S')])[0] 
                            for s in ['2022-11-12 09:03:58',
                                      '2022-11-12 09:03:59',
                                      '2022-11-12 09:03:59',
                                      '2022-11-12 09:04:00',
                                      ] ])

What I would like as a result is this:

                     val_1  val_2
2022-11-12 09:03:59     10    NaN
2022-11-12 09:03:59     11     22
2022-11-12 09:03:59     12     33
2022-11-12 09:04:00     13     44

or, separately, df1:

2022-11-12 09:03:59     10
2022-11-12 09:03:59     11
2022-11-12 09:03:59     12
2022-11-12 09:04:00     13

and df2:

2022-11-12 09:03:59     22
2022-11-12 09:03:59     33
2022-11-12 09:04:00     44

I tried both join and merge with probably every combination of options and could not get this result.

New answer on the new example data:

The problem with merging here is that you have duplicated dates in the index, so rows cannot be assigned to each other unambiguously.
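To illustrate the ambiguity, here is a minimal sketch that reuses only the overlapping timestamps from the question's example. The names `left`, `right` and `joined` are made up for this illustration. With a duplicated index, an inner merge pairs every left row with every right row that shares the timestamp, so rows get multiplied instead of aligned one to one:

```python
import pandas as pd

# Overlapping part of the question's data: 09:03:59 occurs 3 times on the
# left and 2 times on the right, 09:04:00 occurs once on each side.
idx1 = pd.to_datetime(['2022-11-12 09:03:59'] * 3 + ['2022-11-12 09:04:00'])
idx2 = pd.to_datetime(['2022-11-12 09:03:59'] * 2 + ['2022-11-12 09:04:00'])
left = pd.DataFrame({'val_1': [10, 11, 12, 13]}, index=idx1)
right = pd.DataFrame({'val_2': [22, 33, 44]}, index=idx2)

# Many-to-many join on the index: 3 * 2 rows for 09:03:59 plus 1 for 09:04:00.
joined = left.merge(right, how='inner', left_index=True, right_index=True)
print(len(joined))  # 7 rows instead of the 4 rows of `left`
print(joined)
```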

But you can trim them separately, as you suggested in the beginning. You said you don't know which of the two dataframes starts earlier or ends later. For the lower bound, take the minimum of each index and keep the larger of the two. For the upper bound, take the maximum of each index and keep the smaller of the two. Then slice both dataframes with the lower and upper bound.

lower, upper = max(df1.index.min(), df2.index.min()), min(df1.index.max(), df2.index.max())

df1 = df1.loc[lower:upper]
print(df1)

                     val_1
2022-11-12 09:03:59     10
2022-11-12 09:03:59     11
2022-11-12 09:03:59     12
2022-11-12 09:04:00     13

df2 = df2.loc[lower:upper]
print(df2)

                     val_2
2022-11-12 09:03:59     22
2022-11-12 09:03:59     33
2022-11-12 09:04:00     44
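The same idea can be wrapped in a small helper. This is only a sketch: the function names `trim_to_common_range` and `trim_column_to_common_range` are hypothetical. The first one assumes the timestamps are in the index and sorts it with a stable sort first, since label slicing on a duplicated index needs it to be monotonic. The second one assumes the timestamps are in a regular column, as in the first sample of the question (column `0`), and uses a boolean mask instead of slicing.

```python
import pandas as pd

def trim_to_common_range(df_a, df_b):
    """Hypothetical helper: keep only rows inside the shared index range."""
    # A stable sort keeps the original order of rows with equal timestamps.
    df_a = df_a.sort_index(kind='mergesort')
    df_b = df_b.sort_index(kind='mergesort')
    lower = max(df_a.index.min(), df_b.index.min())
    upper = min(df_a.index.max(), df_b.index.max())
    # Label slicing with .loc includes both end points.
    return df_a.loc[lower:upper], df_b.loc[lower:upper]

def trim_column_to_common_range(df_a, df_b, col=0):
    """Hypothetical variant for timestamps stored in column `col`."""
    lower = max(df_a[col].min(), df_b[col].min())
    upper = min(df_a[col].max(), df_b[col].max())
    # Series.between is inclusive on both sides by default.
    return (df_a[df_a[col].between(lower, upper)],
            df_b[df_b[col].between(lower, upper)])

# Example data from the question.
df1 = pd.DataFrame(
    {'val_1': [10, 11, 12, 13, 14, 15]},
    index=pd.to_datetime(['2022-11-12 09:03:59', '2022-11-12 09:03:59',
                          '2022-11-12 09:03:59', '2022-11-12 09:04:00',
                          '2022-11-12 09:04:01', '2022-11-12 09:04:02']))
df2 = pd.DataFrame(
    {'val_2': [11, 22, 33, 44]},
    index=pd.to_datetime(['2022-11-12 09:03:58', '2022-11-12 09:03:59',
                          '2022-11-12 09:03:59', '2022-11-12 09:04:00']))

df1_trimmed, df2_trimmed = trim_to_common_range(df1, df2)
print(df1_trimmed)
print(df2_trimmed)
```

If the two ranges do not overlap at all, the lower bound ends up later than the upper bound and both results are empty.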

Old answer:
Since you didn't provide usable data, here is my own example input data:

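import numpy as np
import pandas as pd

np.random.seed(42)
# df1 covers 08:00 to 09:30 in 10-minute steps
# ('10min' replaces the alias '10T', which is deprecated in recent pandas)
df1 = pd.DataFrame(
    {
        'A' : np.random.randint(0,10, size=10)
    },
    index= pd.date_range('2022-11-26 08:00', periods=10, freq='10min')
)

# df2 covers 08:30 to 10:00 in 10-minute steps
df2 = pd.DataFrame(
    {
        'B' : np.random.randint(0,10, size=10)
    },
    index= pd.date_range('2022-11-26 08:30', periods=10, freq='10min')
)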

which creates this data:

#df1
                     A
2022-11-26 08:00:00  6
2022-11-26 08:10:00  3
2022-11-26 08:20:00  7
2022-11-26 08:30:00  4
2022-11-26 08:40:00  6
2022-11-26 08:50:00  9
2022-11-26 09:00:00  2
2022-11-26 09:10:00  6
2022-11-26 09:20:00  7
2022-11-26 09:30:00  4

#df2
                     B
2022-11-26 08:30:00  3
2022-11-26 08:40:00  7
2022-11-26 08:50:00  7
2022-11-26 09:00:00  2
2022-11-26 09:10:00  5
2022-11-26 09:20:00  4
2022-11-26 09:30:00  1
2022-11-26 09:40:00  7
2022-11-26 09:50:00  5
2022-11-26 10:00:00  1

I think a decent approach would still be to merge the data to find out which edges are off. Just a suggestion: if you leave them merged, you can compare them directly like this:

combined = df1.merge(df2, how='inner', left_index=True, right_index=True)
combined['compare'] = np.where(combined['A']==combined['B'], 'hit', 'miss')
print(combined)

Output of combined:

                     A  B compare
2022-11-26 08:30:00  4  3    miss
2022-11-26 08:40:00  6  7    miss
2022-11-26 08:50:00  9  7    miss
2022-11-26 09:00:00  2  2     hit
2022-11-26 09:10:00  6  5    miss
2022-11-26 09:20:00  7  4    miss
2022-11-26 09:30:00  4  1    miss

If you really need them to stay separate, just add:

df1_new = combined[['A']]
df2_new = combined[['B']]
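To actually see which edges are off, rather than only dropping them, one option is an outer merge with `indicator=True`. This is a sketch on the same example data, not part of the original answer, and the names `outer` and `unmatched` are made up here. The extra `_merge` column marks each timestamp as `left_only`, `right_only` or `both`. Note that it relies on unique timestamps in each index, as in this example.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df1 = pd.DataFrame({'A': np.random.randint(0, 10, size=10)},
                   index=pd.date_range('2022-11-26 08:00', periods=10, freq='10min'))
df2 = pd.DataFrame({'B': np.random.randint(0, 10, size=10)},
                   index=pd.date_range('2022-11-26 08:30', periods=10, freq='10min'))

# The outer join keeps every timestamp; `_merge` says where each one came from.
outer = df1.merge(df2, how='outer', left_index=True, right_index=True,
                  indicator=True)

# Rows present in only one dataframe are the unmatched edges.
unmatched = outer[outer['_merge'] != 'both']
print(unmatched)
```

Here the three earliest timestamps exist only in df1 and the three latest only in df2, which are exactly the rows the inner merge discards.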
