Fastest way to create a time series dataframe

I want to take data that contains both time gaps and repeated times and build a time series from it, using the first occurrence of any given time and filling forward. Consider the following example.

Let's say this is the time range we are interested in:

Time
1:00
1:01
1:02
1:03
1:04
1:05

And this is the data, dataframe X, that we would like to put into our time series:

Occurance  Value
1:00       "R"
1:03       "G"
1:03       "L"
1:03       "P"
1:03       "T"
1:05       "S"

And this is the final dataframe:

Occurance  Value
1:00       "R"
1:01       "R"
1:02       "R"
1:03       "G"
1:04       "G"
1:05       "S"

As you can see, in the final dataframe 1:00 has the value "R" because that is the value of the first occurrence of 1:00 in dataframe X. 1:01 and 1:02 also have the value "R": there is no data for those times in dataframe X, so they take the last valid value (the value for 1:00). 1:03 has the value "G" because, as with 1:00, "G" is the first value recorded for 1:03 in dataframe X. Since there is no value for 1:04 in dataframe X, 1:04 gets the last valid value, "G". Lastly, 1:05 has the value "S", as that is the value of the first occurrence of 1:05 in dataframe X.

What is the quickest way to accomplish this?

merge_asof

See the Solution section below for the final version.

First, we need to convert those columns to pd.Timedelta:

df1['Time'] = pd.to_timedelta(df1['Time'] + ':00')
df2['Occurance'] = pd.to_timedelta(df2['Occurance'] + ':00')
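
The snippets in this answer assume that df1 and df2 already exist. A minimal sketch of how they might be built from the tables in the question follows. It assumes the times arrive as "H:MM" strings, and that df1 holds the time range and df2 holds dataframe X. The construction is a reconstruction for illustration, not something the original post shows.

```python
import pandas as pd

# Assumed reconstruction of the question's inputs, with times as "H:MM" strings.
df1 = pd.DataFrame({"Time": ["1:00", "1:01", "1:02", "1:03", "1:04", "1:05"]})
df2 = pd.DataFrame({
    "Occurance": ["1:00", "1:03", "1:03", "1:03", "1:03", "1:05"],
    "Value": ["R", "G", "L", "P", "T", "S"],
})

# Appending ":00" turns "1:03" into "1:03:00", which pandas parses as
# hours:minutes:seconds, i.e. 1 hour and 3 minutes.
df1["Time"] = pd.to_timedelta(df1["Time"] + ":00")
df2["Occurance"] = pd.to_timedelta(df2["Occurance"] + ":00")
```

Both key columns end up with the same timedelta dtype, which merge_asof needs in order to compare them.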

Then we can use merge_asof:

pd.merge_asof(df1, df2, left_on='Time', right_on='Occurance')

             Time       Occurance Value
0 0 days 01:00:00 0 days 01:00:00     R
1 0 days 01:01:00 0 days 01:00:00     R
2 0 days 01:02:00 0 days 01:00:00     R
3 0 days 01:03:00 0 days 01:03:00     T
4 0 days 01:04:00 0 days 01:03:00     T
5 0 days 01:05:00 0 days 01:05:00     S

There are a couple of things wrong with this:

  1. There are more columns than the OP specified.
  2. I have Ts instead of Gs.

OK, to get rid of the extra column, we just rename one of the columns instead of using left_on / right_on:

pd.merge_asof(df1.set_axis(['Occurance'], axis=1), df2)

        Occurance Value
0 0 days 01:00:00     R
1 0 days 01:01:00     R
2 0 days 01:02:00     R
3 0 days 01:03:00     T
4 0 days 01:04:00     T
5 0 days 01:05:00     S

But we still have Ts instead of Gs. To see why, look at df2:

        Occurance Value
0 0 days 01:00:00     R
1 0 days 01:03:00     G  # same Occurance
2 0 days 01:03:00     L  # same Occurance
3 0 days 01:03:00     P  # same Occurance
4 0 days 01:03:00     T  # same Occurance
5 0 days 01:05:00     S

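With its default backward search, merge_asof matches each left row to the last right row whose key is less than or equal to the left key, so among the duplicated 1:03 rows pandas took the last one, while the OP wanted the first. So let's try again with drop_duplicates, which by default keeps the first row for each Occurance.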
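
The duplicate-key behaviour can be checked in isolation. The sketch below uses two small made-up frames, left and right (names chosen here for illustration), that contain only the repeated 1:03 rows, and passes on="Occurance" explicitly.

```python
import pandas as pd

# Right-hand frame with four rows sharing the same key, in the question's order.
right = pd.DataFrame({
    "Occurance": pd.to_timedelta(["1:03:00", "1:03:00", "1:03:00", "1:03:00"]),
    "Value": ["G", "L", "P", "T"],
})
left = pd.DataFrame({"Occurance": pd.to_timedelta(["1:03:00", "1:04:00"])})

# Default direction="backward": the last right row with key <= left key is used,
# so the last of the duplicated rows is picked.
with_dupes = pd.merge_asof(left, right, on="Occurance")

# drop_duplicates defaults to keep="first", leaving only the first row per key.
deduped = pd.merge_asof(left, right.drop_duplicates("Occurance"), on="Occurance")

print(with_dupes["Value"].tolist())  # ['T', 'T']
print(deduped["Value"].tolist())     # ['G', 'G']
```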

Solution

pd.merge_asof(df1.set_axis(['Occurance'], axis=1),
              df2.drop_duplicates('Occurance'))

        Occurance Value
0 0 days 01:00:00     R
1 0 days 01:01:00     R
2 0 days 01:02:00     R
3 0 days 01:03:00     G
4 0 days 01:04:00     G
5 0 days 01:05:00     S
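
For comparison, the same "first occurrence, then fill forward" result can be sketched without merge_asof by reindexing with method="ffill". This is an alternative technique, not the approach from the answer above. The helper name first_then_ffill and its parameters are hypothetical, and no claim is made about which variant is faster; that would need timing on real data.

```python
import pandas as pd

def first_then_ffill(times, data, key="Occurance"):
    """Hypothetical helper: keep the first row per key, then forward-fill onto `times`."""
    # Keep the first row per key (original row order), then index by the key.
    firsts = data.drop_duplicates(key).set_index(key).sort_index()
    # reindex with method="ffill" requires a unique, monotonic index,
    # which `firsts` has after de-duplicating and sorting.
    out = firsts.reindex(times, method="ffill")
    return out.rename_axis(key).reset_index()

times = pd.to_timedelta(
    ["1:00:00", "1:01:00", "1:02:00", "1:03:00", "1:04:00", "1:05:00"]
)
data = pd.DataFrame({
    "Occurance": pd.to_timedelta(
        ["1:00:00", "1:03:00", "1:03:00", "1:03:00", "1:03:00", "1:05:00"]
    ),
    "Value": ["R", "G", "L", "P", "T", "S"],
})

result = first_then_ffill(times, data)
print(result["Value"].tolist())  # ['R', 'R', 'R', 'G', 'G', 'S']
```

Like the default merge_asof, this leaves any time earlier than the first Occurance as NaN, because there is no previous value to carry forward.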
