简体   繁体   English

以自定义间隔重新采样 Pandas 时间序列并获得一年内的间隔号

[英]Resample Pandas time series at custom interval and get interval number within a year

Context:语境:

I have a data frame similar to this, except that it extends over decades of data:我有一个与此类似的数据框,但它扩展了数十年的数据:

df = pd.DataFrame({'time':['2003-02-02', '2003-02-03', '2003-02-04', '2003-02-05', '2003-02-06', '2003-02-07', '2003-02-08', '2003-02-09','2003-02-10', '2003-02-11'], 'NDVI': [0.505413, 0.504566, 0.503682, 0.502759, 0.501796, 0.500791, 0.499743, 0.498651, 0.497514, 0.496332]})
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d')
df.set_index('time', inplace=True)
Output: 输出:
 NDVI time 2003-02-02 0.505413 2003-02-03 0.504566 2003-02-04 0.503682 2003-02-05 0.502759 2003-02-06 0.501796 2003-02-07 0.500791 2003-02-08 0.499743 2003-02-09 0.498651 2003-02-10 0.497514 2003-02-11 0.496332

Problem:问题:

I would like to:我想要:

  1. Get the mean NDVI value at a custom time interval that starts from the beginning of every year.获取从每年年初开始的自定义时间间隔的平均NDVI值。 If the interval is eg 10 days, values will be binned as [Jan-1 : Jan-10], [Jan-11 : Jan-20] etc. The last interval of the year will have to be either a 5- or 6-day interval depending on being a leap year (ie 360th-365/6th day of the year).如果间隔是例如 10 天,值将被分类为 [Jan-1 : Jan-10]、[Jan-11 : Jan-20] 等。一年的最后一个间隔必须是 5 或 6 -天间隔取决于闰年(即一年中的第 360-365/6 天)。
  2. Add a column for the corresponding interval number, so the output would be something similar to this:为相应的区间编号添加一列,因此输出将类似于以下内容:
 NDVI yr_interval time 2003-01-31 0.505413 4 2003-02-10 0.497514 5

In the above example, the first line represents the 4th 10-day interval of year 2003.在上面的示例中,第一行表示 2003 年的第 4 个 10 天间隔。

Question:题:

How to implement that, knowing that:如何实施,知道:

  • For time series spanning several years, the interval number should restart at every year (a similar behaviour to pandas.Series.dt.week )?对于跨越几年的时间序列,间隔数应该每年重新开始(类似于pandas.Series.dt.week行为)?
  • That the code should be flexible enough to test other time intervals (eg 8 days)?代码应该足够灵活以测试其他时间间隔(例如 8 天)?

How about trying pandas.Series.dt.dayofyear and divide that result by the interval you want?试试pandas.Series.dt.dayofyear并将结果除以你想要的间隔怎么样? This would be equivalent to pandas.Series.dt.week if you used 7 as your interval.如果您使用 7 作为间隔,这将等同于pandas.Series.dt.week

The proof is left as an exercise for the reader.证明留给读者作为练习。

This should work.这应该有效。 I don't know if there is any other efficient way to do this, but here is what it looks like.我不知道是否有任何其他有效的方法可以做到这一点,但这是它的样子。 Here I have grouped the the dates by year and then group each year group by the time interval of 10 days and get the index of each 10 day group within the year group - so that the index resets for every year group在这里,我按年份对日期进行分组,然后按 10 天的时间间隔对每个年份组进行分组,并获取年份组中每个 10 天组的索引 - 这样每个年份组的索引都会重置

Sample data frame - 'time' col has date values from 01/01/2005 to 12/31/2007 and 'NDVI' col has some random values.示例数据框 - 'time' col 具有从 01/01/2005 到 12/31/2007 的日期值,'NDVI' col 具有一些随机值。

np.random.seed(123)
df = pd.DataFrame({'time': pd.date_range('2005-01-01', '2007-12-31', freq='1D'),
                   'NDVI': np.random.randn(1095)})
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d')
df.set_index('time', inplace=True)

The function assign_group_interval takes each group (as df) and groups them by the time interval of 10 days and gets the yr_interval value from the index of the grouped df.函数assign_group_interval将每个组(作为df)按10 天的时间间隔分组,并从分组的df 的索引中获取yr_interval值。 This function will be applied to each year group so that the identifier resets for each year group.此功能将应用于每个年份组,以便为​​每个年份组重置标识符。

def assign_group_interval(year_grp):
    year_grp['period'] = year_grp.groupby(pd.Grouper(freq='10D'), as_index=False)\
        .apply(lambda x: x['NDVI'])\
        .index.get_level_values(0)
    return year_grp

Group the df by year and apply the function assign_group_interval按年份对df进行分组并应用函数assign_group_interval

print(df.groupby(df.index.year).apply(assign_group_interval).reset_index())

Output:输出:

           time      NDVI  period
0    2005-01-01 -1.085631       0
1    2005-01-02  0.997345       0
2    2005-01-03  0.282978       0
3    2005-01-04 -1.506295       0
4    2005-01-05 -0.578600       0
5    2005-01-06  1.651437       0
6    2005-01-07 -2.426679       0
7    2005-01-08 -0.428913       0
8    2005-01-09  1.265936       0
9    2005-01-10 -0.866740       0
10   2005-01-11 -0.678886       1
11   2005-01-12 -0.094709       1
12   2005-01-13  1.491390       1
13   2005-01-14 -0.638902       1
14   2005-01-15 -0.443982       1
15   2005-01-16 -0.434351       1
16   2005-01-17  2.205930       1
17   2005-01-18  2.186786       1
18   2005-01-19  1.004054       1
19   2005-01-20  0.386186       1
20   2005-01-21  0.737369       2
21   2005-01-22  1.490732       2
22   2005-01-23 -0.935834       2
23   2005-01-24  1.175829       2
24   2005-01-25 -1.253881       2
25   2005-01-26 -0.637752       2
...
...
455  2006-04-01  0.104061       9
456  2006-04-02  0.165957       9
457  2006-04-03  1.601908       9
458  2006-04-04  0.058687       9
459  2006-04-05  1.064423       9
460  2006-04-06 -0.039329       9
461  2006-04-07  1.448904       9
462  2006-04-08 -1.870397       9
463  2006-04-09 -0.598732       9
464  2006-04-10  0.983033       9
465  2006-04-11 -0.171596      10
466  2006-04-12  0.931530      10
467  2006-04-13  0.385066      10
468  2006-04-14  0.945877      10
469  2006-04-15  0.613068      10
470  2006-04-16  0.673649      10
471  2006-04-17  1.492455      10
472  2006-04-18  0.986474      10
473  2006-04-19  0.993807      10
474  2006-04-20  0.020419      10
475  2006-04-21 -0.581850      11
476  2006-04-22 -0.659560      11
477  2006-04-23  0.750945      11
478  2006-04-24 -2.438461      11
479  2006-04-25 -1.307178      11
480  2006-04-26 -0.963254      11
...
...
830  2007-04-11 -0.482365      10
831  2007-04-12  1.079796      10
832  2007-04-13 -0.421079      10
833  2007-04-14 -1.166471      10
834  2007-04-15  0.856555      10
835  2007-04-16 -0.017391      10
836  2007-04-17  1.448577      10
837  2007-04-18  0.892200      10
838  2007-04-19 -0.229427      10
839  2007-04-20 -0.449668      10
840  2007-04-21  0.023372      11
841  2007-04-22  0.190210      11
842  2007-04-23 -0.881749      11
843  2007-04-24  0.841940      11
844  2007-04-25 -0.397363      11
845  2007-04-26 -0.423028      11
846  2007-04-27 -0.540688      11
847  2007-04-28  0.231017      11
848  2007-04-29 -0.692053      11
849  2007-04-30  0.134970      11
850  2007-05-01  2.766603      12
851  2007-05-02 -0.053609      12
852  2007-05-03 -0.434005      12
853  2007-05-04 -1.667689      12
854  2007-05-05  0.050222      12
855  2007-05-06 -1.109231      12
...
...
1070 2007-12-07  0.923065      34
1071 2007-12-08 -0.822000      34
1072 2007-12-09  1.607085      34
1073 2007-12-10  0.737825      34
1074 2007-12-11 -0.403760      34
1075 2007-12-12 -2.114548      34
1076 2007-12-13 -0.000311      34
1077 2007-12-14 -1.181809      34
1078 2007-12-15  0.299635      34
1079 2007-12-16  1.451169      34
1080 2007-12-17  0.160060      35
1081 2007-12-18 -0.178013      35
1082 2007-12-19  0.342205      35
1083 2007-12-20  0.285650      35
1084 2007-12-21 -2.362864      35
1085 2007-12-22  0.240937      35
1086 2007-12-23  0.620277      35
1087 2007-12-24 -0.259342      35
1088 2007-12-25  0.978559      35
1089 2007-12-26 -0.127675      35
1090 2007-12-27  0.766999      36
1091 2007-12-28  2.273105      36
1092 2007-12-29 -0.096391      36
1093 2007-12-30 -1.942132      36
1094 2007-12-31 -0.336592      36

@cwalvoort's solution is better @cwalvoort 的解决方案更好

df['yr_interval'] = df.index.dayofyear // 10

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM