
Calculate month-over-month and year-over-year change for vintage data

I have a dataframe of an economic series whose values can be revised every month. Each revision adds a new value for a given date, indexed by realtime_start (see the dataframe below). realtime_start is the date from which the value for date becomes valid. A value expires as soon as another one takes its place.

date        realtime_start  value
2020-11-01  2020-12-04      142629.0
2020-11-01  2021-01-08      142764.0
2020-11-01  2021-02-05      142809.0
2020-12-01  2021-01-08      142624.0
2020-12-01  2021-02-05      142582.0
2020-12-01  2021-03-05      142503.0
2021-01-01  2021-02-05      142631.0
2021-01-01  2021-03-05      142669.0
2021-01-01  2021-04-02      142736.0
2021-02-01  2021-03-05      143048.0
2021-02-01  2021-04-02      143204.0
2021-03-01  2021-04-02      144120.0
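To experiment with the sample above, the table can be loaded into pandas roughly as follows. This is only a sketch: the variable names CSV and df are chosen here for illustration, and the real data presumably comes from another source with more history.

```python
import io

import pandas as pd

# Sample vintage data copied from the table above.
CSV = """date,realtime_start,value
2020-11-01,2020-12-04,142629.0
2020-11-01,2021-01-08,142764.0
2020-11-01,2021-02-05,142809.0
2020-12-01,2021-01-08,142624.0
2020-12-01,2021-02-05,142582.0
2020-12-01,2021-03-05,142503.0
2021-01-01,2021-02-05,142631.0
2021-01-01,2021-03-05,142669.0
2021-01-01,2021-04-02,142736.0
2021-02-01,2021-03-05,143048.0
2021-02-01,2021-04-02,143204.0
2021-03-01,2021-04-02,144120.0
"""

# Parse both date columns as datetimes so they can be compared and sorted.
df = pd.read_csv(io.StringIO(CSV), parse_dates=["date", "realtime_start"])
```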

I would like an easy way to calculate the month-over-month change in value based on the last known entry at date.

Calculation method: take the first release for month n (based on realtime_start) and subtract the relevant release for month n-1. The relevant release is the most recent release for month n-1 whose realtime_start does not exceed that of the first release for month n.
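Written out as a plain loop, the rule could look like the sketch below. The function name mom_change_reference is hypothetical, and the code assumes that "month n-1" is simply the previous date present in the data (no missing months). It is meant as a readable reference, not as an efficient implementation.

```python
import pandas as pd

def mom_change_reference(df):
    """Loop-based version of the rule; df has columns date, realtime_start, value."""
    df = df.sort_values(["date", "realtime_start"])
    # First release of each date, one row per date.
    first = df.groupby("date", as_index=False).first()
    out = []
    prev_date = None
    for row in first.itertuples(index=False):
        change = float("nan")
        if prev_date is not None:
            # Releases of the previous date already published at this first release.
            known = df[(df["date"] == prev_date)
                       & (df["realtime_start"] <= row.realtime_start)]
            if not known.empty:
                change = row.value - known["value"].iloc[-1]
        out.append((row.date, change))
        prev_date = row.date
    return pd.DataFrame(out, columns=["date", "MoM change"])

# Sample data from the question.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2020-11-01"] * 3 + ["2020-12-01"] * 3
                           + ["2021-01-01"] * 3 + ["2021-02-01"] * 2
                           + ["2021-03-01"]),
    "realtime_start": pd.to_datetime([
        "2020-12-04", "2021-01-08", "2021-02-05",
        "2021-01-08", "2021-02-05", "2021-03-05",
        "2021-02-05", "2021-03-05", "2021-04-02",
        "2021-03-05", "2021-04-02", "2021-04-02"]),
    "value": [142629.0, 142764.0, 142809.0, 142624.0, 142582.0, 142503.0,
              142631.0, 142669.0, 142736.0, 143048.0, 143204.0, 144120.0],
})
result = mom_change_reference(sample)
```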

See the desired output below:

date        MoM change
2020-11-01  NaN
2020-12-01  -140
2021-01-01  49
2021-02-01  379
2021-03-01  916

For 2021-03-01, the MoM change value is 144120.0 - 143204.0 = 916.0
For 2021-02-01, the MoM change value is 143048.0 - 142669.0 = 379.0
For 2021-01-01, the MoM change value is 142631.0 - 142582.0 = 49.0

Similarly, I would like to calculate the year-over-year change based on the last known values at date (the actual dataframe extends further into the past). I would also like to calculate the 3-month rolling average of the month-over-month change based on the last known values at date.


Solution

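# Index by observation date; rows must already be sorted by date.
df = df.set_index('date')

# First release of each date.
first = df.groupby(level=0).first()
# True where a release was already published when the next date's first release came out.
m = df['realtime_start'].le(first['realtime_start'].shift(-1))
# Latest such release per date, shifted down one row to line up with the following date.
last_val = df['value'].mask(~m).groupby(level=0).last().shift()

mom_change = (first['value'] - last_val).reset_index(name='MoM change')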

Explanations

Set the index of the dataframe to the date column, then group the dataframe on level=0 and aggregate using first to select the first row for each unique date

>>> first
           realtime_start     value
date                               
2020-11-01     2020-12-04  142629.0
2020-12-01     2021-01-08  142624.0
2021-01-01     2021-02-05  142631.0
2021-02-01     2021-03-05  143048.0
2021-03-01     2021-04-02  144120.0

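Shift the realtime_start column of the first dataframe up by one row (shift(-1)), then compare it with the realtime_start column of df to create a boolean mask m. The comparison is aligned on the date index, so every row of df is compared with the first release of the following date; the last date has no following date and therefore compares as False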

>>> m

date
2020-11-01     True
2020-11-01     True
2020-11-01    False
2020-12-01     True
2020-12-01     True
2020-12-01    False
2021-01-01     True
2021-01-01     True
2021-01-01    False
2021-02-01     True
2021-02-01     True
2021-03-01    False
Name: realtime_start, dtype: bool
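The comparison above relies on pandas aligning a Series with repeated index labels against one with unique labels. A sketch that makes the broadcast explicit is shown below, on the first two dates only. The names first_release, cutoff and cutoff_per_row are introduced here for illustration, and reindex is a swapped-in technique rather than what the answer itself uses.

```python
import pandas as pd

# First two dates of the sample, indexed by date as in the solution.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2020-11-01"] * 3 + ["2020-12-01"] * 3),
        "realtime_start": pd.to_datetime(
            ["2020-12-04", "2021-01-08", "2021-02-05",
             "2021-01-08", "2021-02-05", "2021-03-05"]
        ),
    }
).set_index("date")

first_release = df.groupby(level=0)["realtime_start"].first()
# Cutoff for each date = first release of the following date (NaT for the last date).
cutoff = first_release.shift(-1)
# Repeat the cutoff once per row of df; reindexing a unique index onto a
# target with repeated labels is allowed.
cutoff_per_row = cutoff.reindex(df.index)
# Compare on plain arrays; any comparison against NaT evaluates to False.
m = pd.Series(
    df["realtime_start"].to_numpy() <= cutoff_per_row.to_numpy(),
    index=df.index,
)
```

In this reduced sample the second date is the last one, so all of its rows come out False, just as the 2021-03-01 row does in the full output above.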

Now mask the values in the value column using the above boolean mask, so that releases published too late become NaN. Then group this masked column on level=0 and aggregate using last, which skips NaN, to select the last remaining value for each unique date. Finally, shift the result down by one row so that each value lines up with the following date

>>> last_val

date
2020-11-01         NaN
2020-12-01    142764.0
2021-01-01    142582.0
2021-02-01    142669.0
2021-03-01    143204.0
Name: value, dtype: float64
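The mask, last, shift chain can be tried on toy data to see each step in isolation. The labels and numbers below are made up purely for illustration.

```python
import pandas as pd

# Toy series with repeated labels, standing in for df['value'] indexed by date.
values = pd.Series([10.0, 11.0, 12.0, 20.0, 21.0],
                   index=["a", "a", "a", "b", "b"])
# Toy mask, standing in for m: True marks rows that are allowed to be used.
keep = pd.Series([True, True, False, False, False], index=values.index)

masked = values.mask(~keep)              # rows where keep is False become NaN
latest = masked.groupby(level=0).last()  # last non-NaN value per label
shifted = latest.shift()                 # move each value down to the next label
```

Here latest is 11.0 for "a" and NaN for "b" (no row of "b" is kept), and after the shift the 11.0 sits on "b", which is how the previous date's value ends up next to the current date.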

Subtract the calculated last_val column from the value column of the first dataframe to calculate the MoM change

>>> mom_change

        date  MoM change
0 2020-11-01         NaN
1 2020-12-01      -140.0
2 2021-01-01        49.0
3 2021-02-01       379.0
4 2021-03-01       916.0

PS: The dataframe must be sorted on the date column (and, within each date, on realtime_start) for this solution to work properly
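The question also asks for a year-over-year change and a 3-month rolling average, which the answer does not cover. One possible extension is sketched below. The function vintage_change and its periods parameter are hypothetical names, the code sorts the data itself, and it assumes one group of rows per month with no missing months, since the lag is applied by row position. The rolling average shown is simply the mean of the last three first-release MoM changes, which is only one reading of "based on last known values".

```python
import pandas as pd

def vintage_change(df, periods=1):
    """First release of each date minus the latest release, already published
    at that time, of the date `periods` rows earlier."""
    df = df.sort_values(["date", "realtime_start"]).set_index("date")
    first = df.groupby(level=0).first()
    # Cutoff per row: first release of the date `periods` rows later.
    cutoff = first["realtime_start"].shift(-periods).reindex(df.index)
    known = df["realtime_start"].to_numpy() <= cutoff.to_numpy()
    base = df["value"].where(known).groupby(level=0).last().shift(periods)
    return first["value"] - base

# Sample data from the question.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-11-01"] * 3 + ["2020-12-01"] * 3
                           + ["2021-01-01"] * 3 + ["2021-02-01"] * 2
                           + ["2021-03-01"]),
    "realtime_start": pd.to_datetime([
        "2020-12-04", "2021-01-08", "2021-02-05",
        "2021-01-08", "2021-02-05", "2021-03-05",
        "2021-02-05", "2021-03-05", "2021-04-02",
        "2021-03-05", "2021-04-02", "2021-04-02"]),
    "value": [142629.0, 142764.0, 142809.0, 142624.0, 142582.0, 142503.0,
              142631.0, 142669.0, 142736.0, 143048.0, 143204.0, 144120.0],
})

mom = vintage_change(df, periods=1)    # month-over-month change
mom_3m_avg = mom.rolling(3).mean()     # 3-month rolling average of MoM
yoy = vintage_change(df, periods=12)   # all NaN here: the sample has only 5 months
```

With periods=1 this gives the same MoM change values as the solution above; on the full history, periods=12 would give the year-over-year change under the same assumptions.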
