Pandas cumulative sum based on column value
I have a data frame like this:
Each row has a point ID, a survey date, and a precipitation value for each day of 2018 (365 additional columns, one per day).
|id|survey_date|2018/01/01|2018/01/02|...|2018/12/30|2018/12/31|
|--|-----------|----------|----------|---|----------|----------|
|01| 2018/06/06| 6| | | 2| |
|02| 2018/05/25| 1| 3| | 2| 6|
|03| 2018/06/06| 4| 1| | | 1|
|xx| 2018/06/06| 6| | | 2| |
Data frame:
0 1 2 3 4 5 6 7 8 9 \
0 01/08/18 18 45763046 0.7 2.0 7.5 2.3 1.3 0.0 0.000000
1 31/05/18 3 31902138 0.0 0.0 0.0 0.0 14.8 25.8 3.000000
2 11/05/18 2 34882144 1.4 0.0 0.0 0.0 0.0 15.6 4.900000
3 30/05/18 2 44322920 3.6 4.1 6.0 29.7 5.4 0.0 0.000000
4 29/08/18 2 31102104 0.0 0.0 0.0 0.0 17.1 24.6 7.500000
...
358 359 360 361 362 363 364 365 366 367
0 ... 2.9 7.9 2.5 0.0 0.0 0.0 0.0 2.2 1.4 0.0
1 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 ... 5.0 33.1 0.0 0.0 0.0 0.0 0.0 10.1 1.7 0.0
4 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Column[0] -> survey date
Column[2] -> point_id
Columns[3-367] -> Dates from 01/01 to 31/12
Expected output:
Column [368] -> Sum of the values from survey_date back to 30 days before, for each row.
Example: if for point_id = 1 the survey date is 01/08/2018 (column order 217), I want to add to the last column the sum of the columns from 06/07/2018 (column[187]) to 01/08/2018 (column[217]).
I would like to have a new column with the cumulative precipitation value for the 30 days prior to the survey date for each id. To do this, I suppose I have to write a loop that reads the survey date in each row and sums the values of the columns from that date back to 30 days earlier, but I can't figure out how to do it. Any suggestions?
Thank you.
If I understood correctly, the primary key of your table (the combination that makes an observation unique) is id-survey_date. I'll start by creating a sample DataFrame:
import pandas as pd
import numpy as np
np.random.seed(40)
cols = pd.date_range(start="2018-01-01", end="2018-12-31")
ids = ["01", "02", "03", "04"]
survey_date = ["2018-06-06", "2018-06-06", "2018-05-25", "2018-06-06"]
df = pd.DataFrame({"id": ids, "survey_date": survey_date})
for col in cols:
    df[col.strftime("%Y-%m-%d")] = np.random.uniform(1, 10, 4)
Since I created the survey_date field from strings, I'll convert it to datetime:
df["survey_date"] = pd.to_datetime(df["survey_date"])
Now I'll reshape the DataFrame from wide to long format to make my life easier:
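df_new_format = df.melt(id_vars=["id", "survey_date"], var_name="reference_date", value_name="ppt")
# melt keeps the former column labels as strings; parse them so they can be compared with survey_date
df_new_format["reference_date"] = pd.to_datetime(df_new_format["reference_date"])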
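To make the reshaping step concrete, here is a minimal sketch on a toy frame. The name `wide`, the name `long`, and all values are made up for illustration; only the `melt` call mirrors the one above.

```python
import pandas as pd

# Hypothetical toy frame: two ids and three daily precipitation columns.
wide = pd.DataFrame({
    "id": ["01", "02"],
    "survey_date": pd.to_datetime(["2018-01-03", "2018-01-02"]),
    "2018-01-01": [1.0, 4.0],
    "2018-01-02": [2.0, 5.0],
    "2018-01-03": [3.0, 6.0],
})

# Wide -> long: one row per (id, survey_date, reference_date).
long = wide.melt(id_vars=["id", "survey_date"],
                 var_name="reference_date", value_name="ppt")

# The former column labels are plain strings, so parse them into
# datetimes before comparing them with survey_date.
long["reference_date"] = pd.to_datetime(long["reference_date"])

print(long.shape)  # (6, 4): 2 ids x 3 days, 4 columns
```

Each daily column becomes a set of rows, so the date that used to live in the column header is now an ordinary value that can be filtered on.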
Calling df_new_format.head() shows the reshaped DataFrame: one row per id, survey_date and reference_date, with the precipitation value in the ppt column.
Now that each row holds the reference_date and ppt for an id - survey_date combination, we can use groupby with a custom function to get the sum over the desired interval.
Let's first create the custom function, which takes a grouped DataFrame and the number of days as inputs (in case we want to use a number other than 30):
def sum_30_day(df_group, days):
    # condition used to select the interval we want (both end dates are included)
    cond = df_group["reference_date"].between(df_group["survey_date"] - pd.Timedelta(days=days), df_group["survey_date"])
    return df_group.loc[cond, "ppt"].sum()  # sum of ppt within the selected interval
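One detail worth checking is the window length: `Series.between` includes both endpoints by default, so `days=30` covers the survey date plus the 30 days before it, 31 dates in total. The sketch below reproduces the same logic under a hypothetical name, `sum_window`, on an invented group with 1 mm of rain every day.

```python
import pandas as pd

def sum_window(df_group, days):
    # Same logic as sum_30_day above, under a hypothetical name.
    start = df_group["survey_date"] - pd.Timedelta(days=days)
    cond = df_group["reference_date"].between(start, df_group["survey_date"])
    return df_group.loc[cond, "ppt"].sum()

# Hypothetical group: 1 mm on every day of January 2018, surveyed on the 10th.
dates = pd.date_range("2018-01-01", "2018-01-31")
group = pd.DataFrame({
    "survey_date": pd.Timestamp("2018-01-10"),
    "reference_date": dates,
    "ppt": 1.0,
})

print(sum_window(group, 3))  # 4.0: Jan 7, 8, 9 and 10
```

If the survey date itself should not count, passing `inclusive="left"` to `between` would be one way to drop it.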
Now we can get the sum of ppt for each id - survey_date combination with:
# store the result under a new name so it does not overwrite the sum_30_day function
sum_ppt = df_new_format.groupby(["id", "survey_date"]).apply(sum_30_day, 30)  # this is a Series indexed by (id, survey_date)
# In case you want to merge the results with the original DataFrame
sum_ppt.name = "sum_ppt"
df = df.merge(sum_ppt, how="left", left_on=["id", "survey_date"], right_index=True)
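As an alternative sketch, the same result can be computed without a custom function by filtering the long frame once and then grouping. This is not the method above but a vectorized variant of it. It also does not depend on how `groupby.apply` treats the grouping columns, which I believe has changed across pandas versions. The sample data and the names `wide`, `long`, `daily` and `in_window` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data in the same layout as df above.
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", "2018-12-31")
wide = pd.DataFrame({
    "id": ["01", "02", "03"],
    "survey_date": pd.to_datetime(["2018-06-06", "2018-05-25", "2018-08-01"]),
})
daily = pd.DataFrame(rng.uniform(0, 10, (len(wide), len(dates))),
                     columns=dates.strftime("%Y-%m-%d"))
wide = pd.concat([wide, daily], axis=1)

# Reshape to long format and parse the former column labels as dates.
long = wide.melt(id_vars=["id", "survey_date"],
                 var_name="reference_date", value_name="ppt")
long["reference_date"] = pd.to_datetime(long["reference_date"])

# Row-wise mask: reference_date within [survey_date - days, survey_date].
days = 30
in_window = long["reference_date"].between(
    long["survey_date"] - pd.Timedelta(days=days), long["survey_date"])

# Sum only the rows inside the window, one total per (id, survey_date).
sum_ppt = (long[in_window]
           .groupby(["id", "survey_date"])["ppt"]
           .sum()
           .rename("sum_ppt")
           .reset_index())

# Attach the totals to the original wide frame.
wide = wide.merge(sum_ppt, on=["id", "survey_date"], how="left")
print(wide[["id", "survey_date", "sum_ppt"]])
```

Because the mask is computed once over the whole frame, this avoids calling a Python function per group, which may matter with many survey points.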