Pandas cumulative sum based on column values

Question

I have a DataFrame like this:

For each day of 2018 it holds the point ID, the survey date, and the precipitation value (365+ columns, one per day).

|id|survey_date|2018/01/01|2018/01/02|...|2018/12/30|2018/12/31|
|--|-----------|----------|----------|---|----------|----------|
|01| 2018/06/06|         6|          |   |         2|          |
|02| 2018/05/25|         1|         3|   |         2|         6|
|03| 2018/06/06|         4|         1|   |          |         1|
|xx| 2018/06/06|         6|          |   |         2|          |

DataFrame:

           0    1         2    3    4     5     6     7     8          9    \
0     01/08/18   18  45763046  0.7  2.0   7.5   2.3   1.3   0.0   0.000000   
1     31/05/18    3  31902138  0.0  0.0   0.0   0.0  14.8  25.8   3.000000   
2     11/05/18    2  34882144  1.4  0.0   0.0   0.0   0.0  15.6   4.900000   
3     30/05/18    2  44322920  3.6  4.1   6.0  29.7   5.4   0.0   0.000000   
4     29/08/18    2  31102104  0.0  0.0   0.0   0.0  17.1  24.6   7.500000  
...  
           358   359  360  361  362  363  364   365  366  367  
0     ...  2.9   7.9  2.5  0.0  0.0  0.0  0.0   2.2  1.4  0.0  
1     ...  0.0   0.0  0.0  0.0  0.0  0.0  0.0   0.0  0.0  0.0  
2     ...  0.0   0.0  0.0  0.0  0.0  0.0  0.0   0.0  0.0  0.0  
3     ...  5.0  33.1  0.0  0.0  0.0  0.0  0.0  10.1  1.7  0.0  
4     ...  0.0   0.0  0.0  0.0  0.0  0.0  0.0   0.0  0.0  0.0

Column[0] -> survey date

Column[2] -> point_id

Columns[3-367] -> Dates from 01/01 to 31/12

Expected output:

Column [368] -> Sum of the values from 30 days before survey_date up to survey_date, for each row.

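Example: if for point_id = 1 the survey date is 01/08/2018 (column order 217), I want the last column to hold the sum of the columns from 06/07/2018 (column[187]) to 01/08/2018 (column[217]).

I want a new column containing the cumulative precipitation over the 30 days before the survey date for each ID. I think I have to write a loop that reads the survey date of each row and sums the values of that date's column and the 30 columns before it, but I don't know how to do that. Any suggestions?

Thank you.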

Answer 1

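If I understand correctly, the primary key of your table (the combination that makes each observation unique) is id-survey_date. I will start by creating a DataFrame: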

import pandas as pd 
import numpy as np

np.random.seed(40) 
cols = pd.date_range(start="2018-01-01", end="2018-12-31")
ids = ["01", "02", "03", "04"]
survey_date = ["2018-06-06", "2018-06-06", "2018-05-25", "2018-06-06"]
df = pd.DataFrame({"id": ids, "survey_date": survey_date})
for col in cols:
  df[col.strftime("%Y-%m-%d")] = np.random.uniform(1, 10, 4)

Since I created the survey_date field from strings, I convert it to datetime with df["survey_date"] = pd.to_datetime(df["survey_date"]). Next I reshape the DataFrame a little to make life easier:

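df_new_format = df.melt(id_vars=["id", "survey_date"], var_name="reference_date", value_name="ppt")
df_new_format["reference_date"] = pd.to_datetime(df_new_format["reference_date"])  # melted column names are strings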

Calling df_new_format.head() now shows the DataFrame in long format, with one row per id, survey_date, and reference_date.
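
To illustrate the reshape, here is a minimal sketch on a toy table. The values and the names wide and long_df are made up for the example and are not the data above. melt turns each date column into rows, and the resulting reference_date values start out as strings, so the sketch converts them with pd.to_datetime.

```python
import pandas as pd

# Toy wide table with made-up values (not the data from the question)
wide = pd.DataFrame({
    "id": ["01", "02"],
    "survey_date": pd.to_datetime(["2018-01-02", "2018-01-03"]),
    "2018-01-01": [1.0, 4.0],
    "2018-01-02": [2.0, 5.0],
    "2018-01-03": [3.0, 6.0],
})

# One row per (id, survey_date, reference_date) instead of one column per day
long_df = wide.melt(id_vars=["id", "survey_date"], var_name="reference_date", value_name="ppt")

# The former column names are plain strings; make them real dates
long_df["reference_date"] = pd.to_datetime(long_df["reference_date"])

print(long_df)
```

The 2 x 3 block of daily values becomes 6 rows, which is what makes the date comparison in the next step possible.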

Now we group by id - survey_date and apply a custom function to ppt to get the sum over the desired interval.

Let's first create the custom function. Its inputs are the grouped DataFrame and the number of days (in case we want to use a number other than 30).

def sum_30_day(df_group, days):
  # condition used to select interval we want
  cond = (df_group["reference_date"].between(df_group["survey_date"]-pd.Timedelta(days=days), df_group["survey_date"]))
  return df_group.loc[cond, "ppt"].sum() # returning the sum of ppt in the right interval
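
One detail worth checking is the window length. Series.between includes both endpoints by default, so days=30 covers the survey date plus the 30 days before it, which is 31 calendar days. The sketch below repeats the function and runs it on a hand-made group (the frame named group is hypothetical) where every day has ppt = 1, so the sum equals the number of days selected.

```python
import pandas as pd

def sum_30_day(df_group, days):
    # Same condition as above: reference_date within [survey_date - days, survey_date]
    cond = df_group["reference_date"].between(
        df_group["survey_date"] - pd.Timedelta(days=days), df_group["survey_date"]
    )
    return df_group.loc[cond, "ppt"].sum()

# Hypothetical group: one survey date, 1.0 of precipitation on every day
group = pd.DataFrame({
    "survey_date": pd.Timestamp("2018-02-15"),
    "reference_date": pd.date_range("2018-01-01", "2018-02-28"),
    "ppt": 1.0,
})

total_31 = sum_30_day(group, 30)  # 2018-01-16 .. 2018-02-15, both ends included
total_30 = sum_30_day(group, 29)  # 2018-01-17 .. 2018-02-15
print(total_31, total_30)
```

If the intended window is exactly 30 days including the survey date, passing days=29 is one way to get it.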

Now we can get the sum of ppt for each id - survey_date:

# named sum_ppt so the result does not overwrite the sum_30_day function; this is a Series
sum_ppt = df_new_format.groupby(["id", "survey_date"]).apply(sum_30_day, 30)
# In case you want to merge results with the original DataFrame
sum_ppt.name = "sum_ppt"
df = df.merge(sum_ppt, how='left', left_on=["id", "survey_date"], right_index=True)
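
As a possible alternative, the same result can be sketched without groupby.apply: build the date condition on the whole long table first, then group and sum. This is not the method from the answer above, only a variation of it. It may be useful because, as far as I know, recent pandas versions deprecate passing the grouping columns to the function given to groupby.apply, and the custom function relies on survey_date being available. The data, the random seed, and the names long_df, in_window and result are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Made-up data: three points, one random precipitation value per day of 2018
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", "2018-12-31")
df = pd.DataFrame({
    "id": ["01", "02", "03"],
    "survey_date": pd.to_datetime(["2018-06-06", "2018-05-25", "2018-08-01"]),
})
values = pd.DataFrame(
    rng.uniform(0, 10, (len(df), len(dates))),
    columns=dates.strftime("%Y-%m-%d"),
)
df = pd.concat([df, values], axis=1)

# Long format with real dates in reference_date
long_df = df.melt(id_vars=["id", "survey_date"], var_name="reference_date", value_name="ppt")
long_df["reference_date"] = pd.to_datetime(long_df["reference_date"])

# Row-wise window test: survey_date - 30 days <= reference_date <= survey_date
start = long_df["survey_date"] - pd.Timedelta(days=30)
in_window = long_df["reference_date"].between(start, long_df["survey_date"])

# Sum only the rows inside each point's window
sum_ppt = long_df[in_window].groupby(["id", "survey_date"])["ppt"].sum().rename("sum_ppt")

# Attach the totals to the original wide table
result = df.merge(sum_ppt, how="left", left_on=["id", "survey_date"], right_index=True)
print(result[["id", "survey_date", "sum_ppt"]])
```

Filtering before grouping keeps the work in vectorized operations, which tends to matter when there are many points.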

Solution 1
0 Accepted 2022-03-28 17:16:38
