Pandas: cumulative sum based on a column value

Question

I have a dataframe like this:

A point ID, a survey date, and a precipitation value for each day of 2018 (365 more columns, one per day).

|id|survey_date|2018/01/01|2018/01/02|...|2018/12/30|2018/12/31|
|--|-----------|----------|----------|---|----------|----------|
|01| 2018/06/06|         6|          |   |         2|          |
|02| 2018/05/25|         1|         3|   |         2|         6|
|03| 2018/06/06|         4|         1|   |          |         1|
|xx| 2018/06/06|         6|          |   |         2|          |

The dataframe:

           0    1         2    3    4     5     6     7     8          9    \
0     01/08/18   18  45763046  0.7  2.0   7.5   2.3   1.3   0.0   0.000000   
1     31/05/18    3  31902138  0.0  0.0   0.0   0.0  14.8  25.8   3.000000   
2     11/05/18    2  34882144  1.4  0.0   0.0   0.0   0.0  15.6   4.900000   
3     30/05/18    2  44322920  3.6  4.1   6.0  29.7   5.4   0.0   0.000000   
4     29/08/18    2  31102104  0.0  0.0   0.0   0.0  17.1  24.6   7.500000  
...  
           358   359  360  361  362  363  364   365  366  367  
0     ...  2.9   7.9  2.5  0.0  0.0  0.0  0.0   2.2  1.4  0.0  
1     ...  0.0   0.0  0.0  0.0  0.0  0.0  0.0   0.0  0.0  0.0  
2     ...  0.0   0.0  0.0  0.0  0.0  0.0  0.0   0.0  0.0  0.0  
3     ...  5.0  33.1  0.0  0.0  0.0  0.0  0.0  10.1  1.7  0.0  
4     ...  0.0   0.0  0.0  0.0  0.0  0.0  0.0   0.0  0.0  0.0

Column[0] -> survey date

Column[2] -> point_id

Columns[3-367] -> Dates from 01/01 to 31/12

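Expected output:

Column[368] -> Sum of the values from 30 days before survey_date up to survey_date, in each row.

Example: if for point_id = 1 the survey date is 01/08/2018 (column order 217), I want the last column to hold the sum of the columns from 06/07/2018 (column[187]) to 01/08/2018 (column[217]).

I want a new column that holds, for each ID, the accumulated precipitation over the 30 days before the survey date. For this, I think I have to write a loop that reads the value of the survey date field and adds up the column for that date and the 30 columns before it, but I don't know how to do it. Any suggestions?

Thank you.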

Answer 1

If I understand correctly, the primary key of your table (the combination that makes an observation unique) is id-survey_date. I will start by creating a DataFrame:

import pandas as pd 
import numpy as np

np.random.seed(40) 
cols = pd.date_range(start="2018-01-01", end="2018-12-31")
ids = ["01", "02", "03", "04"]
survey_date = ["2018-06-06", "2018-06-06", "2018-05-25", "2018-06-06"]
df = pd.DataFrame({"id": ids, "survey_date": survey_date})
for col in cols:
  df[col.strftime("%Y-%m-%d")] = np.random.uniform(1, 10, 4)

Since I created the survey_date field from strings, I will convert it to datetime with df["survey_date"] = pd.to_datetime(df["survey_date"]). Now I will change the format of the Dataframe a little to make my life easier:

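df_new_format = df.melt(id_vars=["id", "survey_date"], var_name="reference_date", value_name="ppt")
df_new_format["reference_date"] = pd.to_datetime(df_new_format["reference_date"])  # column labels were strings; dates are needed for the comparison below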

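With df_new_format.head() we can check the reshaped Dataframe: it has one row per id, survey_date and reference_date, with the precipitation value in ppt.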
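To make the reshaping step concrete, here is a minimal sketch of the same melt on a tiny frame. The two rows, the three day columns, and the names `wide` and `long` are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Hypothetical two-row, three-day frame (values invented for illustration)
wide = pd.DataFrame({
    "id": ["01", "02"],
    "survey_date": pd.to_datetime(["2018-01-03", "2018-01-02"]),
    "2018-01-01": [1.0, 4.0],
    "2018-01-02": [2.0, 5.0],
    "2018-01-03": [3.0, 6.0],
})

# One row per (id, survey_date, reference_date) with its precipitation value
long = wide.melt(id_vars=["id", "survey_date"], var_name="reference_date", value_name="ppt")

# The day columns were string labels, so convert them before comparing dates
long["reference_date"] = pd.to_datetime(long["reference_date"])

print(long)
```

Each of the two original rows becomes three rows, one per day column, which is what lets a date comparison between reference_date and survey_date replace positional column arithmetic.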

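Next, for each id - survey_date pair, we use groupby with a custom function to get the sum of ppt within the desired interval.

Let's first create the custom function. Its inputs are the grouped Dataframe and the number of days (in case we want a number other than 30):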

def sum_30_day(df_group, days):
  # condition used to select interval we want
  cond = (df_group["reference_date"].between(df_group["survey_date"]-pd.Timedelta(days=days), df_group["survey_date"]))
  return df_group.loc[cond, "ppt"].sum() # returning the sum of ppt in the right interval
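As a quick check of how the interval condition behaves, the sketch below applies the same logic to one hand-made group. The helper name `sum_window` and the data are hypothetical; the group assumes reference_date is already a datetime column.

```python
import pandas as pd

def sum_window(df_group, days):
    # Same idea as sum_30_day: keep rows whose reference_date lies in
    # [survey_date - days, survey_date], with both ends included
    start = df_group["survey_date"] - pd.Timedelta(days=days)
    cond = df_group["reference_date"].between(start, df_group["survey_date"])
    return df_group.loc[cond, "ppt"].sum()

# Hypothetical single group: one id, survey on 2018-01-05, 1.0 of rain every day
group = pd.DataFrame({
    "id": "01",
    "survey_date": pd.Timestamp("2018-01-05"),
    "reference_date": pd.date_range("2018-01-01", "2018-01-10"),
    "ppt": 1.0,
})

total = sum_window(group, 2)  # covers 2018-01-03, 2018-01-04 and 2018-01-05
print(total)  # 3.0
```

Because between includes both endpoints, days=2 covers three calendar days, and days=30 covers 31 (the survey date plus the 30 days before it). If exactly 30 days including the survey date are wanted, passing 29 would be one way to get that.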

Now we can get the sum of ppt for each id - survey_date:

# Store the result under a new name so the function sum_30_day is not overwritten
sum_ppt = df_new_format.groupby(["id", "survey_date"]).apply(sum_30_day, 30)  # this is a Series
# In case you want to merge the results with the original Dataframe
sum_ppt.name = "sum_ppt"
df = df.merge(sum_ppt, how="left", left_on=["id", "survey_date"], right_index=True)
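As an alternative to the approach above (a sketch, not the answer's original method), the interval condition can be evaluated once on the whole long-format frame and the matching rows summed per group, which avoids calling a Python function for every group. It also avoids reading the grouping column survey_date inside apply, which newer pandas versions warn about, as far as I know. The data and the names `values`, `long`, `in_window` and `result` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data in the same layout as the answer's df (values are random)
rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-01", "2018-12-31")
df = pd.DataFrame({
    "id": ["01", "02"],
    "survey_date": pd.to_datetime(["2018-06-06", "2018-05-25"]),
})
values = pd.DataFrame(rng.uniform(1, 10, (2, len(dates))), columns=dates.strftime("%Y-%m-%d"))
df = pd.concat([df, values], axis=1)

# Wide to long, then turn the former column labels into real dates
long = df.melt(id_vars=["id", "survey_date"], var_name="reference_date", value_name="ppt")
long["reference_date"] = pd.to_datetime(long["reference_date"])

# Row-wise mask over the whole frame: is this day inside the row's own window?
days = 30
in_window = long["reference_date"].between(
    long["survey_date"] - pd.Timedelta(days=days), long["survey_date"]
)

# Sum only the rows inside the window, one total per id - survey_date
sum_ppt = long[in_window].groupby(["id", "survey_date"])["ppt"].sum().rename("sum_ppt")

result = df.merge(sum_ppt, how="left", left_on=["id", "survey_date"], right_index=True)
print(result[["id", "survey_date", "sum_ppt"]])
```

One difference to keep in mind: a survey date with no rows inside the window gets NaN after the left merge here, while the groupby with a custom function returns 0 for it.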
