Sum and groupby if date is between two dates in two other columns and create new groupby data frame - pandas
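I have the data frame shown below.
What I need is to sum the pageviews values for each "Title" and create two new columns, PT+3 and PT+30.
So my final table would look like this: Post ID - Published Date - Title - Permalink - Categories - Author Name - Total Page Views (the sum of pageviews without any filter) - country - PT+3 - PT+30
Thanks...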
   Post ID Published Date                          Title  \
0   824821     2022-05-10  Tom Brady's net worth in 2022
1   824821     2022-05-10  Tom Brady's net worth in 2022
2   824821     2022-05-10  Tom Brady's net worth in 2022

                                           Permalink  \
0  https://clutchpoints.com/tom-bradys-net-worth-...
1  https://clutchpoints.com/tom-bradys-net-worth-...
2  https://clutchpoints.com/tom-bradys-net-worth-...

                                Categories  Author Name         T+3        T+30  \
0  Editorials|Evergreen|NFL|NFL Editorials  Greg Patuto  2022-05-13  2022-06-09
1  Editorials|Evergreen|NFL|NFL Editorials  Greg Patuto  2022-05-13  2022-06-09
2  Editorials|Evergreen|NFL|NFL Editorials  Greg Patuto  2022-05-13  2022-06-09

     country  pageviews        date
0  Australia         24  2022-05-26
1      India         24  2022-05-24
2      India         12  2022-05-26
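OK, I doubt this is the best way to do it, but this is how I solved a similar problem.
Note: you have to convert the date columns to a datetime type before you can compare them. This may also fix the error the other commenter ran into.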
import pandas as pd

df['Published Date'] = pd.to_datetime(df['Published Date']).dt.date
df['date'] = pd.to_datetime(df['date']).dt.date
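As an aside, the comparison works just as well if the columns stay as datetime64 instead of being turned into Python date objects. A minimal sketch, assuming the dates arrive as strings (for example after reading a CSV); the two-row frame and the helper columns days_since_pub and within_3_days are made up for illustration and are not part of the question:

```python
import pandas as pd

# Hypothetical two-row frame with dates stored as plain strings.
df = pd.DataFrame({
    "Published Date": ["2022-05-10", "2022-05-10"],
    "date": ["2022-05-12", "2022-05-26"],
})

# Convert both columns to datetime64 so comparisons and date arithmetic work.
df["Published Date"] = pd.to_datetime(df["Published Date"])
df["date"] = pd.to_datetime(df["date"])

# Whole days elapsed between publication and the pageview date.
df["days_since_pub"] = (df["date"] - df["Published Date"]).dt.days

# Flag rows whose date is no later than three days after publication.
df["within_3_days"] = df["date"] <= df["Published Date"] + pd.Timedelta(days=3)
```

Keeping datetime64 means pd.Timedelta can be added to a whole column at once, which is handy if you later want to avoid looping.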
First, I created a dictionary in the shape of the output dataframe:
aggregate_df = {'Post Id':[],'Published Date':[],'Title':[],'Permalink':[],'Categories':[],'Author Name':[],'Total Page Views':[],'PT+3':[],'PT+30':[]}
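Then I loop over each unique title in the Title column and filter the dataframe for that title. I append each value to the output dictionary (most of them use .max(), but you could also use [0], for example; it doesn't matter which value you pick because they are all the same within a title, apart from the total page views, where you want the sum).
You can then filter the temp df further so that it only contains dates inside the range you want to calculate, and append those sums to the output dictionary.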
import datetime as dt

for title in df['Title'].unique():
    # All rows for this title; the descriptive columns are identical within it.
    _df = df.loc[df['Title'] == title]
    published = _df['Published Date'].max()
    aggregate_df['Post Id'].append(_df['Post ID'].max())
    aggregate_df['Published Date'].append(published)
    aggregate_df['Title'].append(title)
    aggregate_df['Permalink'].append(_df['Permalink'].max())
    aggregate_df['Categories'].append(_df['Categories'].max())
    aggregate_df['Author Name'].append(_df['Author Name'].max())
    aggregate_df['Total Page Views'].append(_df['pageviews'].sum())
    # PT+3: page views from the publication date through day 3 (inclusive).
    in_window = (_df['date'] >= published) & (_df['date'] <= published + dt.timedelta(days=3))
    aggregate_df['PT+3'].append(_df.loc[in_window, 'pageviews'].sum())
    # PT+30: page views from day 3 through day 30 (inclusive), as in the original loop.
    in_window = (_df['date'] >= published + dt.timedelta(days=3)) & (_df['date'] <= published + dt.timedelta(days=30))
    aggregate_df['PT+30'].append(_df.loc[in_window, 'pageviews'].sum())

# Turn the dictionary of lists into the final one-row-per-title dataframe.
aggregate_df = pd.DataFrame(aggregate_df)
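For what it's worth, the same table can probably be built without an explicit loop. The following is only a sketch, assuming the column names from the question and the same inclusive windows as the loop (day 0 to 3 for PT+3, day 3 to 30 for PT+30); the helper columns pt3 and pt30 are names invented here, and the sample frame only carries the columns the calculation needs:

```python
import pandas as pd

# Cut-down copy of the sample data from the question.
df = pd.DataFrame({
    "Title": ["Tom Brady's net worth in 2022"] * 3,
    "Published Date": pd.to_datetime(["2022-05-10"] * 3),
    "pageviews": [24, 24, 12],
    "date": pd.to_datetime(["2022-05-26", "2022-05-24", "2022-05-26"]),
})

# Whole days between publication and each pageview row.
days = (df["date"] - df["Published Date"]).dt.days

summary = (
    df.assign(
        # Keep the pageviews only where the row falls inside the window, else 0.
        pt3=df["pageviews"].where(days.between(0, 3), 0),
        pt30=df["pageviews"].where(days.between(3, 30), 0),
    )
    .groupby("Title", as_index=False)
    .agg(**{
        "Published Date": ("Published Date", "first"),
        "Total Page Views": ("pageviews", "sum"),
        "PT+3": ("pt3", "sum"),
        "PT+30": ("pt30", "sum"),
    })
)
```

The remaining descriptive columns (Permalink, Categories, Author Name and so on) could be added to the agg call with "first" in the same way.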
IIUC, try groupby and sum:
df["Views3"] = df["date"].le(df["T+3"]).mul(df["pageviews"]).groupby(df["Title"]).transform("sum")
df["Views30"] = df["date"].le(df["T+30"]).mul(df["pageviews"]).groupby(df["Title"]).transform("sum")
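Note that transform writes the per-title sum onto every row rather than collapsing the frame. If one row per title is wanted, as the question seems to ask, something along these lines might work. It is a sketch under assumptions: the sample frame is a cut-down copy of the question's data, the Total column is an addition made here, and country is dropped because it differs between rows of the same title:

```python
import pandas as pd

# Cut-down copy of the sample data from the question.
df = pd.DataFrame({
    "Post ID": [824821] * 3,
    "Title": ["Tom Brady's net worth in 2022"] * 3,
    "T+3": pd.to_datetime(["2022-05-13"] * 3),
    "T+30": pd.to_datetime(["2022-06-09"] * 3),
    "country": ["Australia", "India", "India"],
    "pageviews": [24, 24, 12],
    "date": pd.to_datetime(["2022-05-26", "2022-05-24", "2022-05-26"]),
})

# Same expressions as in the answer: zero out rows after the cutoff, then sum per title.
df["Views3"] = df["date"].le(df["T+3"]).mul(df["pageviews"]).groupby(df["Title"]).transform("sum")
df["Views30"] = df["date"].le(df["T+30"]).mul(df["pageviews"]).groupby(df["Title"]).transform("sum")

# Unfiltered total per title (not part of the original answer).
df["Total"] = df.groupby("Title")["pageviews"].transform("sum")

# Drop the row-level columns, then collapse the now-identical rows.
per_title = (
    df.drop(columns=["country", "pageviews", "date"])
    .drop_duplicates()
    .reset_index(drop=True)
)
```

With the three sample rows, none of the dates falls on or before T+3, so Views3 comes out as zero while Views30 and Total both cover all three rows.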