I have the following data frame:
What I need is to sum the values of pageviews of each "Title" and create two new columns:
So my final table will be like this: Post ID - Published Date - Title - Permalink - Categories - Author Name - Total Page Views (Which is the sum of pageviews without any filters) - Country - PT+3 - PT+30
thanks...
Post ID Published Date Title \
0 824821 2022-05-10 Tom Brady's net worth in 2022
1 824821 2022-05-10 Tom Brady's net worth in 2022
2 824821 2022-05-10 Tom Brady's net worth in 2022
Permalink \
0 https://clutchpoints.com/tom-bradys-net-worth-...
1 https://clutchpoints.com/tom-bradys-net-worth-...
2 https://clutchpoints.com/tom-bradys-net-worth-...
Categories Author Name T+3 T+30 \
0 Editorials|Evergreen|NFL|NFL Editorials Greg Patuto 2022-05-13 2022-06-09
1 Editorials|Evergreen|NFL|NFL Editorials Greg Patuto 2022-05-13 2022-06-09
2 Editorials|Evergreen|NFL|NFL Editorials Greg Patuto 2022-05-13 2022-06-09
country pageviews date
0 Australia 24 2022-05-26
1 India 24 2022-05-24
2 India 12 2022-05-26
Ok, so I doubt this is the optimal way to do it, but this is how I've solved a similar problem.
Note: You will have to convert the date columns to datetime type to do comparisons. This might solve the error from the other commenter
df['Published Date'] = pd.to_datetime(df['Published Date']).apply(lambda x: x.date())
df['date'] = pd.to_datetime(df['date']).apply(lambda x: x.date())
First, I created a dictionary for the output dataframe format:
aggregate_df = {'Post Id':[],'Published Date':[],'Title':[],'Permalink':[],'Categories':[],'Author Name':[],'Total Page Views':[],'PT+3':[],'PT+30':[]}
I then looped through every unique title in the title column, and filtered the dataframe for each title. I then appended each value to the output dictionary (it is.max() for most of them, but you could also use [0] for example, it doesn't matter which value you choose because they are identical - outside of total page views which you want the sum).
You can then further filter the temp df to only show the dates in the range you want to count, and append those sums to the output dictionary.
for title in df['Title'].unique():
_df = df.loc[(df['Title'] == title)]
aggregate_df['Post Id'].append(_df['Post_Id'].max())
aggregate_df['Published Date'].append(_df['Published Date'].max())
aggregate_df['Title'].append(_df['Title'].max())
aggregate_df['Permalink'].append(_df['Permalink'].max())
aggregate_df['Categories'].append(_df['Categories'].max())
aggregate_df['Author Name'].append(_df['Author Name'].max())
aggregate_df['Total Page Views'].append(_df['Page Views'].sum())
start_period = _df['Published Date'].max()
end_period = _df['Published Date'].max() + dt.timedelta(days=3)
_df = df.loc[(df['Title'] == title) & (df['date'] >= start_period)& (df['date'] <= end_period)]
aggregate_df['PT+3'].append(_df['Page Views'].sum())
start_period = _df['Published Date'].max() + dt.timedelta(days=3)
end_period = _df['Published Date'].max() + dt.timedelta(days=30)
_df = df.loc[(df['Title'] == title) & (df['date'] >= start_period) & (df['date'] <= end_period)]
aggregate_df['PT+30'].append(_df['Page Views'].sum())
aggregate_df = pd.DataFrame(aggregate_df)
IIUC, try:
groupby
and sum
df["Views3"] = df["date"].le(df["T+3"]).mul(df["pageviews"]).groupby(df["Title"]).transform("sum")
df["Views30"] = df["date"].le(df["T+30"]).mul(df["pageviews"]).groupby(df["Title"]).transform("sum")
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.