I am reading an Excel file (exported to CSV) with about 300k rows into a pandas DataFrame. I then group it into about 18,000 groups using groupby. For each group I loop over every month in the date range, filter the group's rows to that month, and sum the amounts. The whole process takes about 60 minutes. Is there any way to optimize this? The code is as follows:
import datetime
import pandas as pd

qgift_dl = pd.read_csv(file, encoding='latin1')  # read csv file
qgift_dl['user_id'] = qgift_dl['user_id'].astype(str)
qgift_dl['Gift Date'] = pd.to_datetime(qgift_dl['Gift Date'])
min_date = qgift_dl['Gift Date'].min()
today = datetime.datetime.today()
qgift_dates = get_date_range(min_date, today)  # get all dates between
q_grouped = qgift_dl.groupby(['user_id'])
details = []
for group in q_grouped:
    d_rows = group[1]
    d_row_data = [group[0]]  # add donor id
    for dt in qgift_dates:
        # sum the amounts that fall inside the current month
        lower = dt.strftime('%Y-%m-01')
        upper = dt.strftime('%Y-%m-%d')
        filtered = d_rows[(d_rows['Gift Date'] >= lower) & (d_rows['Gift Date'] <= upper)]
        d_row_data.append(filtered['Amount'].sum())
    details.append(d_row_data)
Below is the get_date_range function. It returns all dates between two dates, stepping one month at a time. In my case the range is '2008-04-30' to '2020-05-30'.
from dateutil.relativedelta import relativedelta
import datetime, calendar

def get_date_range(start, end):
    result = []
    while start <= end:
        result.append(start)
        start += relativedelta(months=1)
    return result
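As an aside, pandas has a built-in that can produce a comparable monthly sequence. A minimal sketch is below; note that freq='MS' yields month-start timestamps rather than preserving the original day of month, so it is not an exact drop-in for the loop above.
import pandas as pd

# Month-start timestamps covering the same span; a possible stand-in for
# the hand-rolled get_date_range helper above.
qgift_dates = pd.date_range('2008-04-30', '2020-05-30', freq='MS')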
Sample Excel data is as follows; link to sample file: https://docs.google.com/spreadsheets/d/1YeH35w0rqVoHukGTSDtISlztdZAiDYsmfLWVia2x1U0/edit?usp=sharing
From the expected result, you want the total amount per user and per month. The pandas tools for this are groupby and sum, plus unstack if you want the dates to be the columns:
result = df.groupby(
    ['user_id',
     pd.to_datetime(df['Gift Date'], dayfirst=True) + pd.offsets.Day() - pd.offsets.MonthBegin()]
)[['Amount']].sum().unstack()
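The resulting frame has a two-level ('Amount', month) column index and NaN where a user made no gift in a given month. An optional cleanup sketch, assuming the result variable produced by the snippet above:
# Fill the months with no gifts and flatten the column MultiIndex to plain month stamps.
result = result.fillna(0)
result.columns = result.columns.droplevel(0)
result = result.reset_index()  # bring user_id back as an ordinary column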