Given the following df:
datetimeindex store sale category weekday
2018-10-13 09:27:01 gbn01 59.99 sporting 1
2018-10-13 09:27:01 gbn02 19.99 sporting 1
2018-10-13 09:27:02 gbn03 15.99 hygine 1
2018-10-13 09:27:03 gbn05 39.99 camping 1
....
2018-10-16 11:59:01 gbn01 19.99 other 0
2018-10-16 11:59:01 gbn02 49.99 sporting 0
2018-10-16 11:59:02 gbn03 10.00 food 0
2018-10-16 11:59:03 gbn05 89.99 electro 0
2018-10-16 12:30:03 gbn01 52.99
....
2018-10-16 21:05:03 gbn03 25.00 alcohol 0
2018-10-16 22:43:03 gbn01 10.05 health 0
Update
After re-reading the requirements, it looks like mean_sales should be calculated at each timestamp for that store within the period (08:00 to 18:00 or 12:00 to 13:30), i.e. as a running mean. My current thinking is to implement the pseudocode below, but it would only work if the data were ordered by datetimeindex, store:
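# Lunch_Time_Mean (pseudocode)
count = 0
Lunch_Sum_Previous = 0
for r in df:
    if LunchHours and WeekDay:
        count += 1
        if count == 1:
            r.Lunch_Mean = r.sale
            Lunch_Sum_Previous = r.sale
        elif count > 1:
            # parentheses matter: divide the running total, not just r.sale
            r.Lunch_Mean = (Lunch_Sum_Previous + r.sale) / count
            Lunch_Sum_Previous += r.sale
    else:
        # outside the lunch period: default value and reset the running totals
        r.Lunch_Mean = 1
        count = 0
        Lunch_Sum_Previous = 0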
Above Logic mapped to a table:
datetimeindex store IsWorkingHour count sales working_hour_sum working_hour_cumsum working_hour_mean_sales
13/10/2018 07:27 gbn01 0 0 39.18 0 0 1
13/10/2018 08:27 gbn01 1 1 31.69 31.69 31.69 1
13/10/2018 09:27 gbn01 1 2 99.19 99.19 130.88 1
13/10/2018 10:27 gbn01 1 3 25.89 25.89 156.77 1
13/10/2018 11:27 gbn01 1 4 19.10 19.10 175.87 1
13/10/2018 12:27 gbn01 1 5 82.51 82.51 258.38 1
13/10/2018 13:27 gbn01 1 6 10.82 10.82 269.2 1
13/10/2018 14:27 gbn01 1 7 10.43 10.43 279.63 1
13/10/2018 15:27 gbn01 1 8 15.83 15.83 295.46 1
13/10/2018 16:27 gbn01 1 9 12.53 12.53 307.99 1
13/10/2018 17:27 gbn01 1 10 10.03 10.03 318.02 1
13/10/2018 18:27 gbn01 0 0 54.14 0 0 1
13/10/2018 19:27 gbn01 0 0 20.04 0 0 1
#Above entries have working_hour_mean_sales of 1 because 13/10/2018 is on a weekend.
16/10/2018 07:27 gbn01 0 0 13.34 0 0 1
16/10/2018 08:27 gbn01 1 1 15.84 15.84 15.84 15.84
16/10/2018 09:27 gbn01 1 2 19.14 19.14 34.98 17.49
16/10/2018 10:27 gbn01 1 3 11.64 11.64 46.62 15.54
16/10/2018 11:27 gbn01 1 4 17.54 17.54 64.16 16.04
16/10/2018 12:27 gbn01 1 5 20.84 20.84 85 17
16/10/2018 13:27 gbn01 1 6 50.05 50.05 135.05 22.51
16/10/2018 14:27 gbn01 1 7 10.05 10.05 145.1 20.73
16/10/2018 15:27 gbn01 1 8 13.35 13.35 158.45 19.81
16/10/2018 16:27 gbn01 1 9 32.55 32.55 191 21.22
16/10/2018 17:27 gbn01 1 10 13.36 13.36 204.36 20.44
16/10/2018 18:27 gbn01 0 0 10.86 0 0 1
16/10/2018 19:27 gbn01 0 0 20.06 0 0 1
I'm attempting to use the above to generate a new df that looks like the below:
#I've simplified it to a single condition and store
datetimeindex store working_hour_mean_sales
13/10/2018 07:27 gbn01 1
13/10/2018 08:27 gbn01 1
13/10/2018 09:27 gbn01 1
13/10/2018 10:27 gbn01 1
13/10/2018 11:27 gbn01 1
13/10/2018 12:27 gbn01 1
13/10/2018 13:27 gbn01 1
13/10/2018 14:27 gbn01 1
13/10/2018 15:27 gbn01 1
13/10/2018 16:27 gbn01 1
13/10/2018 17:27 gbn01 1
13/10/2018 18:27 gbn01 1
13/10/2018 19:27 gbn01 1
#Above working_hour_mean_sales=1 because 13/10/2018 was a weekend
16/10/2018 07:27 gbn01 1
16/10/2018 08:27 gbn01 15.84
16/10/2018 09:27 gbn01 17.49
16/10/2018 10:27 gbn01 15.54
16/10/2018 11:27 gbn01 16.04
16/10/2018 12:27 gbn01 17
16/10/2018 13:27 gbn01 22.51
16/10/2018 14:27 gbn01 20.73
16/10/2018 15:27 gbn01 19.81
16/10/2018 16:27 gbn01 21.22
16/10/2018 17:27 gbn01 20.44
16/10/2018 18:27 gbn01 1
16/10/2018 19:27 gbn01 1
Where "working hours" are 08:00-18:00 Mon-Fri and "weekday lunch peak" is 12:00-13:30.
(NB: it wasn't my decision, counter-intuitive at least to me, that weekday=0 means Mon-Fri.)
Any suggestions on how to implement this in pandas would be greatly appreciated!
You can use groupby(), agg() and between().
This will aggregate the results for weekday lunch peaks Mon-Fri:
df[(df['datetimeindex'].dt.strftime('%H:%M:%S').between('12:00:00','13:30:00')) & (df['weekday']==0)].groupby(['store','category']).agg({'sale': 'mean'})
And this will aggregate the results for working hours Mon-Fri:
df[(df['datetimeindex'].dt.strftime('%H:%M:%S').between('08:00:00','18:00:00')) & (df['weekday']==0)].groupby(['store','category']).agg({'sale': 'mean'})
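The two filters above return one mean per store and category. For the per-row running mean shown in the question's table, a minimal sketch might look like the following. It assumes the running mean restarts for each store and calendar day, and that rows outside the window get the default value 1, as in the question. The helper names (`in_hours`, `keys`, `cum_sum`, `cum_cnt`) are illustrative, and the sample rows are a few gbn01 entries copied from the question's table.

```python
import pandas as pd

# A few gbn01 rows for 16/10/2018, copied from the question's table.
df = pd.DataFrame({
    "datetimeindex": pd.to_datetime([
        "2018-10-16 07:27", "2018-10-16 08:27", "2018-10-16 09:27",
        "2018-10-16 10:27", "2018-10-16 18:27",
    ]),
    "store": ["gbn01"] * 5,
    "sale": [13.34, 15.84, 19.14, 11.64, 10.86],
    "weekday": [0] * 5,  # 0 means Mon-Fri in the question's encoding
})

# The running mean depends on row order, so sort explicitly.
df = df.sort_values(["store", "datetimeindex"]).reset_index(drop=True)

# True for rows inside working hours (08:00-18:00) on a Mon-Fri day.
hhmmss = df["datetimeindex"].dt.strftime("%H:%M:%S")
in_hours = hhmmss.between("08:00:00", "18:00:00") & (df["weekday"] == 0)

# Assumption: totals restart for every store and calendar day.
keys = [df["store"], df["datetimeindex"].dt.normalize()]

# Rows outside the window contribute 0 to the sum and 0 to the count.
cum_sum = df["sale"].where(in_hours, 0.0).groupby(keys).cumsum()
cum_cnt = in_hours.astype(int).groupby(keys).cumsum()

# Running mean inside the window, default value 1 everywhere else.
df["working_hour_mean_sales"] = (cum_sum / cum_cnt).where(in_hours, 1).round(2)

print(df[["datetimeindex", "store", "working_hour_mean_sales"]])
```

Swapping the two time strings for '12:00:00' and '13:30:00' would give the lunch-peak variant under the same assumptions.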
Try separating your data into batches and then summing everything you need for each batch. At the end, join the results, divide by the number of entries and put the results in the columns you need.
You can batch the data in a number of ways, but going by your example, I suggest grouping it by category, calculating everything for each category and then joining it all in the final table.
I hope this helps you :)
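One way to read the batching suggestion above, as a rough sketch: keep a partial sum and count per category for each batch, join the partial results, and only divide at the end. The batch size of two and the names `batches`, `partials` and `totals` are assumptions made for illustration; the sale values are taken from the question's sample.

```python
import pandas as pd

# Illustrative rows using sale values from the question's sample data.
sales = pd.DataFrame({
    "category": ["sporting", "sporting", "food", "sporting", "food"],
    "sale": [59.99, 19.99, 10.00, 49.99, 25.00],
})

# Hypothetical batching: consecutive chunks of two rows.
batches = [sales.iloc[i:i + 2] for i in range(0, len(sales), 2)]

# Partial sum and count per category within each batch.
partials = [b.groupby("category")["sale"].agg(["sum", "count"]) for b in batches]

# Join the partial results, then divide total sum by total count.
totals = pd.concat(partials).groupby(level=0).sum()
totals["mean_sale"] = totals["sum"] / totals["count"]

print(totals)
```

Dividing only after the partial sums and counts are combined matters: averaging the per-batch means directly would weight small and large batches equally.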
This should guide you toward the logic you need. Basically, you define two new columns, workinghours and weekdaylunchpeak, and use SQL (the sqlcode string below) to aggregate (there are other methods).
import pandas as pd
import pandasql as ps
import numpy as np
from datetime import time

mydata = pd.DataFrame(data={
    'datetimeindex': ['13/10/2018 09:27:01', '13/10/2018 09:27:02', '13/10/2018 09:27:03', '13/10/2018 09:27:04',
                      '16/10/2018 11:59:01', '16/10/2018 11:59:02', '16/10/2018 11:59:03', '16/10/2018 11:59:04',
                      '16/10/2018 21:05:01', '16/10/2018 22:43:01'],
    'store': ['gbn01', 'gbn02', 'gbn03', 'gbn05', 'gbn01', 'gbn02', 'gbn03', 'gbn05', 'gbn03', 'gbn01'],
    'sale': [59.99, 19.99, 15.99, 39.99, 19.99, 49.99, 10, 89.99, 25, 10.05],
    'category': ['sporting', 'sporting', 'hygine', 'camping', 'other', 'sporting', 'food', 'electro', 'alcohol', 'health'],
    'weekday': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
})
# The dates are day-first (dd/mm/yyyy)
mydata['datetimeindex'] = pd.to_datetime(mydata['datetimeindex'], dayfirst=True)

# 1 when the sale falls within working hours (08:00-18:00) on Mon-Fri (weekday == 0)
mydata['workinghours'] = np.where(
    (mydata.datetimeindex.dt.time >= time(8, 0))
    & (mydata.datetimeindex.dt.time <= time(18, 0))
    & (mydata.weekday == 0),
    1, 0)

# 1 when the sale falls within the weekday lunch peak (12:00-13:30)
mydata['weekdaylunchpeak'] = np.where(
    (mydata.datetimeindex.dt.time >= time(12, 0))
    & (mydata.datetimeindex.dt.time <= time(13, 30))
    & (mydata.weekday == 0),
    1, 0)

sqlcode = '''
SELECT
    store,
    category,
    -- no ELSE branch: rows outside the window become NULL, which AVG ignores
    -- (ELSE 0 would pull the mean down with zeros)
    avg(case when workinghours=1 then sale end) AS working_hours_mean_sales,
    avg(case when weekdaylunchpeak=1 then sale end) AS weekday_lunch_peak_mean_sales
FROM mydata
GROUP BY
    store,
    category
;
'''

newdf = ps.sqldf(sqlcode, locals())
newdf
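As one of the "other methods" mentioned above, the same aggregation can be sketched in pandas alone, without pandasql. This is only an illustration under a few assumptions: it groups by store only to keep the sample small, the three rows are taken from the question's data, and the names `working` and `lunch` are made up for the example.

```python
import pandas as pd

# Three gbn01 rows taken from the question's sample (day-first dates).
mydata = pd.DataFrame({
    "datetimeindex": pd.to_datetime(
        ["16/10/2018 11:59:01", "16/10/2018 12:30:03", "16/10/2018 22:43:01"],
        format="%d/%m/%Y %H:%M:%S",
    ),
    "store": ["gbn01", "gbn01", "gbn01"],
    "sale": [19.99, 52.99, 10.05],
    "weekday": [0, 0, 0],  # 0 means Mon-Fri in the question's encoding
})

# Boolean masks for the two windows.
t = mydata["datetimeindex"].dt.strftime("%H:%M:%S")
is_weekday = mydata["weekday"] == 0
working = t.between("08:00:00", "18:00:00") & is_weekday
lunch = t.between("12:00:00", "13:30:00") & is_weekday

# where() leaves NaN outside each window, and mean() skips NaN,
# so only the rows inside a window contribute to its average.
cols = ["working_hours_mean_sales", "weekday_lunch_peak_mean_sales"]
newdf = (
    mydata.assign(
        working_hours_mean_sales=mydata["sale"].where(working),
        weekday_lunch_peak_mean_sales=mydata["sale"].where(lunch),
    )
    .groupby("store")[cols]
    .mean()
    .reset_index()
)

print(newdf)
```

A store with no rows inside a window ends up with NaN in that column rather than 0, which keeps "no sales in the window" distinguishable from a real mean.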