Efficient way to create calculated column for Pandas DataFrame

Given the following df:

datetimeindex        store  sale   category  weekday
2018-10-13 09:27:01  gbn01  59.99  sporting  1
2018-10-13 09:27:01  gbn02  19.99  sporting  1
2018-10-13 09:27:02  gbn03  15.99  hygine    1
2018-10-13 09:27:03  gbn05  39.99  camping   1
....
2018-10-16 11:59:01  gbn01  19.99  other     0
2018-10-16 11:59:01  gbn02  49.99  sporting  0
2018-10-16 11:59:02  gbn03  10.00  food      0
2018-10-16 11:59:03  gbn05  89.99  electro   0
2018-10-16 12:30:03  gbn01  52.99
....
2018-10-16 21:05:03  gbn03  25.00  alcohol   0
2018-10-16 22:43:03  gbn01  10.05  health    0

Update

After re-reading the requirements, it looks like mean_sales should be calculated for each specific timestamp, for that store, within the relevant period (08:00 to 18:00 or 12:00 to 13:00). My current thinking is to implement the pseudocode below, but it would only work if the data were ordered by datetimeindex, store:

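# Lunch_Time_Mean (pseudocode; LunchHours and WeekDay stand for the row's conditions)
count = 0
Lunch_Sum_Previous = 0
for r in df:
    if LunchHours and WeekDay:
        count += 1
        if count == 1:
            r.Lunch_Mean = r.sale
            Lunch_Sum_Previous = r.sale
        elif count > 1:
            # parentheses matter: divide the running sum, not just r.sale
            r.Lunch_Mean = (Lunch_Sum_Previous + r.sale) / count
            Lunch_Sum_Previous += r.sale
    else:
        r.Lunch_Mean = 1
        count = 0
        Lunch_Sum_Previous = 0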
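
A vectorized sketch of the same running-mean idea may be useful here. It is not a definitive implementation: the function name `running_mean_in_window`, the sample rows, the inclusive window bounds, and the choice to reset the mean per store and per calendar day are all assumptions made for illustration. It sorts the data itself, so it does not rely on the input order.

```python
import pandas as pd

# Illustrative sample rows (hypothetical); weekday == 0 means Mon-Fri.
df = pd.DataFrame({
    "datetimeindex": pd.to_datetime([
        "2018-10-13 08:27", "2018-10-13 12:27", "2018-10-16 07:27",
        "2018-10-16 08:27", "2018-10-16 09:27", "2018-10-16 12:27",
        "2018-10-16 13:27", "2018-10-16 18:27", "2018-10-16 09:00",
    ]),
    "store": ["gbn01"] * 8 + ["gbn02"],
    "sale": [31.69, 82.51, 13.34, 15.84, 19.14, 20.84, 50.05, 10.86, 40.0],
    "weekday": [1, 1, 0, 0, 0, 0, 0, 0, 0],
})


def running_mean_in_window(df, start, end, default=1.0):
    """Running mean of `sale` per store and day, inside [start, end] on weekdays.

    Rows outside the window (or on weekends) get `default`, as in the pseudocode.
    """
    df = df.sort_values(["store", "datetimeindex"])
    hhmm = df["datetimeindex"].dt.strftime("%H:%M")
    in_window = hhmm.between(start, end) & df["weekday"].eq(0)
    day = df["datetimeindex"].dt.normalize()
    keys = [df["store"], day]
    # Sales outside the window are zeroed so they do not enter the running sum.
    running_sum = df["sale"].where(in_window, 0.0).groupby(keys).cumsum()
    running_count = in_window.astype(int).groupby(keys).cumsum()
    # 0/0 yields NaN for rows before the window opens; those get `default`.
    return (running_sum / running_count).where(in_window, default)


# Assignment aligns on the original index, so the row order of df is unchanged.
df["working_hour_mean_sales"] = running_mean_in_window(df, "08:00", "18:00")
df["lunch_mean_sales"] = running_mean_in_window(df, "12:00", "13:30")
print(df)
```

Grouping by store and day makes the counter restart each morning, which plays the role of the `count=0` reset in the loop.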

The above logic mapped to a table:

datetimeindex       store    IsWorkingHour    count    sales    working_hour_sum    working_hour_cumsum    working_hour_mean_sales
13/10/2018 07:27    gbn01    0                0        39.18    0                   0                      1
13/10/2018 08:27    gbn01    1                1        31.69    31.69               31.69                  1
13/10/2018 09:27    gbn01    1                2        99.19    99.19               130.88                 1
13/10/2018 10:27    gbn01    1                3        25.89    25.89               156.77                 1
13/10/2018 11:27    gbn01    1                4        19.10    19.10               175.87                 1
13/10/2018 12:27    gbn01    1                5        82.51    82.51               258.38                 1
13/10/2018 13:27    gbn01    1                6        10.82    10.82               269.2                  1
13/10/2018 14:27    gbn01    1                7        10.43    10.43               279.63                 1
13/10/2018 15:27    gbn01    1                8        15.83    15.83               295.46                 1
13/10/2018 16:27    gbn01    1                9        12.53    12.53               307.99                 1
13/10/2018 17:27    gbn01    1                10       10.03    10.03               318.02                 1
13/10/2018 18:27    gbn01    0                0        54.14    0                   0                      1
13/10/2018 19:27    gbn01    0                0        20.04    0                   0                      1
#Above entries have weekday_mean_sales of 1 because 13/10/2018 is on a weekend.
16/10/2018 07:27    gbn01    0                0        13.34    0                   0                      1
16/10/2018 08:27    gbn01    1                1        15.84    15.84               15.84                  15.84
16/10/2018 09:27    gbn01    1                2        19.14    19.14               34.98                  17.49
16/10/2018 10:27    gbn01    1                3        11.64    11.64               46.62                  15.54
16/10/2018 11:27    gbn01    1                4        17.54    17.54               64.16                  16.04
16/10/2018 12:27    gbn01    1                5        20.84    20.84               85                     17
16/10/2018 13:27    gbn01    1                6        50.05    50.05               135.05                 22.51
16/10/2018 14:27    gbn01    1                7        10.05    10.05               145.1                  20.73
16/10/2018 15:27    gbn01    1                8        13.35    13.35               158.45                 19.81
16/10/2018 16:27    gbn01    1                9        32.55    32.55               191                    21.22
16/10/2018 17:27    gbn01    1                10       13.36    13.36               204.36                 20.44
16/10/2018 18:27    gbn01    0                0        10.86    0                   0                      1
16/10/2018 19:27    gbn01    0                0        20.06    0                   0                      1

Desired Output

I'm attempting to use the above to generate a new df that looks like the below:

#I've simplified it to a single condition and store
datetimeindex       store    working_hour_mean_sales
13/10/2018 07:27    gbn01    1
13/10/2018 08:27    gbn01    1
13/10/2018 09:27    gbn01    1
13/10/2018 10:27    gbn01    1
13/10/2018 11:27    gbn01    1
13/10/2018 12:27    gbn01    1
13/10/2018 13:27    gbn01    1
13/10/2018 14:27    gbn01    1
13/10/2018 15:27    gbn01    1
13/10/2018 16:27    gbn01    1
13/10/2018 17:27    gbn01    1
13/10/2018 18:27    gbn01    1
13/10/2018 19:27    gbn01    1
#Above weekday_mean_sales=1 because 13/10/2018 was a weekend                         
16/10/2018 07:27    gbn01    1
16/10/2018 08:27    gbn01    15.84
16/10/2018 09:27    gbn01    17.49
16/10/2018 10:27    gbn01    15.54
16/10/2018 11:27    gbn01    16.04
16/10/2018 12:27    gbn01    17
16/10/2018 13:27    gbn01    22.51
16/10/2018 14:27    gbn01    20.73
16/10/2018 15:27    gbn01    19.81
16/10/2018 16:27    gbn01    21.22
16/10/2018 17:27    gbn01    20.44
16/10/2018 18:27    gbn01    1
16/10/2018 19:27    gbn01    1

Where "working hours" are 08:00-18:00 Mon-Fri and "weekday lunch peak" is 12:00-13:30.

(NB: the counter-intuitive (at least to me) decision that weekday=0 means Mon-Fri was not mine.)

Any suggestions on how to implement this in pandas would be greatly appreciated!

You can use groupby(), agg() and between().

This will aggregate the results for weekday lunch peaks (Mon-Fri):

df[(df['datetimeindex'].dt.strftime('%H:%M:%S').between('12:00:00','13:30:00')) & (df['weekday']==0)].groupby(['store','category']).agg({'sale': 'mean'})

And this will aggregate the results for working hours Mon-Fri:

df[(df['datetimeindex'].dt.strftime('%H:%M:%S').between('08:00:00','18:00:00')) & (df['weekday']==0)].groupby(['store','category']).agg({'sale': 'mean'})
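
For comparison, the two filters can also be sketched with `DataFrame.between_time`, which selects rows by time of day once the timestamps are the index. This is only an alternative sketch under assumptions: the helper name `mean_sales_between` and the sample rows are made up for illustration, and `between_time` includes both endpoints by default.

```python
import pandas as pd

# Hypothetical sample rows in the shape of the question's df.
df = pd.DataFrame({
    "datetimeindex": pd.to_datetime([
        "2018-10-16 11:59:01", "2018-10-16 12:30:03",
        "2018-10-16 12:45:00", "2018-10-13 12:15:00",
    ]),
    "store": ["gbn01", "gbn01", "gbn01", "gbn01"],
    "sale": [19.99, 52.99, 47.01, 59.99],
    "category": ["other", "sporting", "sporting", "sporting"],
    "weekday": [0, 0, 0, 1],  # 0 means Mon-Fri
})


def mean_sales_between(df, start, end):
    """Mean sale per (store, category) for weekday rows between start and end."""
    indexed = df.set_index("datetimeindex")      # between_time needs a DatetimeIndex
    in_window = indexed.between_time(start, end)
    weekdays = in_window[in_window["weekday"] == 0]
    return weekdays.groupby(["store", "category"])["sale"].mean()


lunch_peak = mean_sales_between(df, "12:00", "13:30")
working_hours = mean_sales_between(df, "08:00", "18:00")
print(lunch_peak)
print(working_hours)
```

Both versions return one mean per store and category, not a per-row running mean.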

Try separating your data into batches and then summing everything you need for each batch. At the end, join the results, divide by the number of entries, and put the results in the columns you need.

You can batch the data in a number of ways, but going by your example, I suggest grouping it by category, calculating everything for each category, and then joining the results in the final table.
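
A minimal sketch of that batching idea, assuming one batch per category; the sample rows and the column names `total`, `entries` and `mean_sale` are hypothetical:

```python
import pandas as pd

# Hypothetical sample rows.
df = pd.DataFrame({
    "store": ["gbn01", "gbn02", "gbn03", "gbn02"],
    "sale": [59.99, 19.99, 15.99, 49.99],
    "category": ["sporting", "sporting", "hygine", "sporting"],
})

# One batch per category: sum and count first, then divide.
batches = df.groupby("category")["sale"].agg(total="sum", entries="count")
batches["mean_sale"] = batches["total"] / batches["entries"]

# Join the per-batch result back onto the original rows.
result = df.merge(
    batches[["mean_sale"]], left_on="category", right_index=True, how="left"
)
print(result)
```

Keeping the sum and the count separate means partial results from several batches can be added together before the final division.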

I hope this helps you :)

This should guide you toward the logic you need. Basically, you define two new columns, workinghours and weekdaylunchpeak, and use SQL (sqlcode below) to aggregate; there are other methods too.

import pandas as pd
import pandasql as ps
from datetime import time  # time(8, 0) etc. are used for the window bounds below
import numpy as np

mydata = pd.DataFrame(data={'datetimeindex': ['13/10/2018 09:27:01','13/10/2018 09:27:02','13/10/2018 09:27:03','13/10/2018 09:27:04','16/10/2018 11:59:01','16/10/2018 11:59:02','16/10/2018 11:59:03','16/10/2018 11:59:04','16/10/2018 21:05:01','16/10/2018 22:43:01'],
                       'store': ['gbn01','gbn02','gbn03','gbn05','gbn01','gbn02','gbn03','gbn05','gbn03','gbn01'],                        
                       'sale': [59.99,19.99,15.99,39.99,19.99,49.99,10,89.99,25,10.05],
                       'category': ['sporting','sporting','hygine','camping','other','sporting','food','electro','alcohol','health'],
                       'weekday': [1,1,1,1,0,0,0,0,0,0] 
                       })

# the strings are day-first (dd/mm/yyyy), so say so explicitly
mydata['datetimeindex'] = pd.to_datetime(mydata['datetimeindex'], dayfirst=True)
mydata['workinghours']=(
    np.where((mydata.datetimeindex.dt.time >= time(8,00))
             &
             (mydata.datetimeindex.dt.time<=time(18,00))
             &
             (mydata.weekday==0)
             , 1, 0))
mydata['weekdaylunchpeak']=(
    np.where((mydata.datetimeindex.dt.time >= time(12,00))
             &
             (mydata.datetimeindex.dt.time<=time(13,30))
             &
             (mydata.weekday==0)
             , 1, 0))

sqlcode = '''
SELECT 
    store,   
    category,
    -- no ELSE branch: rows outside the window become NULL, which avg() ignores
    avg(case when workinghours=1 then sale end) AS working_hours_mean_sales,
    avg(case when weekdaylunchpeak=1 then sale end) AS weekday_lunch_peak_mean_sales
FROM mydata
GROUP BY
    store,
    category
;
'''
newdf = ps.sqldf(sqlcode,locals()) 
newdf
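
As one of the "other methods", a similar per-store, per-category aggregation can be sketched in plain pandas without pandasql. The sample rows and the intermediate column names `working_hours_sale` and `lunch_peak_sale` are hypothetical. In this sketch, sales outside a window are masked to NaN so that they are left out of the mean, not counted as zero.

```python
import pandas as pd

# Hypothetical rows with the two flag columns already computed.
mydata = pd.DataFrame({
    "store": ["gbn01", "gbn01", "gbn02", "gbn02"],
    "category": ["other", "health", "sporting", "sporting"],
    "sale": [19.99, 10.05, 19.99, 49.99],
    "workinghours": [1, 0, 0, 1],
    "weekdaylunchpeak": [0, 0, 0, 1],
})

newdf = (
    mydata.assign(
        # Series.where keeps the sale where the flag is set, NaN elsewhere.
        working_hours_sale=mydata["sale"].where(mydata["workinghours"] == 1),
        lunch_peak_sale=mydata["sale"].where(mydata["weekdaylunchpeak"] == 1),
    )
    .groupby(["store", "category"], as_index=False)
    .agg(
        working_hours_mean_sales=("working_hours_sale", "mean"),
        weekday_lunch_peak_mean_sales=("lunch_peak_sale", "mean"),
    )
)
print(newdf)
```

Groups with no rows inside a window come out as NaN, which keeps "no sales in the window" distinct from a mean of zero.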
