
Pandas calculate based on multiple rows and conditions

I'm a novice with pandas. I need to calculate the time spent by each person at each location, and drop rows that have no matching pair in the Date column. My data looks like this:

Unit    Name    Location    Date    Time
0  K1  Somebody1    LOC1  2020-05-12  07:00
1  K1  Somebody1    LOC1  2020-05-12  20:10
2  K1  Somebody1    LOC1  2020-05-13  06:00
3  K1  Somebody1    LOC1  2020-05-13  20:00
4  K1  Somebody1    LOC1  2020-05-14  06:37
5  K1  Somebody1    LOC2  2020-05-15  07:00
6  K1  Somebody1    LOC2  2020-05-15  20:10
7  K1  Somebody1    LOC2  2020-05-16  06:00
8  K1  Somebody1    LOC2  2020-05-16  20:00
9  K1  Somebody1    LOC2  2020-05-17  06:37
10  K1  Somebody2    LOC2  2020-05-13  07:00
11  K1  Somebody2    LOC2  2020-05-14  10:10
12  K1  Somebody2    LOC2  2020-05-14  16:50
13  K1  Somebody2    LOC2  2020-05-15  05:36
14  K1  Somebody3    LOC1  2020-05-13  07:00
15  K1  Somebody3    LOC1  2020-05-14  10:10
16  K1  Somebody3    LOC1  2020-05-14  16:50
17  K1  Somebody3    LOC1  2020-05-15  05:36

So far I have only managed to convert Time to a datetime.time object with:

df['Time'] = df['Time'].apply(lambda x: datetime.strptime(x,'%H:%M').time())

I have tried pivot tables, groupby, and for loops, and I'm out of ideas. I would like the output to look like this:

LOC1
      Somebody1  2020-05-12  13h 10m
                 2020-05-13  14h 00m
TOTAL                        27h 10m
      Somebody2  date        hours
                 date        hours
TOTAL                        sum for somebody2
      Somebody3  date        hours
                 date        hours
TOTAL                        sum for somebody3

LOC2
      Somebody1  date        hours
                 date        hours
TOTAL                        sum for somebody1
      Somebody2  date        hours   
                 date        hours
TOTAL                        sum for somebody2

or something similar

IIUC, you can use groupby and combine_first:

import numpy as np
import pandas as pd

# combine the Date and Time strings into one timestamp
df['datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])

# first and last timestamp per person, location and calendar day
df1 = df.groupby(['Name','Location', df['datetime'].dt.normalize()])\
                                  .agg(start=('datetime','first'),
                                   end=('datetime','last'))

# time spent per day, in hours
df1['timespent'] = (df1['end'] - df1['start']) / np.timedelta64(1,'h')

# create total row.
m = df1.unstack(['Name','Location'])['timespent'].sum().unstack()
m = m.assign(TOTAL=m.sum(axis=1)).stack().to_frame('timespent')

final = df1.drop(['start','end'],axis=1).combine_first(m)

# if you want to remove single entry days
final[final['timespent'] > 0]

                               timespent
Name      Location datetime             
Somebody1 LOC1     2020-05-12  13.166667
                   2020-05-13  14.000000
          TOTAL    NaT         27.166667
Somebody2 LOC2     2020-05-14   6.666667
          TOTAL    NaT          6.666667
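If you want the "13h 10m" style from the question rather than decimal hours, one possible variation is sketched below. It is only a rough illustration: it uses a small subset of the question's rows (Somebody1 at LOC1), the helper name fmt_hm is made up for this example, and it keeps the daily rows and the totals in two separate Series instead of merging them into one frame.

```python
import pandas as pd

# Subset of the question's data (Somebody1 at LOC1), for illustration only.
df = pd.DataFrame({
    "Name": ["Somebody1"] * 5,
    "Location": ["LOC1"] * 5,
    "Date": ["2020-05-12", "2020-05-12", "2020-05-13", "2020-05-13", "2020-05-14"],
    "Time": ["07:00", "20:10", "06:00", "20:00", "06:37"],
})
df["datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"])


def fmt_hm(td):
    """Format a Timedelta as 'Hh MMm' (hypothetical helper, not a pandas function)."""
    minutes = int(td.total_seconds() // 60)
    return f"{minutes // 60}h {minutes % 60:02d}m"


# Earliest and latest timestamp per location, person and day, plus a row count.
daily = (
    df.groupby(["Location", "Name", "Date"])["datetime"]
      .agg(start="min", end="max", n="count")
)

# Drop days that have only a single entry (no pair to subtract).
daily = daily[daily["n"] >= 2]
daily["spent"] = daily["end"] - daily["start"]

# Total per person and location, still as Timedelta values.
totals = daily.groupby(level=["Location", "Name"])["spent"].sum()

daily_text = daily["spent"].map(fmt_hm)
total_text = totals.map(fmt_hm)
print(daily_text)
print(total_text)
```

Keeping the values as Timedelta until the very end means the totals are summed exactly, and the formatting is applied only for display.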

You can start with grep to collect each person's times, and then calculate the difference between consecutive rows. For example, put the people's names into a file (one name per line) and loop over it with grep:

for i in $(cat list-names); do grep "$i" a.csv | awk '{print $6}'; done

where a.csv:

0  K1  Somebody1    LOC1  2020-05-12  17:00
1  K1  Somebody1    LOC1  2020-05-12  20:10

Then, to get the difference in hours between consecutive rows:

awk '
    NR == 1{old = $6; next}     
    {print $6 - old; old = $6}  
' a.csv
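Note that awk converts a field such as 17:00 to the number 17 before subtracting, so the snippet above reports whole hours only (20 - 17 = 3) and ignores the minutes. For comparison, here is a rough pandas sketch of the same row-to-row difference that keeps the minutes. The column names are assumptions, since a.csv has no header, and the two rows are inlined as a string instead of being read from the file.

```python
import io

import pandas as pd

# The two sample rows from a.csv, inlined so the example is self-contained.
raw = """0  K1  Somebody1    LOC1  2020-05-12  17:00
1  K1  Somebody1    LOC1  2020-05-12  20:10
"""

# Column names are assumed; the file itself has no header row.
cols = ["idx", "Unit", "Name", "Location", "Date", "Time"]
a = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None, names=cols)

# Build a full timestamp so the subtraction accounts for minutes.
a["datetime"] = pd.to_datetime(a["Date"] + " " + a["Time"])

# Difference from the previous row of the same person, location and day.
a["diff"] = a.groupby(["Name", "Location", "Date"])["datetime"].diff()
print(a[["Name", "Date", "Time", "diff"]])
```

The first row of each group has no previous row, so its diff is NaT; the second row here holds a Timedelta of 3 hours 10 minutes.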

