Python Pandas - combining 2 lines from data frame - with condition

I have a Pandas data frame that looks like this:

A       B     C    Stime    Etime    
1220627 a   10.0 18:00:00 18:09:59
1220627 a   12.0 18:15:00 18:26:59
1220683 b   3.0  18:36:00 18:38:59
1220683 a   3.0  18:36:00 18:38:59
1220732 a   59.0 18:00:00 18:58:59
1220760 A   16.0 18:24:00 18:39:59
1220760 a   16.0 18:24:00 18:39:59
1220760 A   19.0 18:40:00 18:58:59
1220760 b   19.0 18:40:00 18:58:59
1220760 a   19.0 18:40:00 18:58:59
1220775 a   3.0  18:03:00 18:05:59

The Stime and Etime columns are of type datetime.

C is the number of minutes between Stime and Etime.

Column A is the household ID and column B is the person ID within the household

(so that columns A and B together represent a unique person).

What I need to do is update the table so that if, for a given person, a row's Stime comes right after the previous row's Etime, the two rows are merged and C is updated accordingly.

For example, for person a in household 1220760 the first Etime is 18:39:59

and the second Stime is 18:40:00, which comes right after 18:39:59, so I would like to merge the two rows and update C for this person to 35 (16 + 19).

I tried to use groupby, but I don't know how to add the condition that Stime must come right after Etime.

If we add one second to Etime, then we can find the rows to be joined by grouping by ['A', 'B'] and, for each group, comparing the shifted Etime with the next Stime:

df['Etime'] += pd.Timedelta(seconds=1)
df = df.sort_values(by=['A', 'B', 'Stime'])
df['keep'] = df.groupby(['A','B'])['Etime'].shift(1) != df['Stime']
#           A  B     C               Etime               Stime   keep
# 0   1220627  a  10.0 2016-05-29 18:10:00 2016-05-29 18:00:00   True
# 1   1220627  a  12.0 2016-05-29 18:27:00 2016-05-29 18:15:00   True
# 3   1220683  a   3.0 2016-05-29 18:39:00 2016-05-29 18:36:00   True
# 2   1220683  b   3.0 2016-05-29 18:39:00 2016-05-29 18:36:00   True
# 4   1220732  a  59.0 2016-05-29 18:59:00 2016-05-29 18:00:00   True
# 5   1220760  A  16.0 2016-05-29 18:40:00 2016-05-29 18:24:00   True
# 7   1220760  A  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00  False
# 12  1220760  a   0.0 2016-05-29 18:10:00 2016-05-29 18:00:00   True
# 6   1220760  a  16.0 2016-05-29 18:40:00 2016-05-29 18:24:00   True
# 9   1220760  a  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00  False
# 11  1220760  a  11.0 2016-05-29 19:10:00 2016-05-29 18:59:00  False
# 8   1220760  b  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00   True
# 10  1220775  a   3.0 2016-05-29 18:06:00 2016-05-29 18:03:00   True

We will want to keep rows where keep is True and remove rows where keep is False, except that we will also want to update the Etimes as appropriate.

It would be nice if we could assign a "group number" to each row so that we could group by ['A', 'B', 'group_number'] -- and in fact we can. All we need to do is apply cumsum to the keep column:

df['group_number'] = df.groupby(['A','B'])['keep'].cumsum()
#           A  B     C               Etime               Stime   keep  group_number
# 0   1220627  a  10.0 2016-05-29 18:10:00 2016-05-29 18:00:00   True           1.0
# 1   1220627  a  12.0 2016-05-29 18:27:00 2016-05-29 18:15:00   True           2.0
# 3   1220683  a   3.0 2016-05-29 18:39:00 2016-05-29 18:36:00   True           1.0
# 2   1220683  b   3.0 2016-05-29 18:39:00 2016-05-29 18:36:00   True           1.0
# 4   1220732  a  59.0 2016-05-29 18:59:00 2016-05-29 18:00:00   True           1.0
# 5   1220760  A  16.0 2016-05-29 18:40:00 2016-05-29 18:24:00   True           1.0
# 7   1220760  A  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00  False           1.0
# 12  1220760  a   0.0 2016-05-29 18:10:00 2016-05-29 18:00:00   True           1.0
# 6   1220760  a  16.0 2016-05-29 18:40:00 2016-05-29 18:24:00   True           2.0
# 9   1220760  a  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00  False           2.0
# 11  1220760  a  11.0 2016-05-29 19:10:00 2016-05-29 18:59:00  False           2.0
# 8   1220760  b  19.0 2016-05-29 18:59:00 2016-05-29 18:40:00   True           1.0
# 10  1220775  a   3.0 2016-05-29 18:06:00 2016-05-29 18:03:00   True           1.0

Now the desired result can be found by grouping by ['A', 'B', 'group_number'] and finding the minimum Stime and maximum Etime for each group:

result = df.groupby(['A','B', 'group_number']).agg({'Stime':'min', 'Etime':'max'})

                                     Stime               Etime
A       B group_number                                        
1220627 a 1.0          2016-05-29 18:00:00 2016-05-29 18:10:00
          2.0          2016-05-29 18:15:00 2016-05-29 18:27:00
1220683 a 1.0          2016-05-29 18:36:00 2016-05-29 18:39:00
        b 1.0          2016-05-29 18:36:00 2016-05-29 18:39:00
1220732 a 1.0          2016-05-29 18:00:00 2016-05-29 18:59:00
1220760 A 1.0          2016-05-29 18:24:00 2016-05-29 18:59:00
        a 1.0          2016-05-29 18:00:00 2016-05-29 18:10:00
          2.0          2016-05-29 18:24:00 2016-05-29 19:10:00
        b 1.0          2016-05-29 18:40:00 2016-05-29 18:59:00
1220775 a 1.0          2016-05-29 18:03:00 2016-05-29 18:06:00

Putting it all together,

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': [1220627, 1220627, 1220683, 1220683, 1220732, 1220760, 1220760,
           1220760, 1220760, 1220760, 1220775, 1220760, 1220760],
     'B': ['a', 'a', 'b', 'a', 'a', 'A', 'a', 'A', 'b', 'a', 'a', 'a', 'a'], 
     'C': [10.0, 12.0, 3.0, 3.0, 59.0, 16.0, 16.0, 19.0, 19.0, 19.0, 3.0, 11.0, 0], 
     'Stime': ['18:00:00', '18:15:00', '18:36:00', '18:36:00', '18:00:00',
               '18:24:00', '18:24:00', '18:40:00', '18:40:00', '18:40:00', 
               '18:03:00', '18:59:00', '18:00:00'],
     'Etime': ['18:09:59', '18:26:59', '18:38:59', '18:38:59', '18:58:59',
               '18:39:59', '18:39:59', '18:58:59', '18:58:59', '18:58:59', 
               '18:05:59', '19:09:59', '18:09:59'],})
for col in ['Stime', 'Etime']:
    df[col] = pd.to_datetime(df[col])
df['Etime'] += pd.Timedelta(seconds=1)
df = df.sort_values(by=['A', 'B', 'Stime'])
df['keep'] = df.groupby(['A','B'])['Etime'].shift(1) != df['Stime']
df['group_number'] = df.groupby(['A','B'])['keep'].cumsum()
result = df.groupby(['A','B', 'group_number']).agg({'Stime':'min', 'Etime':'max'})
result = result.reset_index()
result['C'] = (result['Etime']-result['Stime']).dt.total_seconds() / 60.0
result = result[['A', 'B', 'C', 'Stime', 'Etime']]
print(result)

yields

         A  B     C               Stime               Etime
0  1220627  a  10.0 2016-05-29 18:00:00 2016-05-29 18:10:00
1  1220627  a  12.0 2016-05-29 18:15:00 2016-05-29 18:27:00
2  1220683  a   3.0 2016-05-29 18:36:00 2016-05-29 18:39:00
3  1220683  b   3.0 2016-05-29 18:36:00 2016-05-29 18:39:00
4  1220732  a  59.0 2016-05-29 18:00:00 2016-05-29 18:59:00
5  1220760  A  35.0 2016-05-29 18:24:00 2016-05-29 18:59:00
6  1220760  a  10.0 2016-05-29 18:00:00 2016-05-29 18:10:00
7  1220760  a  46.0 2016-05-29 18:24:00 2016-05-29 19:10:00
8  1220760  b  19.0 2016-05-29 18:40:00 2016-05-29 18:59:00
9  1220775  a   3.0 2016-05-29 18:03:00 2016-05-29 18:06:00

One of the advantages of using half-open intervals of the form [start, end) instead of fully-closed intervals [start, end] is that when two intervals abut, the end of one equals the start of the next.

Another advantage is that the number of minutes in a half-open interval equals end-start. With a fully-closed interval, the formula becomes end-start+1.

Python's builtin range and list slicing syntax use half-open intervals for these same reasons, so I would recommend using half-open intervals [Stime, Etime) in your DataFrame too.
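
As a quick illustration, here is a small sketch (using the first 1220760/a row from the sample; this snippet is not part of the original answer) of how the minute count falls straight out of the subtraction once the interval is half-open:

import pandas as pd

start = pd.Timestamp('2016-05-29 18:24:00')        # Stime
open_end = pd.Timestamp('2016-05-29 18:40:00')     # Etime after adding one second
closed_end = pd.Timestamp('2016-05-29 18:39:59')   # Etime as stored in the question

# half-open [start, end): the duration is simply end - start
print((open_end - start).total_seconds() / 60.0)                                 # 16.0

# fully-closed [start, end]: the missing second has to be added back in
print(((closed_end - start) + pd.Timedelta(seconds=1)).total_seconds() / 60.0)   # 16.0

If the closed-interval form is ever needed again for display, subtracting the second back out of the result (result['Etime'] -= pd.Timedelta(seconds=1)) would restore it.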

What about this approach?

In [68]: df.groupby(['A','B', df.Stime - df['Etime'].shift() <= pd.Timedelta('1S')], as_index=False)['C'].sum()
Out[68]:
         A  B     C
0  1220627  a  22.0
1  1220683  a   3.0
2  1220683  b   3.0
3  1220732  a  59.0
4  1220760  A  35.0
5  1220760  a  35.0
6  1220760  b  19.0
7  1220775  a   3.0
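
For reference, a minimal sketch (assuming df already has datetime Stime and Etime columns and is sorted by ['A', 'B', 'Stime'], as in the sample) of the boolean Series that serves as the third grouping level here:

# True when the gap between the previous row's Etime and this row's Stime
# is at most one second, i.e. the two intervals abut
contiguous = df.Stime - df['Etime'].shift() <= pd.Timedelta('1S')
print(contiguous)

Within each (A, B) pair, rows that share the same key value fall into the same group, so their C values are summed together.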

OK, I think I have a solution, but it is very crude and I'm sure someone can improve upon it.

Assuming df is the data you have provided above:

df['Stime'] = pd.to_datetime(df['Stime'], format='%H:%M:%S') # needs to be converted to datetime
df['Etime'] = pd.to_datetime(df['Etime'], format='%H:%M:%S') # needs to be converted to datetime

df = df.sort_values(['A','B','Stime']) # data needs to be sorted by unique person : Stime
df = df.reset_index(drop=True)
df = df.reset_index() 

def new_person(row):
    if row.name > 0:
        # person differs from the row above (column 1 is A, column 2 is B)
        if row['A'] != df.iloc[row.name-1, 1] or row['B'] != df.iloc[row.name-1, 2]:
            return 'Yes'

def update(row):
    if row.name > 0:
        if row['B'] == df.iloc[row.name-1, 2]:
            # column 4 is Stime, column 5 is Etime: merge when this row starts
            # within one second of the previous row's end
            if df.iloc[row.name, 4] - df.iloc[row.name-1, 5] >= pd.Timedelta(seconds=0) and df.iloc[row.name, 4] - df.iloc[row.name-1, 5] < pd.Timedelta(seconds=2):
                return df.groupby(['A','B'])['C'].cumsum().iloc[row.name]

def rewrite(row):
    if row['update'] > 0:
        return row['update']
    else:
        return row['C']

df['new_person'] = df.apply(new_person, axis=1) # adds column where value = 'Yes' if person is not the same as row above
df['update'] = df.apply(update,axis=1) # adds a column 'update' to allow for a cumulative sum rewritten to 'C' in rewrite function
print(df)

df['Stime'] = pd.to_datetime(df['Stime'], format='%H:%M:%S').dt.time # removes date from datetime
df['Etime'] = pd.to_datetime(df['Etime'], format='%H:%M:%S').dt.time # removes date from datetime
df['C'] = df.apply(rewrite,axis=1) # rewrites values for 'C' column

# hacky way of combining idxmax and indices of rows where the person is 'new'
updated = df.groupby(['A','B'])['C'].agg(pd.Series.idxmax).values
not_updated = df['new_person'].isnull().tolist()

combined = [x for x in df.index if (x in updated or x in not_updated)]

df = df.iloc[combined]
df = df.drop(['new_person','update','index'],axis=1)
print(df)

Apologies for the extremely hacky answer, but I think it should achieve what you need. I'm not sure how well it will work if your dataframe is very large, though.

Resulting dataframe:

          A  B   C     Stime     Etime
0   1220627  a  10  18:00:00  18:09:59
1   1220627  a  12  18:15:00  18:26:59
2   1220683  a   3  18:36:00  18:38:59
3   1220683  b   3  18:36:00  18:38:59
4   1220732  a  59  18:00:00  18:58:59
6   1220760  A  35  18:40:00  18:58:59
9   1220760  a  46  18:59:00  18:09:59
10  1220760  b  19  18:40:00  18:58:59
11  1220775  a   3  18:03:00  18:05:59
