简体   繁体   English

基于条件的 Pandas DataFrame groupby

[英]Pandas DataFrame groupby based on condition

The most similar question I found was here but with no proper answer.我发现的最相似的问题在这里,但没有正确的答案。

Basically I have an issue where I'm trying to use groupby on a dataframe to generate unique IDs for bus routes.基本上我有一个问题,我试图在数据帧上使用 groupby 来为公交路线生成唯一的 ID。 The problem is, the data I have at my disposal sometimes (though rarely) has the same values for my groupby columns, so they're considered the same bus even though they aren't.问题是,我掌握的数据有时(尽管很少)对于我的 groupby 列具有相同的值,因此即使它们不是,它们也被视为相同的总线。

The only other way I can think of is to group buses based on another column called "Type of stop", where there is an indicator for Start, Middle and End.我能想到的唯一另一种方法是根据另一列名为“停止类型”的列对公交车进行分组,其中有一个开始、中间和结束的指示器。 I'd like to use groupby to create groups based on this column where each group starts where "type of stop" = Start, and ends where "type of stop" = End.我想使用 groupby 基于此列创建组,其中每个组从“停止类型”= 开始,并在“停止类型”= 结束时结束。

Consider the following data:考虑以下数据:

df = pd.DataFrame({'Vehicle_ID': ['A']*18,
    'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3)})

   Cond   Position
0     A   START
1     A   MID  
2     A   MID   
3     A   END    
4     A   MID    
5     A   START   
6     A   START   
7     A   MID    
8     A   MID    
9     A   END    
10    A   MID   
11    A   START    
12    A   START    
13    A   MID    
14    A   MID    
15    A   END     
16    A   MID    
17    A   START

The only way I came up with to accurately group these buses together is to generate an additional column with the bus sequence id, but given that I'm working with lots of data, this isn't a very efficient solution.我想出的将这些总线准确组合在一起的唯一方法是生成一个带有总线序列 ID 的附加列,但鉴于我正在处理大量数据,这不是一个非常有效的解决方案。 I'm hoping to be able to manage what I want to do with a single groupby, if possible, in order to generate the following output如果可能的话,我希望能够管理我想要用单个 groupby 做的事情,以便生成以下输出

   Cond   Position   Group
0     A   START      1
1     A   MID        1
2     A   MID        1
3     A   END        1
4     A   MID        
5     A   START      2
6     A   START      2
7     A   MID        2
8     A   MID        2
9     A   END        2 
10    A   MID        
11    A   START      3
12    A   START      3 
13    A   MID        3
14    A   MID        3
15    A   END        3 
16    A   MID        
17    A   START      4

One idea is to factorize via np.select , then use a custom loop via numba :一种想法是通过np.select分解,然后通过numba使用自定义循环:

from numba import njit

df = pd.DataFrame({'Vehicle_ID': ['A']*18,
                   'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3})

@njit
def grouper(pos):
    res = np.empty(pos.shape)
    num = 1
    started = 0
    for i in range(len(res)):
        current_pos = pos[i]
        if (started == 0) and (current_pos == 0):
            started = 1
            res[i] = num
        elif (started == 1) and (current_pos == 1):
            started = 0
            res[i] = num
            num += 1
        elif (started == 1) and (current_pos in [-1, 0]):
            res[i] = num
        else:
            res[i] = 0
    return res

arr = np.select([df['Position'].eq('START'), df['Position'].eq('END')], [0, 1], -1)

df['Group'] = grouper(arr).astype(int)

Result:结果:

print(df)

   Position Vehicle_ID  Group
0     START          A      1
1       MID          A      1
2       MID          A      1
3       END          A      1
4       MID          A      0
5     START          A      2
6     START          A      2
7       MID          A      2
8       MID          A      2
9       END          A      2
10      MID          A      0
11    START          A      3
12    START          A      3
13      MID          A      3
14      MID          A      3
15      END          A      3
16      MID          A      0
17    START          A      4

In my opinion, you should not include "blank" values as this would force your series to be object dtype, inefficient for any subsequent processing.在我看来,你应该包括“空白”的价值观,因为这将迫使你的序列是object D型,低效的任何后续处理。 As above, you can use 0 instead.如上所述,您可以使用0代替。

Performance benchmarking性能基准测试

numba is around ~10x faster than one pure Pandas approach:- numba比纯 Pandas 方法快约 10 倍:-

import pandas as pd, numpy as np
from numba import njit

df = pd.DataFrame({'Vehicle_ID': ['A']*18,
                   'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3})


df = pd.concat([df]*10, ignore_index=True)

assert joz(df.copy()).equals(jpp(df.copy()))

%timeit joz(df.copy())  # 18.6 ms per loop
%timeit jpp(df.copy())  # 1.95 ms per loop

Benchmarking functions:基准测试功能:

def joz(df):
    # identification of sequences
    df['Position_Prev'] = df['Position'].shift(1)
    df['Sequence'] = 0
    df.loc[(df['Position'] == 'START') & (df['Position_Prev'] != 'START'), 'Sequence'] = 1
    df.loc[df['Position'] == 'END', 'Sequence'] = -1
    df['Sequence_Sum'] = df['Sequence'].cumsum()
    df.loc[df['Sequence'] == -1, 'Sequence_Sum'] = 1

    # take only items between START and END and generate Group number
    df2 = df[df['Sequence_Sum'] == 1].copy()
    df2.loc[df['Sequence'] == -1, 'Sequence'] = 0
    df2['Group'] = df2['Sequence'].cumsum()

    # merge results to one dataframe
    df = df.merge(df2[['Group']], left_index=True, right_index=True, how='left')
    df['Group'] = df['Group'].fillna(0)
    df['Group'] = df['Group'].astype(int)
    df.drop(['Position_Prev', 'Sequence', 'Sequence_Sum'], axis=1, inplace=True)    
    return df

@njit
def grouper(pos):
    res = np.empty(pos.shape)
    num = 1
    started = 0
    for i in range(len(res)):
        current_pos = pos[i]
        if (started == 0) and (current_pos == 0):
            started = 1
            res[i] = num
        elif (started == 1) and (current_pos == 1):
            started = 0
            res[i] = num
            num += 1
        elif (started == 1) and (current_pos in [-1, 0]):
            res[i] = num
        else:
            res[i] = 0
    return res

def jpp(df):
    arr = np.select([df['Position'].eq('START'), df['Position'].eq('END')], [0, 1], -1)
    df['Group'] = grouper(arr).astype(int)
    return df

I have some solution.我有一些解决办法。 You have to avoid loops and try to using sliding, slicing and merging.您必须避免循环并尝试使用滑动、切片和合并。

This is my first prototype (should be refactored)这是我的第一个原型(应该重构)

# identification of sequences
df['Position_Prev'] = df['Position'].shift(1)
df['Sequence'] = 0
df.loc[(df['Position'] == 'START') & (df['Position_Prev'] != 'START'), 'Sequence'] = 1
df.loc[df['Position'] == 'END', 'Sequence'] = -1
df['Sequence_Sum'] = df['Sequence'].cumsum()
df.loc[df['Sequence'] == -1, 'Sequence_Sum'] = 1

# take only items between START and END and generate Group number
df2 = df[df['Sequence_Sum'] == 1].copy()
df2.loc[df['Sequence'] == -1, 'Sequence'] = 0
df2['Group'] = df2['Sequence'].cumsum()

# merge results to one dataframe
df = df.merge(df2[['Group']], left_index=True, right_index=True, how='left')
df['Group'] = df['Group'].fillna(0)
df['Group'] = df['Group'].astype(int)
df.drop(columns=['Position_Prev', 'Sequence', 'Sequence_Sum'], inplace=True)
df

Result:结果:

Vehicle_ID Position  Group
0           A    START      1
1           A      MID      1
2           A      MID      1
3           A      END      1
4           A      MID      0
5           A    START      2
6           A    START      2
7           A      MID      2
8           A      MID      2
9           A      END      2
10          A      MID      0
11          A    START      3
12          A    START      3
13          A      MID      3
14          A      MID      3
15          A      END      3
16          A      MID      0
17          A    START      4

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM