Efficient method to perform a groupby count on unique values in a df

The code below aims to return a df that counts the number of points that are positive and negative relative to a reference point (mainX, mainY). Whether a point is positive or negative is determined by the Direction. The points are separated into two groups (I, J). They are located at X, Y, and each has a relative Label.

So I split the points up into their respective groups. I then subset the df into positive/negative df's using a query. These df's are then grouped by Time, and the count is written to a separate column. Finally, the df's are concatenated.

All this seems very inefficient, especially if I have numerous unique values in Group. For example, I have to repeat the whole querying sequence to return the counts for Group J.

Is there a more efficient way to produce the intended output?

import pandas as pd

df = pd.DataFrame({
        'Time' : ['09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2'],
        'Group' : ['I','J','I','J','I','J','I','J','I','J','I','J'],                  
        'Label' : ['A','B','C','D','E','F','A','B','C','D','E','F'],                 
        'X' : [8,4,3,8,7,4,2,3,3,4,6,1],
        'Y' : [3,6,4,8,5,2,8,8,2,4,5,1],
        'mainX' : [5,5,5,5,5,5,5,5,5,5,5,5],
        'mainY' : [5,5,5,5,5,5,5,5,5,5,5,5],
        'Direction' : ['Left','Right','Left','Right','Left','Right','Left','Right','Left','Right','Left','Right']
    })

# Determine amount of unique groups
Groups = df['Group'].unique().tolist()

# Subset groups into separate df's
Group_I = df.loc[df['Group'] == Groups[0]]
Group_J = df.loc[df['Group'] == Groups[1]]


# Separate into positive and negative direction for each group    
GroupI_Pos = Group_I.query("(Direction == 'Right' and X > mainX) or (Direction == 'Left' and X < mainX)").copy()
GroupI_Neg = Group_I.query("(Direction == 'Right' and X < mainX) or (Direction == 'Left' and X > mainX)").copy()

# Count of items per timestamp for Group I
GroupI_Pos['GroupI_Positive_Count'] = GroupI_Pos.groupby(['Time'])['Time'].transform('count')   
GroupI_Neg['GroupI_Negative_Count'] = GroupI_Neg.groupby(['Time'])['Time'].transform('count')   

# Combine Positive/Negative dfs
df_I = pd.concat([GroupI_Pos, GroupI_Neg], sort = False).sort_values(by = 'Time')

# Forward fill Nan grouped by time
df_I = df_I.groupby(['Time']).ffill()

Intended Output:

          Time Group Label  X  Y  mainX  mainY Direction  GroupI_Positive_Count  GroupI_Negative_Count  GroupJ_Positive_Count  GroupJ_Negative_Count
0   09:00:00.1     I     A  8  3      5      5      Left                      1                      2                      1                      2
1   09:00:00.1     J     B  4  6      5      5     Right                      1                      2                      1                      2
2   09:00:00.1     I     C  3  4      5      5      Left                      1                      2                      1                      2
3   09:00:00.1     J     D  8  8      5      5     Right                      1                      2                      1                      2
4   09:00:00.1     I     E  7  5      5      5      Left                      1                      2                      1                      2
5   09:00:00.1     J     F  4  2      5      5     Right                      1                      2                      1                      2
6   09:00:00.2     I     A  2  8      5      5      Left                      2                      1                      0                      3
7   09:00:00.2     J     B  3  8      5      5     Right                      2                      1                      0                      3
8   09:00:00.2     I     C  3  2      5      5      Left                      2                      1                      0                      3
9   09:00:00.2     J     D  4  4      5      5     Right                      2                      1                      0                      3
10  09:00:00.2     I     E  6  5      5      5      Left                      2                      1                      0                      3
11  09:00:00.2     J     F  1  1      5      5     Right                      2                      1                      0                      3

I used numpy.select to label each row as positive or negative based on the conditions. A pivot table then gives the count of positives and negatives per Time and Group, and the result is merged back onto the original df using the join method.

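import numpy as np

# A point is positive when it lies on the side named by Direction
pos1 = (df.Direction=='Right') & (df.X.ge(df.mainX))
pos2 = (df.Direction=='Left') & (df.X.le(df.mainX))
neg1 = (df.Direction=='Right') & (df.X.le(df.mainX))
neg2 = (df.Direction=='Left') & (df.X.ge(df.mainX))
cond_list = [(pos1|pos2),(neg1|neg2)]
choice_list = ['pos','neg']

# An explicit string default avoids mixing the integer default 0 with the string choices
df['choices'] = np.select(cond_list,choice_list,default='none')

# Count labels per Time for every (Group, choices) pair
R = df.pivot_table(index='Time',
                   columns=['Group','choices'],values='Label',
                   aggfunc='count')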

R.columns = R.columns.to_flat_index()

#better than hardcoding the columns
R.columns = ['Group'+'_'.join(i)+'_count' for i in R.columns]


# Join the counts back onto every row by Time
(df
 .set_index('Time')
 .join(R).fillna(0)
 .reset_index()
 .drop('choices',axis=1))


          Time Group Label  X  Y  mainX  mainY Direction  \
0   09:00:00.1     I     A  8  3      5      5      Left   
1   09:00:00.1     J     B  4  6      5      5     Right   
2   09:00:00.1     I     C  3  4      5      5      Left   
3   09:00:00.1     J     D  8  8      5      5     Right   
4   09:00:00.1     I     E  7  5      5      5      Left   
5   09:00:00.1     J     F  4  2      5      5     Right   
6   09:00:00.2     I     A  2  8      5      5      Left   
7   09:00:00.2     J     B  3  8      5      5     Right   
8   09:00:00.2     I     C  3  2      5      5      Left   
9   09:00:00.2     J     D  4  4      5      5     Right   
10  09:00:00.2     I     E  6  5      5      5      Left   
11  09:00:00.2     J     F  1  1      5      5     Right   

GroupI_neg_count  GroupI_pos_count  GroupJ_neg_count  \
0                     2.0                    1.0                    2.0   
1                     2.0                    1.0                    2.0   
2                     2.0                    1.0                    2.0   
3                     2.0                    1.0                    2.0   
4                     2.0                    1.0                    2.0   
5                     2.0                    1.0                    2.0   
6                     1.0                    2.0                    3.0   
7                     1.0                    2.0                    3.0   
8                     1.0                    2.0                    3.0   
9                     1.0                    2.0                    3.0   
10                    1.0                    2.0                    3.0   
11                    1.0                    2.0                    3.0   

GroupJ_pos_count  
0                     1.0  
1                     1.0  
2                     1.0  
3                     1.0  
4                     1.0  
5                     1.0  
6                     0.0  
7                     0.0  
8                     0.0  
9                     0.0  
10                    0.0  
11                    0.0  

Here is my take on it:

s = (((df.Direction.eq('Right') & df.X.gt(df.mainX)) | 
      (df.Direction.eq('Left')  & df.X.lt(df.mainX)))
     .replace({True: 'Pos', False: 'Neg'}))

df_count = df.groupby(['Time', 'Group', s]).size().unstack([1, 2], fill_value=0)
df_count.columns = df_count.columns.map(lambda x: f'Group{x[0]}_{x[1]}')

df_final = df.merge(df_count, left_on='Time', right_index=True)

Out[521]:
          Time Group Label  X  Y  mainX  mainY Direction  GroupI_Neg  \
0   09:00:00.1     I     A  8  3      5      5      Left           2
1   09:00:00.1     J     B  4  6      5      5     Right           2
2   09:00:00.1     I     C  3  4      5      5      Left           2
3   09:00:00.1     J     D  8  8      5      5     Right           2
4   09:00:00.1     I     E  7  5      5      5      Left           2
5   09:00:00.1     J     F  4  2      5      5     Right           2
6   09:00:00.2     I     A  2  8      5      5      Left           1
7   09:00:00.2     J     B  3  8      5      5     Right           1
8   09:00:00.2     I     C  3  2      5      5      Left           1
9   09:00:00.2     J     D  4  4      5      5     Right           1
10  09:00:00.2     I     E  6  5      5      5      Left           1
11  09:00:00.2     J     F  1  1      5      5     Right           1

    GroupI_Pos  GroupJ_Neg  GroupJ_Pos
0            1           2           1
1            1           2           1
2            1           2           1
3            1           2           1
4            1           2           1
5            1           2           1
6            2           3           0
7            2           3           0
8            2           3           0
9            2           3           0
10           2           3           0
11           2           3           0
