Pandas: number rows within group cumulatively and across another group

Given the following dataframe:

    col_1 col_2 col_3
0     1     A     1
1     1     B     1
2     2     A     3
3     2     A     3
4     2     A     3
5     2     B     3
6     2     B     3
7     2     B     3
8     3     A     2
9     3     A     2
10    3     C     2
11    3     C     2
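
For anyone who wants to reproduce the problem, the frame above can be rebuilt roughly as follows. This is only a sketch: the values are copied from the table, and the default integer index is assumed.

```python
import pandas as pd

# Sample data copied from the table above (default RangeIndex assumed).
df = pd.DataFrame({
    "col_1": [1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3],
    "col_2": ["A", "B", "A", "A", "A", "B", "B", "B", "A", "A", "C", "C"],
    "col_3": [1, 1, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2],
})
print(df)
```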

I need to create a new column in which the rows are numbered cumulatively within each group formed by 'col_1' and 'col_2', with the numbering continuing after each group of 'col_1', like this:

    col_1 col_2 col_3  new
0     1     A     1     1
1     1     B     1     1
2     2     A     3     2
3     2     A     3     3
4     2     A     3     4
5     2     B     3     2
6     2     B     3     3
7     2     B     3     4
8     3     A     2     5
9     3     A     2     6
10    3     C     2     5
11    3     C     2     6

I've tried:

df['new'] = df.groupby(['col_1', 'col_2']).cumcount() + 1

But this restarts the numbering in every group instead of continuing from the previous 'col_1' group as intended.
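
To make the gap concrete, here is a small sketch of what that attempt produces next to the desired numbering. The names `attempt`, `wanted` and `missing` are illustrative only, and `wanted` is typed in by hand from the expected table.

```python
import pandas as pd

df = pd.DataFrame({
    "col_1": [1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3],
    "col_2": ["A", "B", "A", "A", "A", "B", "B", "B", "A", "A", "C", "C"],
})

# The attempt from the question: the count restarts at 1 in every (col_1, col_2) group.
attempt = df.groupby(["col_1", "col_2"]).cumcount() + 1
print(attempt.tolist())   # [1, 1, 1, 2, 3, 1, 2, 3, 1, 2, 1, 2]

# Desired numbering, copied by hand from the expected table.
wanted = pd.Series([1, 1, 2, 3, 4, 2, 3, 4, 5, 6, 5, 6])

# The difference is constant within each col_1 group: that is the missing offset.
missing = wanted - attempt
print(missing.tolist())   # [0, 0, 1, 1, 1, 1, 1, 1, 4, 4, 4, 4]
```

The per-row difference depends only on 'col_1', so the task reduces to finding one offset per 'col_1' group.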

This is a tricky problem. You want to calculate the cumcount within each group, but for all subsequent groups you need to keep track of how much has already been counted so you know the offset to apply. That can be done with a max + cumsum of this cumcount over the previous groups. The only complication is that you need to determine the relationship between previous and subsequent group labels, in case there isn't a simple + 1 increment between the labels of subsequent groups.

import pandas as pd

# Cumcount within group
s = df.groupby(['col_1', 'col_2']).cumcount()

# Determine how many cumcounts were within all previous groups of `col_1`
to_merge = s.add(1).groupby(df['col_1']).max().cumsum().add(1).to_frame('new')

# Link group with prior group label
df1 = df[['col_1']].drop_duplicates()
df1['col_1_shift'] = df1['col_1'].shift(-1)
df1 = pd.concat([to_merge, df1.set_index('col_1')], axis=1)

# Bring the group offset over
df = df.merge(df1, left_on='col_1', right_on='col_1_shift', how='left')

# Add the group offset to the cumulative count within group.
# First group (no previous group) is NaN so fill with 1, then cast back to int.
df['new'] = df['new'].fillna(1).astype(int) + s

# Clean up merging column
df = df.drop(columns='col_1_shift')

    col_1 col_2  col_3  new
0       1     A      1    1
1       1     B      1    1
2       2     A      3    2
3       2     A      3    3
4       2     A      3    4
5       2     B      3    2
6       2     B      3    3
7       2     B      3    4
8       3     A      2    5
9       3     A      2    6
10      3     C      2    5
11      3     C      2    6
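
The "max + cumsum" offset can also be traced step by step. The sketch below is a variation rather than the code above: it looks the offset up with `Series.map` instead of a merge, and it assumes each 'col_1' value forms a single contiguous block of rows. The names `count`, `group_max` and `offset` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col_1": [1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3],
    "col_2": ["A", "B", "A", "A", "A", "B", "B", "B", "A", "A", "C", "C"],
})

# 1-based count within each (col_1, col_2) group.
count = df.groupby(["col_1", "col_2"]).cumcount() + 1

# Largest count reached inside each col_1 group, in order of appearance.
group_max = count.groupby(df["col_1"], sort=False).max()
print(group_max.tolist())   # [1, 3, 2]

# Running total shifted by one group: what all earlier groups have used up.
offset = group_max.cumsum().shift(fill_value=0)
print(offset.tolist())      # [0, 1, 4]

# Look up each row's offset by its col_1 label and add it to the count.
df["new"] = count + df["col_1"].map(offset)
print(df["new"].tolist())   # [1, 1, 2, 3, 4, 2, 3, 4, 5, 6, 5, 6]
```

Because the lookup is keyed on the label itself, this variant does not need the labels to increase by 1, but it would give wrong offsets if the same 'col_1' value appeared in two separate blocks.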

You can use two consecutive groupby operations: the first on both columns, the second on the result of the first, grouped by col_1 only:

# classical cumcount per group
count1 = df.groupby(['col_1', 'col_2']).cumcount().add(1)
# max cumcount per col_1 group
g = count1.groupby(df['col_1']) # (*) read below
# cumulated max of the previous groups (0 for the first group)
count2 = g.ngroup().map(g.max().cumsum()).fillna(0).astype(int)
# add the two
df['new'] = count1 + count2

### Note (*)
## if df['col_1'] is not of the form 1/2/3...
## use this to group instead:
# group = df['col_1'].ne(df['col_1'].shift()).cumsum()
# g = count1.groupby(group)

Output:

    col_1 col_2  col_3  new
0       1     A      1    1
1       1     B      1    1
2       2     A      3    2
3       2     A      3    3
4       2     A      3    4
5       2     B      3    2
6       2     B      3    3
7       2     B      3    4
8       3     A      2    5
9       3     A      2    6
10      3     C      2    5
11      3     C      2    6
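
To illustrate the note (*), here is a sketch that applies the run-based grouping key to labels that are not 1/2/3. The helper name `number_rows` and the sample values are made up for the example, and it assumes each 'col_1' value occupies one contiguous block of rows.

```python
import pandas as pd

def number_rows(df):
    """Hypothetical helper: two-groupby numbering using the run-based key from the note."""
    count1 = df.groupby(["col_1", "col_2"]).cumcount().add(1)
    # Run id: 1 for the first block of equal col_1 values, 2 for the next, and so on.
    run_id = df["col_1"].ne(df["col_1"].shift()).cumsum()
    g = count1.groupby(run_id)
    # ngroup() is 0-based while run_id is 1-based, so the map picks up
    # the cumulated max of the previous run (NaN, hence 0, for the first run).
    count2 = g.ngroup().map(g.max().cumsum()).fillna(0).astype(int)
    return count1 + count2

# Made-up data with arbitrary, unsorted col_1 labels.
df = pd.DataFrame({
    "col_1": [10, 10, 25, 25, 25, 7, 7],
    "col_2": ["A", "B", "A", "A", "B", "C", "C"],
})
df["new"] = number_rows(df)
print(df["new"].tolist())   # [1, 1, 2, 3, 2, 4, 5]
```

Grouping `count1` directly by 'col_1' would not work here, because `ngroup()` would no longer line up with the labels 10, 25 and 7.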
