
Group, index, and compute size of consecutive duplicates

Consider a column of values of any type, for example a dataframe of strings:

  Value
0  "a"
1  "b"
2  "b"
3  "c"
4  "d"
5  "e"
6  "e"
7  "e"
8  "f"
9  "f"
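
For anyone who wants to reproduce this, a frame like the one above could be built as follows. This is only a sketch: the variable name `df` is an assumption, and the strings are stored without the surrounding quote characters.

```python
import pandas as pd

# Sample data from the table above; the name `df` is an assumption.
df = pd.DataFrame({"Value": ["a", "b", "b", "c", "d", "e", "e", "e", "f", "f"]})
print(df)
```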

I would like to add two new columns, and to do so efficiently because my actual dataframe has millions of rows: grpidx, the position of each row within its group of consecutive equal values, and grpsize, the size of that group, repeated on every row of the group:

  Value grpidx grpsize
0  "a"     0      1
1  "b"     0      2
2  "b"     1      2
3  "c"     0      1
4  "d"     0      1
5  "e"     0      3
6  "e"     1      3
7  "e"     2      3
8  "f"     0      2
9  "f"     1      2

I can get approximate results, but something is always off, and I have not managed to get grpsize working properly.

Note: I don't want the final dataframe to carry any actual grouping or sub-index; the grouping should exist only as a temporary during the calculation.

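You can build a key that labels each run of consecutive equal values, group on that key, and then use cumcount and transform: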

# group by runs of consecutive equal values:
# a new group starts wherever a value differs from the previous row
g = df.groupby(df['Value'].ne(df['Value'].shift()).cumsum())

# enumeration per group
df['grpidx'] = g.cumcount()

# size of each group
df['grpsize'] = g['Value'].transform('size')

Output:

  Value  grpidx  grpsize
0     a       0        1
1     b       0        2
2     b       1        2
3     c       0        1
4     d       0        1
5     e       0        3
6     e       1        3
7     e       2        3
8     f       0        2
9     f       1        2
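
To see why this works, it may help to look at the intermediate grouping key. The sketch below is self-contained and rebuilds the sample data; the names `starts` and `key` are introduced here purely for illustration and are not part of the answer above.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({"Value": list("abbcdeeeff")})

# True wherever a value differs from the previous row, i.e. a new run starts.
# The first row is compared against NaN, so it is always True.
starts = df["Value"].ne(df["Value"].shift())

# Running total of run starts: all rows of one run share the same label.
key = starts.cumsum()
print(key.tolist())

# Group on the temporary key only; the frame itself keeps its flat index.
g = df.groupby(key)
df["grpidx"] = g.cumcount()                   # 0-based position within each run
df["grpsize"] = g["Value"].transform("size")  # run length, repeated on every row
print(df)
```

Because the key is only passed to groupby and never assigned back, the result keeps its original flat index and has no group level. One caveat worth checking on real data: NaN never compares equal to NaN, so consecutive missing values would each be treated as their own group of size 1.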
