Group, index, and compute size of consecutive duplicates

Question

Consider a column of values of any type, for example a dataframe of strings:

  Value
0  "a"
1  "b"
2  "b"
3  "c"
4  "d"
5  "e"
6  "e"
7  "e"
8  "f"
9  "f"

I would like to add two new columns, efficiently, because my actual dataframe has millions of rows: grpidx, the position of each row within its run of consecutive equal values, and grpsize, the length of that run, repeated on every row of the run:

  Value grpidx grpsize
0  "a"     0      1
1  "b"     0      2
2  "b"     1      2
3  "c"     0      1
4  "d"     0      1
5  "e"     0      3
6  "e"     1      3
7  "e"     2      3
8  "f"     0      2
9  "f"     1      2

I can get approximately correct results, but something is always off, and I cannot get grpsize to work properly.

Note: I don't want the final dataframe to carry any actual grouping or sub-index; the grouping should exist only as a temporary during the calculation.

Answer 1

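You can build a key that increases by one each time the value changes from the previous row, group by that key, and then use cumcount for the position within the run and transform('size') for the run length:

# group by consecutive values: the key increments whenever a value differs from the row above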
g = df.groupby(df['Value'].ne(df['Value'].shift()).cumsum())

# enumeration per group
df['grpidx'] = g.cumcount()

# size of each group
df['grpsize'] = g['Value'].transform('size')

Output:

  Value  grpidx  grpsize
0     a       0        1
1     b       0        2
2     b       1        2
3     c       0        1
4     d       0        1
5     e       0        3
6     e       1        3
7     e       2        3
8     f       0        2
9     f       1        2
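
To make the grouping key less opaque, here is a self-contained sketch of the same approach with the intermediate steps split out. The names starts and run_id are only illustrative and are not part of the answer above; the sample dataframe is rebuilt from the question's data.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({"Value": ["a", "b", "b", "c", "d", "e", "e", "e", "f", "f"]})

# True wherever a value differs from the one directly above it (start of a run).
# The first row is compared against the NaN produced by shift(), so it is True.
starts = df["Value"].ne(df["Value"].shift())

# Running total of run starts: all rows of one run share the same integer.
run_id = starts.cumsum()

# Group by the temporary key; it is never stored as a column or index of df.
g = df.groupby(run_id)
df["grpidx"] = g.cumcount()                    # position inside the run
df["grpsize"] = g["Value"].transform("size")   # run length, repeated on each row

print(run_id.tolist())
print(df)
```

Because run_id is only passed to groupby and never assigned to df, the result keeps its original flat index and gains only the grpidx and grpsize columns, which seems to match the note in the question.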

solution1
2 2022-08-20 18:22:54