I have a Dask DataFrame with npartitions=8; here is a snapshot of the data:
id1  id2  Page_nbr  record_type
St1  Sc1  3         START
Sc1  St1  5         ADD
Sc1  St1  9         OTHER
Sc2  St2  34        START
Sc2  St2  45        DURATION
Sc2  St2  65        END
Sc3  Sc3  4         START
I want to add a column after record_type containing a unique group_id, assigned based on the record_type: every row up to the next record_type=START gets the same group_id. The output should look like this:
id1  id2  Page_nbr  record_type  group_id
St1  Sc1  3         START        1
Sc1  St1  5         ADD          1
Sc1  St1  9         OTHER        1
Sc2  St2  34        START        2
Sc2  St2  45        DURATION     2
Sc2  St2  65        END          2
Sc3  Sc3  4         START        3
The group_id can be any unique number. Since the dataframe is huge, iterating over rows is probably not the best option. Is there a pythonic way to do this?
Take the "record_type" column, compare it to "START" to get a boolean series, and then compute the cumulative sum of that series:
ddf['group_id'] = ddf['record_type'].eq('START').cumsum()
ddf.compute()
   id1  id2  Page_nbr record_type  group_id
0  St1  Sc1         3       START         1
1  Sc1  St1         5         ADD         1
2  Sc1  St1         9       OTHER         1
3  Sc2  St2        34       START         2
4  Sc2  St2        45    DURATION         2
5  Sc2  St2        65         END         2
6  Sc3  Sc3         4       START         3
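The same boolean-cumsum idiom works in plain pandas, which Dask's DataFrame API mirrors; here is a minimal, runnable sketch on the sample data from the question (using pandas for illustration, with column names taken from the post):

```python
import pandas as pd

# Sample data mirroring the question's snapshot.
df = pd.DataFrame({
    "id1": ["St1", "Sc1", "Sc1", "Sc2", "Sc2", "Sc2", "Sc3"],
    "id2": ["Sc1", "St1", "St1", "St2", "St2", "St2", "Sc3"],
    "Page_nbr": [3, 5, 9, 34, 45, 65, 4],
    "record_type": ["START", "ADD", "OTHER", "START", "DURATION", "END", "START"],
})

# .eq('START') marks each START row as True; cumulatively summing that
# boolean series increments the counter at every START, so all rows up
# to the next START share the same number.
df["group_id"] = df["record_type"].eq("START").cumsum()

print(df["group_id"].tolist())  # [1, 1, 1, 2, 2, 2, 3]
```

Because cumsum is one of the cumulative operations Dask implements across partition boundaries, the one-liner in the answer should produce the same grouping on the partitioned DataFrame without any row iteration.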