I have a Dask DataFrame with npartitions=8; here is a snapshot of the data:
id1  id2  Page_nbr  record_type
St1  Sc1  3         START
Sc1  St1  5         ADD
Sc1  St1  9         OTHER
Sc2  St2  34        START
Sc2  St2  45        DURATION
Sc2  St2  65        END
Sc3  Sc3  4         START
I want to add a column after record_type containing a unique group_id, assigned based on the record_type: every row up to the next record_type=START gets the same group_id. The output should look like this:
id1  id2  Page_nbr  record_type  group_id
St1  Sc1  3         START        1
Sc1  St1  5         ADD          1
Sc1  St1  9         OTHER        1
Sc2  St2  34        START        2
Sc2  St2  45        DURATION     2
Sc2  St2  65        END          2
Sc3  Sc3  4         START        3
The group_id can be any unique number. Since the dataframe is huge, iterating over rows is probably not the best option. Is there a pythonic way to do this?
Take the "record_type" column, compare it to "START" to get a boolean series, and then compute the cumulative sum of that series:
ddf['group_id'] = ddf['record_type'].eq('START').cumsum()
ddf.compute()
   id1  id2  Page_nbr record_type  group_id
0  St1  Sc1         3       START         1
1  Sc1  St1         5         ADD         1
2  Sc1  St1         9       OTHER         1
3  Sc2  St2        34       START         2
4  Sc2  St2        45    DURATION         2
5  Sc2  St2        65         END         2
6  Sc3  Sc3         4       START         3
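The same boolean-cumsum idiom works in plain pandas, which Dask's DataFrame API mirrors; here is a minimal, runnable sketch on the sample data from the question (using pandas for illustration, with column names taken from the post):

```python
import pandas as pd

# Sample data mirroring the question's snapshot.
df = pd.DataFrame({
    "id1": ["St1", "Sc1", "Sc1", "Sc2", "Sc2", "Sc2", "Sc3"],
    "id2": ["Sc1", "St1", "St1", "St2", "St2", "St2", "Sc3"],
    "Page_nbr": [3, 5, 9, 34, 45, 65, 4],
    "record_type": ["START", "ADD", "OTHER", "START", "DURATION", "END", "START"],
})

# .eq('START') marks each START row as True; cumulatively summing that
# boolean series increments the counter at every START, so all rows up
# to the next START share the same number.
df["group_id"] = df["record_type"].eq("START").cumsum()

print(df["group_id"].tolist())  # [1, 1, 1, 2, 2, 2, 3]
```

Because cumsum is one of the cumulative operations Dask implements across partition boundaries, the one-liner in the answer should produce the same grouping on the partitioned DataFrame without any row iteration.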