I have a dataframe with several categorical columns, and I want to aggregate all these into a single categorical column, preferably using Pandas.
For an example, if I have two columns, named category1 (c1) and category2 (c2), both with data that range from 0 to 2, I want to aggregate them in some other column category (c), which can range from 0 to 5, representing all the possible categorical values combinations.
I would go from this:
d1 d2 c1 c2
1 1 NA 0
2 1 1 1
3 1 0 2
4 2 2 NA
5 1 NA NA
6 2 2 2
7 2 0 NA
8 2 0 2
To this:
d1 d2 c
1 1 0
2 1 1
3 1 2
4 2 3
5 1 4
6 2 5
7 2 6
8 2 2
I tried following this , but it didn't seem to work and threw some errors, namely ValueError: cannot reindex from a duplicate axis.
I appreciate in advance any help.
IIUC, you can use ngroup
with groupby
.
df['c'] = df.fillna(-1).groupby(['c1', 'c2']).ngroup()
The order might be arbitrary (ie not same as yours), but hopefully that's not important.
d1 d2 c
0 1 1 1
1 2 1 4
2 3 1 3
3 4 2 5
4 5 1 0
5 6 2 6
6 7 2 2
7 8 2 3
We can chain wide_to_long
+ drop_duplicates
Newdf=pd.wide_to_long(df,['c'],i=['d1','d2'],j='drop').dropna().reset_index(level=[0,1]).drop_duplicates()
Newdf
Out[53]:
d1 d2 c
drop
2 1 1 0.0
1 2 1 1.0
1 3 1 0.0
2 3 1 2.0
1 4 2 2.0
1 6 2 2.0
1 7 2 0.0
1 8 2 0.0
2 8 2 2.0
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.