
What is the fastest way to do inverse multi-hot encoding in pandas?

What is the fastest way to perform an inverse "multi-hot" (like one-hot, but with multiple simultaneous categories) operation on a large DataFrame?

I have the following DataFrame:

id  type_A  type_B  type_C
 1       1       1       0
 2       0       1       0
 3       0       1       1

The operation would give:

id   type
 1 type_A
 1 type_B
 2 type_B
 3 type_B
 3 type_C
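
To try the answers below, the example frame can be rebuilt along these lines. This is only a minimal sketch: the variable name df matches what the answers use, and the integer dtype of the 0/1 columns is an assumption, since the question shows only the printed frame.

```python
import pandas as pd

# Sample multi-hot frame from the question: one row per id,
# one 0/1 indicator column per type.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'type_A': [1, 0, 0],
    'type_B': [1, 1, 1],
    'type_C': [0, 0, 1],
})
print(df)
```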

Using melt and query:

df = df.melt(id_vars='id', value_vars=['type_A', 'type_B', 'type_C']).query('value == 1')

   id variable  value
0   1   type_A      1
3   1   type_B      1
4   2   type_B      1
5   3   type_B      1
8   3   type_C      1

With correct column names:

df = (
    df.melt(id_vars='id', 
            value_vars=['type_A', 'type_B', 'type_C'],
            var_name='type')
      .query('value == 1')
      .drop(columns='value')
)

   id    type
0   1  type_A
3   1  type_B
4   2  type_B
5   3  type_B
8   3  type_C

melt is the usual way to achieve this:

yourdf = df.melt('id').loc[lambda x: x['value'] == 1]

   id variable  value
0   1   type_A      1
3   1   type_B      1
4   2   type_B      1
5   3   type_B      1
8   3   type_C      1

Here is a solution with .dot, which uses matrix multiplication of the 0/1 values against the column names, combined with Series.explode(), which was added in pandas 0.25:

m = df.set_index('id')
m.dot(m.columns+',').str.rstrip(',').str.split(',').explode().reset_index(name='type')

   id    type
0   1  type_A
1   1  type_B
2   2  type_B
3   3  type_B
4   3  type_C
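
The .dot trick works because of how Python combines integers and strings: multiplying a string by 1 keeps it, multiplying by 0 gives an empty string, and adding strings concatenates them. The plain-Python sketch below mimics what happens for a single row; row_dot is a hypothetical helper written for illustration, not something pandas provides.

```python
def row_dot(flags, labels):
    """Mimic one row of the dot product: sum of flag * label over the columns."""
    result = ''
    for flag, label in zip(flags, labels):
        # int * str repeats the string: 1 keeps the label, 0 gives ''
        result += flag * label
    return result


labels = ['type_A,', 'type_B,', 'type_C,']  # column names with a trailing comma

joined = row_dot([1, 1, 0], labels)         # the row for id 1
print(joined)                               # type_A,type_B,
print(joined.rstrip(',').split(','))        # ['type_A', 'type_B']

# A row with no flags set collapses to an empty string,
# and splitting an empty string yields a single empty label.
empty = row_dot([0, 0, 0], labels)
print(empty.rstrip(',').split(','))         # ['']
```

The last case suggests that an id with no type set would still produce one output row with an empty type in the pandas version, so it may need filtering out afterwards.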

Use:

new_df = (df.set_index('id')
            .where(lambda x: x.eq(1))
            .stack()
            .rename_axis(['id','type'])
            .reset_index()[['id','type']] )
print(new_df)
   id    type
0   1  type_A
1   1  type_B
2   2  type_B
3   3  type_B
4   3  type_C
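
The chain above relies on stack silently dropping the NaN entries. pandas 2.1 introduced a future_stack option under which stack keeps them, so the following sketch adds an explicit dropna to avoid depending on either behavior. The sample frame is rebuilt here as an assumption so the block runs on its own.

```python
import pandas as pd

# Sample frame from the question (rebuilt here; dtypes are an assumption).
df = pd.DataFrame({
    'id': [1, 2, 3],
    'type_A': [1, 0, 0],
    'type_B': [1, 1, 1],
    'type_C': [0, 0, 1],
})

stacked = (df.set_index('id')
             .where(lambda x: x.eq(1))  # keep the 1s, turn everything else into NaN
             .stack()
             .dropna())                 # explicit: do not rely on stack dropping NaN

new_df = (stacked.rename_axis(['id', 'type'])
                 .reset_index()[['id', 'type']])
print(new_df)
```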

df.melt(id_vars='id').query('value == 1').drop(columns='value').rename(columns={'variable': 'type'})

Desired result:

    id  type
0   1   type_A
3   1   type_B
4   2   type_B
5   3   type_B
8   3   type_C

You can replace all zeros with NaN and then stack. With the legacy default behavior of stack, all NaN values are dropped. Then you can take the resulting MultiIndex and convert it into a DataFrame:

import numpy as np

df = df.set_index('id')  # set 'id' as the index if necessary

df.replace(0, np.nan).stack().index.to_frame(index=False, name=['id', 'type'])

Output:

   id    type
0   1  type_A
1   1  type_B
2   2  type_B
3   3  type_B
4   3  type_C
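
Since the question asks for the fastest approach and none of the answers include measurements, one option is to time the candidates on data shaped like your own. The sketch below compares the melt approach with a NumPy-based alternative built on np.nonzero. The NumPy variant is not taken from the answers above, both function names are hypothetical, and the synthetic data size and 30% fill rate are arbitrary assumptions.

```python
import timeit

import numpy as np
import pandas as pd


def inverse_multi_hot_melt(df, id_col='id'):
    """melt-based approach, as in the answers above."""
    return (df.melt(id_vars=id_col, var_name='type')
              .query('value == 1')
              .drop(columns='value'))


def inverse_multi_hot_numpy(df, id_col='id'):
    """NumPy alternative (not from the answers): locate the 1s directly."""
    flags = df.drop(columns=id_col)
    # Row and column positions of every cell equal to 1, in row-major order.
    rows, cols = np.nonzero(flags.to_numpy() == 1)
    return pd.DataFrame({id_col: df[id_col].to_numpy()[rows],
                         'type': flags.columns.to_numpy()[cols]})


# Synthetic test data; the sizes and fill rate are arbitrary assumptions.
rng = np.random.default_rng(0)
n_rows, n_types = 10_000, 20
big = pd.DataFrame((rng.random((n_rows, n_types)) < 0.3).astype(int),
                   columns=[f'type_{i}' for i in range(n_types)])
big.insert(0, 'id', np.arange(n_rows))

for func in (inverse_multi_hot_melt, inverse_multi_hot_numpy):
    seconds = timeit.timeit(lambda: func(big), number=5)
    print(f'{func.__name__}: {seconds / 5:.4f} s per call')
```

Note that the two functions return the same (id, type) pairs in a different order: melt walks the frame column by column, while np.nonzero walks it row by row. Timings depend on the pandas version, the number of type columns, and how densely they are filled, so the comparison is worth rerunning on realistic data.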
