What is the fastest way to do inverse multi-hot encoding in pandas?
What is the fastest way to perform an inverse "multi-hot" operation (like one-hot, but with multiple simultaneous categories) on a large DataFrame?
I have the following DataFrame:
id type_A type_B type_C
1 1 1 0
2 0 1 0
3 0 1 1
The operation would give:
id type
1 type_A
1 type_B
2 type_B
3 type_B
3 type_C
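For anyone who wants to try the answers below, here is a minimal sketch that rebuilds the example frame and the desired result. The variable name expected is only an illustration and does not appear in the original question.

```python
import pandas as pd

# Input from the question: one row per id, one 0/1 column per category.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "type_A": [1, 0, 0],
    "type_B": [1, 1, 1],
    "type_C": [0, 0, 1],
})

# Desired long-format result: one row per (id, active category) pair.
# "expected" is a hypothetical name used only for this illustration.
expected = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "type": ["type_A", "type_B", "type_B", "type_B", "type_C"],
})

print(df)
print(expected)
```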
Using melt and query:
df = df.melt(id_vars='id', value_vars=['type_A', 'type_B', 'type_C']).query('value == 1')
id variable value
0 1 type_A 1
3 1 type_B 1
4 2 type_B 1
5 3 type_B 1
8 3 type_C 1
With correct column names:
df = (
    df.melt(id_vars='id',
            value_vars=['type_A', 'type_B', 'type_C'],
            var_name='type')
      .query('value == 1')
      .drop(columns='value')
)
id type
0 1 type_A
3 1 type_B
4 2 type_B
5 3 type_B
8 3 type_C
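One thing to keep in mind: melt stacks the frame column by column, so the filtered rows come out grouped by type, not by id. With the question's data the ids happen to come out in order anyway, but that is a coincidence. Below is a small sketch, using made-up data that differs from the question on purpose, that sorts the result and resets the leftover melt index.

```python
import pandas as pd

# Hypothetical data (NOT the question's frame), chosen so that the
# melt output is visibly out of id order.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "type_A": [0, 1, 0],
    "type_B": [1, 1, 0],
    "type_C": [0, 0, 1],
})

melted = (
    df.melt(id_vars="id", var_name="type")
      .query("value == 1")
      .drop(columns="value")
)
# Here the ids come out as 2, 1, 2, 3: grouped by column, not by id.

result = (
    melted.sort_values(["id", "type"])
          .reset_index(drop=True)  # discard the gappy index left by melt
)
print(result)
```

If row order does not matter for what you do next, skipping the sort saves some work on a large frame.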
melt should be the normal way to achieve this:
yourdf=df.melt('id').loc[lambda x : x['value']==1]
id variable value
0 1 type_A 1
3 1 type_B 1
4 2 type_B 1
5 3 type_B 1
8 3 type_C 1
Here is a solution with .dot, which uses matrix multiplication of the 0/1 values with the column names, helped by Series.explode(), which is new in pandas 0.25+:
m = df.set_index('id')
m.dot(m.columns+',').str.rstrip(',').str.split(',').explode().reset_index(name='type')
id type
0 1 type_A
1 1 type_B
2 2 type_B
3 3 type_B
4 3 type_C
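In case it is not obvious why multiplying by column names works: for each row, the dot product multiplies every 0/1 flag by the matching label and adds the pieces up, and for Python strings that means repetition and concatenation. Here is a plain-Python sketch of what happens for a single row (the variable names are just for illustration):

```python
# One row of the indicator matrix and the comma-suffixed column labels.
flags = [1, 1, 0]
labels = ["type_A,", "type_B,", "type_C,"]

# Dot product by hand: int * str repeats the string (0 gives ""),
# and "adding" the terms concatenates them.
joined = ""
for flag, label in zip(flags, labels):
    joined += flag * label

# Same cleanup as in the answer: drop the trailing comma, then split.
types = joined.rstrip(",").split(",")
print(joined)  # type_A,type_B,
print(types)   # ['type_A', 'type_B']
```

explode() then turns each of those lists into one row per element.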
Use:
new_df = (df.set_index('id')
            .where(lambda x: x.eq(1))
            .stack()
            .rename_axis(['id', 'type'])
            .reset_index()[['id', 'type']])
print(new_df)
id type
0 1 type_A
1 1 type_B
2 2 type_B
3 3 type_B
4 3 type_C
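A caveat on this approach: it relies on stack() silently dropping the NaN cells. The newer stack implementation (opt-in via future_stack=True since pandas 2.1) keeps them, so the result would contain every id/type pair. A variant with an explicit dropna() should behave the same either way. This is a sketch under that assumption, not a tested recommendation for every pandas version:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "type_A": [1, 0, 0],
    "type_B": [1, 1, 1],
    "type_C": [0, 0, 1],
})

stacked = (
    df.set_index("id")
      .where(lambda x: x.eq(1))  # cells that are not 1 become NaN
      .stack()
      .dropna()                  # explicit, so it does not depend on stack() dropping NaN
)

new_df = (
    stacked.rename_axis(["id", "type"])
           .reset_index()[["id", "type"]]
)
print(new_df)
```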
df.melt(id_vars='id').query('value == 1').drop(columns='value').rename(columns={"variable": "type"})
Desired result:
id type
0 1 type_A
3 1 type_B
4 2 type_B
5 3 type_B
8 3 type_C
You can replace all zeros with NaN and stack. By stacking, all NaN values are dropped (this is the behavior of the classic stack implementation; with future_stack=True in pandas 2.1+, NaN values are kept and an explicit dropna() is needed). Then you can get the MultiIndex and convert it into a data frame:
import numpy as np

df = df.set_index('id')  # set 'id' as the index if necessary
df.replace(0, np.nan).stack().index.to_frame(index=False, name=['id', 'type'])
Output:
id type
0 1 type_A
1 1 type_B
2 2 type_B
3 3 type_B
4 3 type_C
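Since the question asks for the fastest way and none of the answers include timings, it is probably worth measuring on your own data. Below is a rough timeit sketch comparing a melt-based and a stack-based version. The helper names via_melt and via_stack, the random test frame, and its size are all assumptions made for illustration. No timings are quoted here because they depend on the pandas version and the shape of the data.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test frame: 10,000 ids, three random 0/1 category columns.
rng = np.random.default_rng(0)
n = 10_000
big = pd.DataFrame(
    rng.integers(0, 2, size=(n, 3)),
    columns=["type_A", "type_B", "type_C"],
)
big.insert(0, "id", np.arange(n))


def via_melt(frame):
    """melt + query, as in the first answers."""
    return (
        frame.melt(id_vars="id", var_name="type")
             .query("value == 1")
             .drop(columns="value")
    )


def via_stack(frame):
    """replace + stack, with an explicit dropna() for newer pandas."""
    return (
        frame.set_index("id")
             .replace(0, np.nan)
             .stack()
             .dropna()
             .index.to_frame(index=False, name=["id", "type"])
    )


t_melt = timeit.timeit(lambda: via_melt(big), number=3)
t_stack = timeit.timeit(lambda: via_stack(big), number=3)
print(f"melt:  {t_melt:.4f} s for 3 runs")
print(f"stack: {t_stack:.4f} s for 3 runs")
```

Both helpers return the same set of (id, type) pairs, only in a different row order, so the comparison is like for like.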