Pandas get_dummies on multiple columns
I have a dataset with multiple columns that I want to one-hot encode. However, I don't want a separate encoding for each column, because the columns all refer to the same set of items. What I want is a single set of dummy variables that draws on all the columns. The code below should explain it better.
Suppose my dataframe looks like this:
In [103]: dum = pd.DataFrame({'ch1': ['A', 'C', 'A'], 'ch2': ['B', 'G', 'F'], 'ch3': ['C', 'D', 'E']})
In [104]: dum
Out[104]:
ch1 ch2 ch3
0 A B C
1 C G D
2 A F E
If I execute
pd.get_dummies(dum)
The output will be
ch1_A ch1_C ch2_B ch2_F ch2_G ch3_C ch3_D ch3_E
0 1 0 1 0 0 1 0 0
1 0 1 0 0 1 0 1 0
2 1 0 0 1 0 0 0 1
However, what I would like to obtain is something like this:
A B C D E F G
1 1 1 0 0 0 0
0 0 1 1 0 0 1
1 0 0 0 1 1 0
Instead of having multiple columns representing the encoding, e.g. ch1_A and ch1_C, I only want one group (A, B, and so on) whose value is 1 when that value shows up in any of the columns ch1, ch2, ch3.
To clarify, in my original dataset a single row won't contain the same value (A, B, C, ...) more than once; each value appears in at most one of the columns.
Using stack and str.get_dummies:
# sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
dum.stack().str.get_dummies().groupby(level=0).sum()
Out[938]:
A B C D E F G
0 1 1 1 0 0 0 0
1 0 0 1 1 0 0 1
2 1 0 0 0 1 1 0
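For reference, here is a self-contained sketch of the same stack-based idea. It aggregates with max() rather than sum(), which is my own variation and not part of the answer above: the assumption is that you want strict 0/1 indicators even if a value were ever repeated within a row (the question says that does not happen, in which case both give the same result).

```python
import pandas as pd

dum = pd.DataFrame({'ch1': ['A', 'C', 'A'],
                    'ch2': ['B', 'G', 'F'],
                    'ch3': ['C', 'D', 'E']})

# stack() turns the frame into one long Series indexed by (row, column)
stacked = dum.stack()

# one indicator row per original cell, one column per distinct value
dummies = stacked.str.get_dummies()

# collapse back to one row per original row; max() keeps the values at 0/1
result = dummies.groupby(level=0).max()
print(result)
```

With sum() in place of max(), a repeated value would show up as a count greater than 1.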
You could use pd.crosstab to create a frequency table:
import pandas as pd
dum = pd.DataFrame({'ch1': ['A', 'C', 'A'], 'ch2': ['B', 'G', 'F'], 'ch3': ['C', 'D', 'E']})
stacked = dum.stack()
index = stacked.index.get_level_values(0)
result = pd.crosstab(index=index, columns=stacked)
result.index.name = None
result.columns.name = None
print(result)
yields
A B C D E F G
0 1 1 1 0 0 0 0
1 0 0 1 1 0 0 1
2 1 0 0 0 1 1 0
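Because crosstab counts occurrences, a value that appeared twice in one row would produce a 2 rather than a 1. The question rules that case out, but if your data might contain repeats, a sketch along these lines caps the counts. The frame `rep` is a made-up example, and the use of clip is my own suggestion rather than part of the answer.

```python
import pandas as pd

# Hypothetical frame in which row 0 contains 'A' twice
rep = pd.DataFrame({'ch1': ['A', 'C'], 'ch2': ['A', 'G']})

stacked = rep.stack()
index = stacked.index.get_level_values(0)

# raw frequency table: row 0 has a count of 2 under 'A'
counts = pd.crosstab(index=index, columns=stacked)

# cap every count at 1 to get plain indicators
indicators = counts.clip(upper=1)
print(indicators)
```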
Call it this way
x = pd.get_dummies(dum, prefix="", prefix_sep="")
And then print using
print(x.to_string(index=False))
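Note that with an empty prefix the columns are only renamed, not merged: on the example frame, C comes out twice (once from ch1, once from ch3). If a single column per value is needed, one possible follow-up, offered as a sketch and not as part of the answer above, is to group the duplicate labels after transposing:

```python
import pandas as pd

dum = pd.DataFrame({'ch1': ['A', 'C', 'A'],
                    'ch2': ['B', 'G', 'F'],
                    'ch3': ['C', 'D', 'E']})

x = pd.get_dummies(dum, prefix="", prefix_sep="")
# the label 'C' appears twice here, once per source column
print(list(x.columns))

# cast to int, transpose so the labels become the index,
# merge equal labels with max(), then transpose back
merged = x.astype(int).T.groupby(level=0).max().T
print(merged)
```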
You can create dummies for each column separately and concat the results:
temp = pd.concat([pd.get_dummies(dum[col]) for col in dum], axis=1)
A C B F G C D E
0 1 0 1 0 0 1 0 0
1 0 1 0 0 1 0 1 0
2 1 0 0 1 0 0 0 1
# groupby(..., axis=1) is deprecated in recent pandas; grouping the transposed frame is equivalent
temp.T.groupby(level=0).sum().T
A B C D E F G
0 1 1 1 0 0 0 0
1 0 0 1 1 0 0 1
2 1 0 0 0 1 1 0