
Pandas get_dummies on multiple columns

I have a dataset with multiple columns that I wish to one-hot encode. However, I don't want a separate encoding for each column, since the columns all refer to the same items. What I want is one "set" of dummy variables that uses all the columns. See my code for a better explanation.

Suppose my dataframe looks like this:

In [103]: dum = pd.DataFrame({'ch1': ['A', 'C', 'A'], 'ch2': ['B', 'G', 'F'], 'ch3': ['C', 'D', 'E']})

In [104]: dum
Out[104]:
 ch1 ch2 ch3
0   A   B   C
1   C   G   D
2   A   F   E

If I execute

pd.get_dummies(dum)

The output will be

   ch1_A  ch1_C  ch2_B  ch2_F  ch2_G  ch3_C  ch3_D  ch3_E
 0      1      0      1      0      0      1      0      0
 1      0      1      0      0      1      0      1      0
 2      1      0      0      1      0      0      0      1

However, what I would like to obtain is something like this:

 A B C D E F G
 1 1 1 0 0 0 0
 0 0 1 1 0 0 1
 1 0 0 0 1 1 0

Instead of having multiple columns representing the encoding, e.g. ch1_A and ch1_C, I only wish to have one group (A, B, and so on) with value 1 when any of the values in the columns ch1, ch2, ch3 show up.

To clarify, in my original dataset, a single row won't contain the same value (A, B, C, ...) more than once; it will appear in only one of the columns.

Using stack and str.get_dummies (on pandas 2.0 and later, sum(level=0) is no longer available, so group on the index level instead):

dum.stack().str.get_dummies().groupby(level=0).sum()
Out[938]: 
   A  B  C  D  E  F  G
0  1  1  1  0  0  0  0
1  0  0  1  1  0  0  1
2  1  0  0  0  1  1  0
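
To make the one-liner above easier to follow, here is a minimal sketch that splits it into its three steps. The intermediate names (stacked, indicators, result) are only illustrative, and using max() rather than sum() is an assumption on my part: it keeps the output at 0/1 even if a value were ever repeated within a row.

```python
import pandas as pd

dum = pd.DataFrame({'ch1': ['A', 'C', 'A'],
                    'ch2': ['B', 'G', 'F'],
                    'ch3': ['C', 'D', 'E']})

# stack() reshapes the frame into one long Series indexed by (row, column).
stacked = dum.stack()

# str.get_dummies() builds one indicator column per distinct value.
indicators = stacked.str.get_dummies()

# Grouping on the first index level collapses back to one row per original row.
# max() (instead of sum()) keeps the result 0/1 even if a value repeats in a row.
result = indicators.groupby(level=0).max()

print(result)
```

With the sample data, where no value repeats within a row, max() and sum() give the same table.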

You could use pd.crosstab to create a frequency table:

import pandas as pd

dum = pd.DataFrame({'ch1': ['A', 'C', 'A'], 'ch2': ['B', 'G', 'F'], 'ch3': ['C', 'D', 'E']})

stacked = dum.stack()
index = stacked.index.get_level_values(0)
result = pd.crosstab(index=index, columns=stacked)
result.index.name = None
result.columns.name = None

print(result)

yields

   A  B  C  D  E  F  G
0  1  1  1  0  0  0  0
1  0  0  1  1  0  0  1
2  1  0  0  0  1  1  0

Call it this way

x = pd.get_dummies(dum, prefix="", prefix_sep="")

And then print using

print(x.to_string(index=False))
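
One caveat with this approach: dropping the prefix only renames the columns, so a value that occurs in more than one source column (here C, from ch1 and ch3) ends up as two columns with the same label. A possible way to merge them, sketched below, is to transpose, group on the labels, and transpose back. The merging step is my own addition and not part of the answer above.

```python
import pandas as pd

dum = pd.DataFrame({'ch1': ['A', 'C', 'A'],
                    'ch2': ['B', 'G', 'F'],
                    'ch3': ['C', 'D', 'E']})

# Empty prefix and separator give bare value names as column labels.
x = pd.get_dummies(dum, prefix="", prefix_sep="")

# The labels are not unique here: 'C' comes from both ch1 and ch3.
print(list(x.columns))

# Merge columns that share a label: cast to int, transpose so the labels
# become the index, take the max per label, then transpose back.
result = x.astype(int).T.groupby(level=0).max().T

print(result.to_string(index=False))
```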

You can create dummies for the separate columns and concat the results:

temp = pd.concat([pd.get_dummies(dum[col]) for col in dum], axis=1)

    A   C   B   F   G   C   D   E
0   1   0   1   0   0   1   0   0
1   0   1   0   0   1   0   1   0
2   1   0   0   1   0   0   0   1

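Then sum the columns that share a label. The original form, temp.groupby(level=0, axis=1).sum(), relies on the axis=1 argument of groupby, which is deprecated since pandas 2.1; grouping the transposed frame is equivalent:

temp.T.groupby(level=0).sum().T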

    A   B   C   D   E   F   G
0   1   1   1   0   0   0   0
1   0   0   1   1   0   0   1
2   1   0   0   0   1   1   0
