How to fix the problem in pandas.get_dummies
I'm preprocessing my dataset with pd.get_dummies, but the result is not what I need.
Is pd.get_dummies() the right tool here, or is there another approach I can try?
import pandas as pd
rawdataset=[['apple','banana','carrot','daikon','egg'],
['apple','banana'],
['apple','banana','carrot'],
['daikon','egg','fennel'],
['apple','banana','daikon']]
dataset=pd.DataFrame(data=rawdataset)
print(pd.get_dummies(dataset))
I expect it to look like this:
apple banana carrot daikon egg fennel
0 1 1 1 1 1 0
1 1 1 0 0 0 0
........
not like this:
0_apple 0_daikon 1_banana 1_egg 2_carrot 2_daikon 2_fennel
0 1 0 1 0 1 0 0
1 1 0 1 0 0 0 0
....
There are different ways to skin a cat.
pd.get_dummies and max
pd.get_dummies(dataset, prefix="", prefix_sep="").max(level=0, axis=1)
apple daikon banana egg carrot fennel
0 1 1 1 1 1 0
1 1 0 1 0 0 0
2 1 0 1 0 1 0
3 0 1 0 1 0 1
4 1 1 1 0 0 0
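On recent pandas the call above may fail, because the level argument of max was removed in pandas 2.0. Below is a minimal sketch of the same idea, assuming that grouping the transposed frame by column name is an acceptable substitute. The dtype=int argument and the names dummies and encoded are choices made for this illustration, not part of the original answer.

```python
import pandas as pd

rawdataset = [['apple', 'banana', 'carrot', 'daikon', 'egg'],
              ['apple', 'banana'],
              ['apple', 'banana', 'carrot'],
              ['daikon', 'egg', 'fennel'],
              ['apple', 'banana', 'daikon']]
dataset = pd.DataFrame(data=rawdataset)

# One indicator column per (position, item) pair. With an empty prefix the
# same item name can show up as several columns (e.g. 'daikon' three times).
dummies = pd.get_dummies(dataset, prefix="", prefix_sep="", dtype=int)

# Collapse the duplicate column names: transpose so the names become the
# index, group by name, take the max, and transpose back.
encoded = dummies.T.groupby(level=0).max().T
print(encoded)
```

Because groupby sorts its keys, the columns come out in alphabetical order rather than in the order shown above.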
stack, str.get_dummies, and sum / max:
dataset.stack().str.get_dummies().groupby(level=0).sum()
apple banana carrot daikon egg fennel
0 1 1 1 1 1 0
1 1 1 0 0 0 0
2 1 1 1 0 0 0
3 0 0 0 1 1 1
4 1 1 0 1 0 0
stack and crosstab
u = dataset.stack()
pd.crosstab(u.index.get_level_values(0), u)
col_0 apple banana carrot daikon egg fennel
row_0
0 1 1 1 1 1 0
1 1 1 0 0 0 0
2 1 1 1 0 0 0
3 0 0 0 1 1 1
4 1 1 0 1 0 0
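Note that crosstab counts occurrences, so a row that lists the same item twice would get a 2 rather than a 1. The following sketch, on a small made-up dataset (rows is hypothetical, chosen to include a repeated item), shows one way to cap the counts at 1 and drop the row_0 / col_0 axis labels. The explicit dropna() is a precaution for pandas versions whose stack keeps missing values.

```python
import pandas as pd

# Hypothetical data: 'apple' is repeated in the first row on purpose.
rows = [['apple', 'banana', 'apple'],
        ['banana'],
        ['carrot', 'apple']]
df = pd.DataFrame(rows)

# Long format: one entry per (row, position), padding values removed.
u = df.stack().dropna()

# Count how often each item occurs in each original row.
counts = pd.crosstab(u.index.get_level_values(0), u)

# Cap counts at 1 to get indicators and remove the axis names.
indicators = counts.clip(upper=1).rename_axis(index=None, columns=None)
print(indicators)
```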
Here you go:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
rawdataset=[['apple','banana','carrot','daikon','egg'],
['apple','banana'],
['apple','banana','carrot'],
['daikon','egg','fennel'],
['apple','banana','daikon']]
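def dummy(doc):
    # Identity function: each document is already a list of tokens.
    return doc
count_vec = CountVectorizer(tokenizer=dummy, preprocessor=dummy)
count_vec.fit(rawdataset)
X = count_vec.transform(rawdataset).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
pd.DataFrame(X, columns=count_vec.get_feature_names_out())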
Results in:
apple banana carrot daikon egg fennel
0 1 1 1 1 1 0
1 1 1 0 0 0 0
2 1 1 1 0 0 0
3 0 0 0 1 1 1
4 1 1 0 1 0 0
The added benefit here is that you can also apply the fitted vectorizer to unseen data, whereas pd.get_dummies cannot convert unseen test data in the same way.
Try:
unseen_raw_data = [["test"]]
feature_names = count_vec.get_feature_names_out()
unseen_data = count_vec.transform(unseen_raw_data).toarray()
pd.DataFrame(unseen_data, columns=feature_names)
yields:
apple banana carrot daikon egg fennel
0 0 0 0 0 0 0
which is the correct output.
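For comparison, a pandas-only encoding can be aligned to the training columns after the fact by reindexing. This is a rough sketch rather than part of the answer above: encode_rows is a hypothetical helper, and it assumes every row contains at least one item (an empty row would be missing from the result).

```python
import pandas as pd

def encode_rows(rows, columns=None):
    """One-hot encode lists of items (hypothetical helper, not a pandas API)."""
    stacked = pd.DataFrame(rows).stack().dropna()
    encoded = stacked.str.get_dummies().groupby(level=0).max()
    if columns is not None:
        # Keep only the known columns, in training order. Items never seen in
        # training are dropped; training columns absent here are filled with 0.
        encoded = encoded.reindex(columns=columns, fill_value=0)
    return encoded

rawdataset = [['apple', 'banana', 'carrot', 'daikon', 'egg'],
              ['apple', 'banana'],
              ['apple', 'banana', 'carrot'],
              ['daikon', 'egg', 'fennel'],
              ['apple', 'banana', 'daikon']]

train = encode_rows(rawdataset)
unseen = encode_rows([['test', 'apple']], columns=train.columns)
print(unseen)
```

The design difference is that the column list has to be carried around by hand here, while the fitted CountVectorizer stores its vocabulary itself.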