How to perform OneHotEncoding in Sklearn, getting value error
I just started learning machine learning. While practicing one of the tasks I am getting a ValueError, even though I followed the same steps as the instructor. Please help.
dff
  Country     Name
0     AUS      Sri
1     USA  Vignesh
2     IND    Pechi
3     USA      Raj
First I performed label encoding:
X=dff.values
label_encoder=LabelEncoder()
X[:,0]=label_encoder.fit_transform(X[:,0])
out:
X
array([[0, 'Sri'],
[2, 'Vignesh'],
[1, 'Pechi'],
[2, 'Raj']], dtype=object)
Then I performed one-hot encoding on the same X:
onehotencoder=OneHotEncoder( categorical_features=[0])
X=onehotencoder.fit_transform(X).toarray()
I am getting the error below:
ValueError Traceback (most recent call last)
<ipython-input-472-be8c3472db63> in <module>()
----> 1 X=onehotencoder.fit_transform(X).toarray()
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in fit_transform(self, X, y)
1900 """
1901 return _transform_selected(X, self._fit_transform,
-> 1902 self.categorical_features, copy=True)
1903
1904 def _transform(self, X):
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in _transform_selected(X, transform, selected, copy)
1695 X : array or sparse matrix, shape=(n_samples, n_features_new)
1696 """
-> 1697 X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
1698
1699 if isinstance(selected, six.string_types) and selected == "all":
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
380 force_all_finite)
381 else:
--> 382 array = np.array(array, dtype=dtype, order=order, copy=copy)
383
384 if ensure_2d:
ValueError: could not convert string to float: 'Raj'
Please edit my question if anything is wrong. Thanks in advance!
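The traceback points at check_array converting the whole of X to a float dtype, and the Name column still holds strings. A minimal sketch of that failure using only NumPy (the array literal copies the label-encoded X from the question; the variable names are illustrative):

```python
import numpy as np

# Label-encoded X from the question: column 0 is numeric, column 1 is still text.
X = np.array([[0, 'Sri'], [2, 'Vignesh'], [1, 'Pechi'], [2, 'Raj']], dtype=object)

conversion_failed = False
try:
    # Roughly what the old encoder attempted: convert every column to float.
    np.array(X, dtype=float)
except ValueError as exc:
    conversion_failed = True
    print("conversion failed:", exc)

# Converting only the encoded column works, which is why the answers
# encode the Country column on its own or let the encoder handle strings.
codes = X[:, [0]].astype(float)
print(codes.ravel())
```

The old categorical_features argument selected which columns to encode, but judging by the traceback the float conversion was applied to the full array first.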
You can now go directly to OneHotEncoder without using LabelEncoder, and as we move toward scikit-learn version 0.22 many might want to do things this way to avoid warnings and potential errors (see DOCS and EXAMPLES).
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]
df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values
countries = np.unique(X[:,0])
names = np.unique(X[:,1])
ohe = OneHotEncoder(categories=[countries, names])
X = ohe.fit_transform(X).toarray()
print (X)
[[1. 0. 0. 0. 0. 1. 0.]
[0. 0. 1. 0. 0. 0. 1.]
[0. 1. 0. 1. 0. 0. 0.]
[0. 0. 1. 0. 1. 0. 0.]]
The first 3 columns encode the country names, the last four the personal names.
The same result is obtained when the encoder infers the categories itself via categories='auto':
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]
df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values
ohe = OneHotEncoder(categories='auto')
X = ohe.fit_transform(X).toarray()
print (X)
[[1. 0. 0. 0. 0. 1. 0.]
[0. 0. 1. 0. 0. 0. 1.]
[0. 1. 0. 1. 0. 0. 0.]
[0. 0. 1. 0. 1. 0. 0.]]
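To check which output column corresponds to which value, the fitted encoder can be inspected. This is a small sketch assuming scikit-learn 0.20 or later, where the categories_ attribute and inverse_transform are available; the names encoded and restored are illustrative:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]],
             dtype=object)

ohe = OneHotEncoder(categories='auto')
encoded = ohe.fit_transform(X).toarray()

# categories_ holds one sorted array per input column; their order
# is the order of the one-hot columns in the output.
for column, cats in zip(['Country', 'Name'], ohe.categories_):
    print(column, list(cats))

# inverse_transform maps the one-hot rows back to the original values.
restored = ohe.inverse_transform(encoded)
print(restored)
```

The categories are sorted, which is why AUS, IND, USA occupy the first three columns regardless of the order in which they appear in the data.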
Now, here's the unique part. What if you only need to one-hot encode a specific column of your data?
(Note: I've left the last column as strings for easier illustration. In reality it makes more sense to do this when the last column is already numerical.)
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]
df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values
countries = np.unique(X[:,0])
ohe = OneHotEncoder(categories=[countries]) # specify ONLY unique country names
tmp = ohe.fit_transform(X[:,0].reshape(-1, 1)).toarray()
# append the original Name column so each name stays on its own row
# (np.unique(X[:,1]) would sort the names and misalign them with the countries)
X = np.append(tmp, X[:,1].reshape(-1,1), axis=1)
print (X)
[[1.0 0.0 0.0 'Sri']
[0.0 0.0 1.0 'Vignesh']
[0.0 1.0 0.0 'Pechi']
[0.0 0.0 1.0 'Raj']]
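Another way to encode only one column, sketched here under the assumption that scikit-learn 0.20 or later is installed, is ColumnTransformer from sklearn.compose. The transformer name 'country' and the sparse_threshold=0 setting (used to force a dense result) are choices made for this illustration:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

X = np.array([["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]],
             dtype=object)

ct = ColumnTransformer(
    [('country', OneHotEncoder(), [0])],  # one-hot encode column 0 only
    remainder='passthrough',              # keep the Name column untouched
    sparse_threshold=0,                   # always return a dense array
)
result = ct.fit_transform(X)
print(result)
```

The encoded columns come first and the passed-through columns are appended after them, so the row alignment between country and name is preserved.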
The implementation below should work well. Note that the input to OneHotEncoder's fit_transform must not be a 1-D (rank-1) array. The output is also sparse, so we use toarray() to expand it.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]
df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values
le = LabelEncoder()
X_num = le.fit_transform(X[:,0]).reshape(-1,1)
ohe = OneHotEncoder()
X_num = ohe.fit_transform(X_num)
print (X_num.toarray())
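# stack the dense one-hot columns with the untouched Name column
# (assigning the sparse matrix to X[:,0] would not put the encoded values into X)
X = np.hstack([X_num.toarray(), X[:,1:]])
print (X)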
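The two points from the note above (2-D input, sparse output) can be checked in isolation. This is a sketch assuming a reasonably recent scikit-learn; the codes array simply repeats the label-encoded countries from the question:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

codes = np.array([0, 2, 1, 2])  # label-encoded Country column, 1-D

ohe = OneHotEncoder()
rejected = False
try:
    ohe.fit_transform(codes)  # 1-D input is not accepted
except ValueError as exc:
    rejected = True
    print("1-D input rejected:", exc)

# Reshape to a single column (n_samples, 1) before encoding.
encoded = ohe.fit_transform(codes.reshape(-1, 1))
print(sparse.issparse(encoded))  # the default output is a SciPy sparse matrix

dense = encoded.toarray()        # expand to a regular NumPy array
print(dense)
```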
An alternative, if you do want to encode multiple categorical features, is to use a Pipeline with a FeatureUnion and a couple of custom transformers.
First we need two transformers: one for selecting a single column and one for making LabelEncoder usable in a Pipeline (its fit_transform method only takes X, but it needs to accept an optional y to work in a Pipeline).
from sklearn.base import BaseEstimator, TransformerMixin
class SingleColumnSelector(TransformerMixin, BaseEstimator):
    def __init__(self, column):
        self.column = column
    def transform(self, X, y=None):
        return X[:, self.column].reshape(-1, 1)
    def fit(self, X, y=None):
        return self
class PipelineAwareLabelEncoder(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return LabelEncoder().fit_transform(X).reshape(-1, 1)
Next create a Pipeline (or just a FeatureUnion) which has 2 branches, one for each of the categorical columns. Within each branch, select one column, encode the labels, and then one-hot encode.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
pipeline = Pipeline([(
    'encoded_features',
    FeatureUnion([
        ('countries', make_pipeline(
            SingleColumnSelector(0),
            PipelineAwareLabelEncoder(),
            OneHotEncoder()
        )),
        ('names', make_pipeline(
            SingleColumnSelector(1),
            PipelineAwareLabelEncoder(),
            OneHotEncoder()
        ))
    ])
)])
Finally, run your full dataframe through the Pipeline. It will one-hot encode each column separately and concatenate the results at the end.
df = pd.DataFrame([["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]], columns=['Country', 'Name'])
X = df.values
transformed_X = pipeline.fit_transform(X)
print(transformed_X.toarray())
This returns the following (the first 3 columns are the countries, the next 4 are the names):
[[ 1. 0. 0. 0. 0. 1. 0.]
[ 0. 0. 1. 0. 0. 0. 1.]
[ 0. 1. 0. 1. 0. 0. 0.]
[ 0. 0. 1. 0. 1. 0. 0.]]
To make a long story short, if you are looking to dummify your df, use pd.get_dummies as:
dummy=pd.get_dummies(df['str'])  # 'str' presumably stands in for the name of the column to encode
df=pd.concat([df,dummy], axis=1)
print(df)
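Applied to the example data from the question, this could look as follows. The choice of the Country column and the optional prefix argument are assumptions made for illustration:

```python
import pandas as pd

data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]
df = pd.DataFrame(data, columns=['Country', 'Name'])

# One indicator column per distinct country, named Country_<value>.
dummy = pd.get_dummies(df['Country'], prefix='Country')

# Keep the original columns and append the indicator columns.
result = pd.concat([df, dummy], axis=1)
print(result)
```

Depending on the pandas version the indicator columns are boolean or integer typed, so cast them with astype(int) if 0/1 values are required.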