One hot encoding categorical features to use as training data with numerical features in sklearn

I am trying to train a model that reads its training data from a CSV file. To do this, I am trying to one hot encode the categorical features and then pass the resulting arrays of 1s and 0s in as features, alongside the plain numerical features.

I have the following code:

import csv

import joblib
import numpy as np
import pandas as pd
from sklearn import linear_model, preprocessing

X = pd.read_csv('Data2Cut.csv')

# Select the categorical (object) columns and label-encode each of them
Y = X.select_dtypes(include=[object])
le = preprocessing.LabelEncoder()
Y_2 = Y.apply(le.fit_transform)

# One hot encode the label-encoded columns
enc = preprocessing.OneHotEncoder()
enc.fit(Y_2)
onehotlabels = enc.transform(Y_2).toarray()
onehotlabels.shape

features = []
labels = []
mycsv = csv.reader(open('Data2Cut.csv'))
indexCount = 0
for row in mycsv:
  if indexCount < 8426:
    # The one hot row is an array nested inside the feature list
    features.append([onehotlabels[indexCount], row[1], row[2], row[3], row[6], row[8], row[9], row[10], row[11]])
    labels.append(row[12])
    indexCount = indexCount + 1

training_data = np.array(features, dtype = 'float_')
training_labels = np.array(labels, dtype = 'float_')

log = linear_model.LogisticRegression()
log = log.fit(training_data, training_labels)
joblib.dump(log, "modelLogisticRegression.pkl")

It seems to get as far as the line:

training_data = np.array(features, dtype = 'float_')

before it crashes with the following error:

ValueError: setting an array element with a sequence.

I figure this happens because the one hot encoded values are arrays and not floats. How can I change this code so that it handles both the categorical and the numerical features as training data?

Edit: an example of a row I am feeding in, where each column is a feature:

mobile, 1498885897, 17491407,   23911,  west coast, 2,  seagull, 18,    41.0666666667,  [0.325, 0.35],  [u'text', u'font', u'writing', u'line'],    102, 5  
#...

You have probably found your answer already, but I am posting my findings here (I was struggling with the same problem) for anyone who has the same question. The way to achieve this is to append the columns of the resulting encoded sparse matrix to your training dataframe instead. E.g. (ignore the mistake in the price in the first row):

Source: https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f
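As a rough sketch of what appending the encoded columns to the training dataframe could look like: the toy DataFrame below and its column names (device, region, views, score) are made up for illustration and are not the asker's actual data. It also assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string columns directly.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the CSV; column names are made up.
df = pd.DataFrame({
    "device": ["mobile", "desktop", "mobile", "tablet"],
    "region": ["west coast", "east coast", "west coast", "midwest"],
    "views": [18, 7, 25, 3],
    "score": [41.07, 12.5, 33.0, 8.25],
})

cat_cols = ["device", "region"]

# Assumes scikit-learn >= 0.20, where string columns can be encoded
# without a LabelEncoder pass first.
enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(df[cat_cols]).toarray()

# Build one column name per category, e.g. "device_mobile".
encoded_names = [
    f"{col}_{cat}" for col, cats in zip(cat_cols, enc.categories_) for cat in cats
]
encoded_df = pd.DataFrame(encoded, columns=encoded_names, index=df.index)

# Drop the raw categorical columns and append the encoded ones, so every
# cell holds a single number instead of a nested array.
training_df = pd.concat([df.drop(columns=cat_cols), encoded_df], axis=1)
training_data = training_df.to_numpy(dtype=float)

print(training_df.columns.tolist())
print(training_data.shape)
```

Because each encoded category becomes its own flat column, the resulting array is purely numeric and can be passed to an estimator without the "setting an array element with a sequence" error.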

This is of course a practical solution if you do not have too many unique values in your categories. For cases where your categorical features can take many different values, you could look into more advanced encoding methods such as Backward Difference Coding or Polynomial Coding.

Which version of sklearn are you using?

I see that in sklearn version 0.18.1, passing 1d arrays as data is deprecated; it gives the warning below and does not produce the desired result.

DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)
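For reference, the two reshapes named in the warning behave as follows on a 1-D NumPy array. This is only a minimal sketch; the array contents are made-up label codes.

```python
import numpy as np

# 1-D array of made-up label codes, shape (4,)
x = np.array([2, 0, 1, 2])

# 4 samples with 1 feature each -> shape (4, 1)
as_one_feature = x.reshape(-1, 1)

# 1 sample with 4 features -> shape (1, 4)
as_one_sample = x.reshape(1, -1)

print(x.shape, as_one_feature.shape, as_one_sample.shape)
```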

Try replacing the following line of code

onehotlabels = enc.transform(Y_2).toarray()

with the one below

onehotlabels = enc.transform(Y_2.reshape((-1, 1))).toarray()

or you may use pd.get_dummies to get the one hot encoded feature matrix.
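A minimal sketch of the pd.get_dummies route, assuming a small made-up DataFrame (the column names device, views, score and label are hypothetical, not taken from the asker's CSV). It also assumes pandas 0.23 or later for the dtype argument.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "device": ["mobile", "desktop", "mobile", "tablet", "desktop", "mobile"],
    "views": [18, 7, 25, 3, 11, 30],
    "score": [41.07, 12.5, 33.0, 8.25, 15.0, 44.5],
    "label": [1, 0, 1, 0, 0, 1],
})

y = df["label"]

# get_dummies one hot encodes the listed columns and leaves the numeric
# columns untouched, so the result is a flat, purely numeric table.
X = pd.get_dummies(df.drop(columns="label"), columns=["device"], dtype=float)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

print(X.columns.tolist())
print(model.predict(X.iloc[:2]))
```

One caveat with this route: get_dummies derives the columns from whatever values are present, so data seen at prediction time needs to be reindexed to the training columns if its categories differ.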
