One hot encoding categorical features to use as training data with numerical features in sklearn
I am trying to train a model that reads its training data from a CSV file. To do this, I want to one-hot encode the categorical features and then pass the resulting arrays of 1s and 0s in as features, alongside the plain numerical features.
I have the following code:
import csv
import joblib
import numpy as np
import pandas as pd
from sklearn import linear_model, preprocessing

X = pd.read_csv('Data2Cut.csv')
Y = X.select_dtypes(include=[object])
le = preprocessing.LabelEncoder()
Y_2 = Y.apply(le.fit_transform)
enc = preprocessing.OneHotEncoder()
enc.fit(Y_2)
onehotlabels = enc.transform(Y_2).toarray()
onehotlabels.shape
features = []
labels = []
mycsv = csv.reader(open('Data2Cut.csv'))
indexCount = 0
for row in mycsv:
    if indexCount < 8426:
        features.append([onehotlabels[indexCount], row[1], row[2], row[3], row[6], row[8], row[9], row[10], row[11]])
        labels.append(row[12])
        indexCount = indexCount + 1
training_data = np.array(features, dtype = 'float_')
training_labels = np.array(labels, dtype = 'float_')
log = linear_model.LogisticRegression()
log = log.fit(training_data, training_labels)
joblib.dump(log, "modelLogisticRegression.pkl")
It seems to get as far as the line:
training_data = np.array(features, dtype = 'float_')
before it crashes with the following error:
ValueError: setting an array element with a sequence.
I figure this is because the one-hot encoded values are arrays and not floats. How can I change this code to handle both the categorical and the numerical features as training data?
Edit: an example of a row I am feeding in, where each column is a feature:
mobile, 1498885897, 17491407, 23911, west coast, 2, seagull, 18, 41.0666666667, [0.325, 0.35], [u'text', u'font', u'writing', u'line'], 102, 5
#...
You have probably found your answer already, but I am posting my findings here (I struggled with the same problem) for anyone who has the same question. The way to achieve this is to append the columns of the resulting encoded sparse matrix to your training dataframe instead, so that each encoded category becomes a column of its own.
E.g. (ignore the mistake in the price in the first row):
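The same idea can be sketched in code. This is only a minimal sketch on made-up data: the column names (`device`, `region`, `visits`, `duration`) are hypothetical and not taken from the question's CSV, and it assumes a scikit-learn version whose `OneHotEncoder` accepts string columns directly, which makes the `LabelEncoder` step unnecessary.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data; the column names are hypothetical, not from the question's CSV.
df = pd.DataFrame({
    "device": ["mobile", "desktop", "mobile", "tablet"],
    "region": ["west coast", "east coast", "west coast", "midwest"],
    "visits": [2, 5, 1, 3],
    "duration": [41.07, 12.5, 30.0, 8.25],
})

categorical_cols = ["device", "region"]
numeric_cols = ["visits", "duration"]

# Encode the string columns; the result is a sparse matrix, so densify it.
enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(df[categorical_cols]).toarray()

# Build one column name per (feature, category) pair.
encoded_names = [
    f"{col}_{cat}"
    for col, cats in zip(categorical_cols, enc.categories_)
    for cat in cats
]
encoded_df = pd.DataFrame(encoded, columns=encoded_names, index=df.index)

# Append the encoded columns to the numeric ones. Every cell is now a
# scalar, so the conversion to a float array no longer fails.
training_df = pd.concat([df[numeric_cols], encoded_df], axis=1)
training_data = training_df.to_numpy(dtype=float)
```

Because each one-hot value lives in its own column, no row contains a nested array, which is what triggered "setting an array element with a sequence" in the question.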
This is of course only a practical solution if you do not have too many unique values in your categories. For cases where your categorical features can take many different values, you could look into more advanced encoding methods such as Backward Difference Coding or Polynomial Coding.
Which version of sklearn are you using?
I see that in sklearn version 0.18.1, passing 1d arrays as data is deprecated; it gives the warning below and does not produce the desired result.
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
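To make the two reshape calls from the warning concrete, here is a small NumPy sketch. The array `codes` is an invented example of one label-encoded feature, not data from the question.

```python
import numpy as np

# Hypothetical label-encoded values for a single feature, four samples.
codes = np.array([0, 2, 1, 2])

# Single feature: one column, one row per sample.
as_column = codes.reshape(-1, 1)

# Single sample: one row, one column per feature.
as_row = codes.reshape(1, -1)

print(as_column.shape)  # (4, 1)
print(as_row.shape)     # (1, 4)
```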
Try replacing the following line of code
onehotlabels = enc.transform(Y_2).toarray()
with the one below
onehotlabels = enc.transform(Y_2.reshape((-1,1))).toarray()
or you may use pd.get_dummies to get the one-hot encoded feature matrix.
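A minimal sketch of the `pd.get_dummies` route, assuming a small invented dataframe (the `device`, `visits` and `label` columns are placeholders, not the question's actual columns):

```python
import pandas as pd

# Toy data; the column names are hypothetical.
df = pd.DataFrame({
    "device": ["mobile", "desktop", "mobile"],
    "visits": [2, 5, 1],
    "label": [5, 3, 4],
})

# get_dummies expands the listed string columns into indicator columns
# and leaves the numeric columns untouched.
features = pd.get_dummies(df.drop(columns="label"), columns=["device"], dtype=float)

X = features.to_numpy(dtype=float)
y = df["label"].to_numpy()
```

`X` and `y` can then be passed to `LogisticRegression().fit(X, y)`. One design caveat: `get_dummies` does not remember the categories it saw, so data encoded later must be aligned to the same columns by hand, whereas a fitted `OneHotEncoder` keeps that mapping.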
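Notice: the technical posts on this site follow the CC BY-SA 4.0 license. If you repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.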