Convert categorical data into numerical data in Python
I have a data set. One of its columns, "Keyword", contains categorical data. The machine learning algorithm that I am trying to use takes only numeric data. I want to convert the "Keyword" column into numeric values. How can I do that? Using NLP? Bag of words?
I tried the following, but I got ValueError: Expected 2D array, got 1D array instead.
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
dataset['Keyword'] = count_vector.fit_transform(dataset['Keyword'])
from sklearn.model_selection import train_test_split
y=dataset['C']
x=dataset[['Keyword','A','B']]
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train,y_train)
You probably want to use an encoder. Two of the most widely used are LabelEncoder and OneHotEncoder. Both are provided by the sklearn library.
LabelEncoder can be used to transform categorical data into integers:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
x = ['Apple', 'Orange', 'Apple', 'Pear']
y = label_encoder.fit_transform(x)
print(y)
[0 1 0 2]
This would transform the list ['Apple', 'Orange', 'Apple', 'Pear'] into [0, 1, 0, 2], with each integer corresponding to one category. This is not always ideal for ML, because the integers have different numerical values, which suggests an ordering such as Pear > Apple that does not actually exist. To avoid introducing this kind of problem, you'd want to use OneHotEncoder.
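The arbitrary ordering described above can be inspected directly. The following is a minimal sketch, reusing the same fruit list, that shows which integer LabelEncoder assigned to each category (the codes follow the sorted order of the labels, not any real ranking) and how to map the integers back to the original labels. The variable names mapping and restored are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
x = ['Apple', 'Orange', 'Apple', 'Pear']
y = label_encoder.fit_transform(x)

# classes_ holds the sorted unique labels; a label's position is its integer code
mapping = {label: code for code, label in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels
restored = list(label_encoder.inverse_transform(y))
print(restored)
```

Keeping the fitted label_encoder around is what makes it possible to decode predictions or encode new data consistently later on.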
OneHotEncoder can be used to transform categorical data into a one-hot encoded array. Encoding the previously defined y with OneHotEncoder would result in:
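from sklearn.preprocessing import OneHotEncoder
# sparse_output=False returns a dense array (on scikit-learn older than 1.2 the parameter is named sparse)
onehot_encoder = OneHotEncoder(sparse_output=False)
# OneHotEncoder expects a 2D array of shape (n_samples, n_features)
y = y.reshape(len(y), 1)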
onehot_encoded = onehot_encoder.fit_transform(y)
print(onehot_encoded)
[[1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
Here, each element of x turns into an array of zeroes with a single 1, which encodes the category of the element.
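As a side note, recent scikit-learn versions (0.20 and later) let OneHotEncoder work on string categories directly, so the LabelEncoder step is not strictly needed for input features. The sketch below assumes the same fruit list as above; the only requirement is that the input is 2D, which is also what the "Expected 2D array, got 1D array" error in the question is complaining about.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Reshape the 1D list of labels into a 2D column: shape (n_samples, 1)
x = np.array(['Apple', 'Orange', 'Apple', 'Pear']).reshape(-1, 1)

encoder = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it dense for printing
onehot = encoder.fit_transform(x).toarray()
print(onehot)

# categories_ lists the labels in the same order as the output columns
print(encoder.categories_[0])
```

Leaving the output sparse (skipping toarray()) is usually preferable when the column has many distinct categories.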
A simple tutorial on how to use this on a DataFrame can be found here.
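For the DataFrame from the question, one possible approach is sketched below. It is not taken from the linked tutorial. It assumes the column names Keyword, A, B and C from the question; the data values are made up for illustration. ColumnTransformer one-hot encodes only the Keyword column and passes the numeric columns through unchanged, so the regressor receives purely numeric input.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; only the column names come from the question
dataset = pd.DataFrame({
    'Keyword': ['Apple', 'Orange', 'Apple', 'Pear', 'Orange', 'Pear'],
    'A': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'B': [0.5, 0.1, 0.9, 0.3, 0.7, 0.2],
    'C': [10.0, 20.0, 12.0, 31.0, 22.0, 33.0],
})

x = dataset[['Keyword', 'A', 'B']]  # double brackets select columns as a DataFrame
y = dataset['C']

# One-hot encode 'Keyword'; keep 'A' and 'B' as they are
preprocess = ColumnTransformer(
    transformers=[('keyword', OneHotEncoder(handle_unknown='ignore'), ['Keyword'])],
    remainder='passthrough',
)

# The pipeline applies the encoding before fitting the regressor
model = make_pipeline(preprocess, LinearRegression())
model.fit(x, y)

encoded = preprocess.transform(x)  # 3 one-hot columns + 2 numeric columns
predictions = model.predict(x)
print(encoded.shape)
```

Wrapping the encoder in a pipeline means the same encoding is reapplied automatically at prediction time, and handle_unknown='ignore' keeps unseen keywords in a test split from raising an error.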