Convert categorical data into numerical data in Python

I have a data set. One of its columns, "Keyword", contains categorical data. The machine learning algorithm that I am trying to use accepts only numeric data. How can I convert the "Keyword" column into numeric values? Should I use NLP, for example bag of words?

I tried the following, but I got ValueError: Expected 2D array, got 1D array instead.

from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer()
dataset['Keyword'] = count_vector.fit_transform(dataset['Keyword'])
from sklearn.model_selection import train_test_split
y=dataset['C']
x=dataset[['Keyword','A','B']]
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train,y_train)

You probably want to use an encoder. Two of the most widely used are LabelEncoder and OneHotEncoder. Both are provided by the sklearn library.

LabelEncoder can be used to transform categorical data into integers:

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
x = ['Apple', 'Orange', 'Apple', 'Pear']
y = label_encoder.fit_transform(x)
print(y)

[0 1 0 2]

This transforms the list ['Apple', 'Orange', 'Apple', 'Pear'] into [0, 1, 0, 2], with each integer corresponding to one category. This is not always ideal for ML, because the integers have different numerical values and therefore suggest an ordering, for example Pear > Apple, that does not exist. To avoid introducing this kind of problem, use OneHotEncoder.
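
To see which integer was assigned to which category, the fitted encoder can be inspected. The following is a minimal sketch, assuming scikit-learn is installed; the variable names `fruits`, `codes`, `mapping` and `decoded` are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

fruits = ['Apple', 'Orange', 'Apple', 'Pear']
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(fruits)

# classes_ holds the unique labels in sorted order;
# the position of a label in classes_ is its integer code
mapping = {str(label): int(code)
           for code, label in enumerate(label_encoder.classes_)}
print(mapping)           # {'Apple': 0, 'Orange': 1, 'Pear': 2}

# inverse_transform maps the integer codes back to the original labels
decoded = label_encoder.inverse_transform(codes)
print(decoded.tolist())  # ['Apple', 'Orange', 'Apple', 'Pear']
```

The codes follow the sorted order of the labels, which is why they carry no meaning beyond identifying the category.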

OneHotEncoder can be used to transform categorical data into a one-hot encoded array. Encoding the previously defined y with OneHotEncoder gives:

from sklearn.preprocessing import OneHotEncoder
# scikit-learn >= 1.2; older versions use OneHotEncoder(sparse=False)
onehot_encoder = OneHotEncoder(sparse_output=False)
# the encoder expects a 2D array: one row per sample, one column
y = y.reshape(len(y), 1)
onehot_encoded = onehot_encoder.fit_transform(y)
print(onehot_encoded)

[[1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]

Each element of x turns into an array of zeros with a single 1, whose position encodes the category of the element.
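
The intermediate LabelEncoder step is not strictly needed, since OneHotEncoder also accepts string categories as long as the input is 2D. A minimal sketch, assuming a reasonably recent scikit-learn; calling `.toarray()` on the default sparse result sidesteps the `sparse` / `sparse_output` naming difference between versions, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column of string categories, reshaped to 2D (n_samples, 1)
fruits = np.array(['Apple', 'Orange', 'Apple', 'Pear']).reshape(-1, 1)

encoder = OneHotEncoder()
# The default output is a sparse matrix; convert it to a dense array
onehot = encoder.fit_transform(fruits).toarray()

# categories_ lists the categories in the order of the output columns
print(encoder.categories_[0].tolist())  # ['Apple', 'Orange', 'Pear']
print(onehot)
```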

A simple tutorial on how to use this on a DataFrame can be found here.
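
As a rough illustration of the DataFrame case, the sketch below one-hot encodes only the "Keyword" column and passes "A" and "B" through unchanged before fitting the regressor. It is one possible approach and not taken from the tutorial mentioned above. The column names come from the question; the data values, the `preprocess` and `model` names, and the choice of ColumnTransformer with a pipeline are assumptions made for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data; only the column names come from the question
dataset = pd.DataFrame({
    'Keyword': ['red', 'green', 'blue', 'green', 'red', 'blue'],
    'A': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'B': [0.5, 0.1, 0.9, 0.3, 0.7, 0.2],
    'C': [10.0, 12.0, 15.0, 14.0, 18.0, 19.0],
})

x = dataset[['Keyword', 'A', 'B']]  # double brackets select a 2D DataFrame
y = dataset['C']

# One-hot encode 'Keyword'; leave the numeric columns as they are
preprocess = ColumnTransformer(
    [('keyword', OneHotEncoder(handle_unknown='ignore'), ['Keyword'])],
    remainder='passthrough',
)
model = make_pipeline(preprocess, LinearRegression())
model.fit(x, y)

# 3 one-hot columns for 'Keyword' plus 'A' and 'B'
encoded = model[:-1].transform(x)
print(encoded.shape)  # (6, 5)

predictions = model.predict(x.head(2))
print(predictions)
```

Keeping the encoder inside the pipeline means the same encoding learned from the training data is applied again at prediction time.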
