
NumPy array conversion error

I have a dataset with string and float data. NumPy tries to convert everything to a float, giving the error "could not convert string to float".

import numpy as np
import scipy
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 'display.height' is no longer a valid pandas option, so only the width is set
pd.set_option('display.width', 750)

colnames = ['AGE', 'WORKCLASS', 'FNLWGT', 'EDU', 'EDU-NUM', 'MARITAL-STATUS',
            'JOB', 'RELATIONSHIP', 'RACE', 'SEX', 'CAPITAL-GAIN', 'CAPITAL-LOSS',
            'HOURS-PER-WEEK', 'NATIVE-COUNTRY', 'INCOME']
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
adults = pd.read_csv(url, names=colnames, header=None)

adults['CAPITAL-GAINS'] = (adults['CAPITAL-GAIN'] - adults['CAPITAL-LOSS'])

adults = adults.drop(['RELATIONSHIP', 'FNLWGT', 'EDU-NUM', 'MARITAL-STATUS',
                      'CAPITAL-GAIN', 'CAPITAL-LOSS'], axis=1)
# rearrange the columns to make it easier to set X
adults = adults[['AGE', 'WORKCLASS', 'EDU', 'JOB', 'RACE', 'SEX',
                 'HOURS-PER-WEEK', 'NATIVE-COUNTRY', 'CAPITAL-GAINS', 'INCOME']]
adults.replace({'?': 0}, inplace=True)
#assign the X and y arrays using numpy
X = np.array(adults.iloc[:, 0:9])  # .ix was removed from pandas; .iloc selects by position
y = np.array(adults['INCOME'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
knn = KNeighborsClassifier()
knn.fit(X_train ,y_train)
pred = knn.predict(X_test)
print(accuracy_score(y_test, pred))

Traceback:

Traceback (most recent call last):
  File "C:/Users/nolan/OneDrive/Desktop/digits.py", line 37, in <module>
    knn.fit(X_train ,y_train)
  File "C:\Program Files\Python\lib\site-packages\sklearn\neighbors\base.py", line 765, in fit
    X, y = check_X_y(X, y, "csr", multi_output=True)
  File "C:\Program Files\Python\lib\site-packages\sklearn\utils\validation.py", line 573, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "C:\Program Files\Python\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: ' Peru'

All the data looks like this:

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0

Is there a way to set NumPy to hold this data without the conversion error?

There is no NumPy conversion error here; the issue is simply that the k-NN algorithm cannot handle categorical features. It is true that this is not explicitly mentioned in the scikit-learn documentation, but it follows directly if you have even a rough idea of what the algorithm does: it computes distances between the data points so that it can then find the k nearest ones, hence the name. Since there is no simple, general way to compute distances between categorical features, the algorithm is simply not applicable in such cases.
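One common workaround, not part of the original answer, is to turn each categorical column into numeric indicator columns (one-hot encoding) so that distances become computable. The sketch below assumes a small made-up DataFrame standing in for the adult data; the rows and the choice of `n_neighbors=3` are illustrative only.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in for the adult dataset (values are made up for illustration).
df = pd.DataFrame({
    'AGE': [39, 50, 38, 53, 28, 37],
    'SEX': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female'],
    'NATIVE-COUNTRY': ['United-States', 'United-States', 'Peru',
                       'Cuba', 'Cuba', 'Peru'],
    'INCOME': ['<=50K', '<=50K', '>50K', '<=50K', '>50K', '>50K'],
})

# One-hot encode the categorical columns; numeric columns pass through unchanged.
X = pd.get_dummies(df.drop(columns='INCOME'),
                   columns=['SEX', 'NATIVE-COUNTRY'], dtype=float)
y = df['INCOME']

# Every feature is now numeric, so k-NN can compute distances.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)
pred = knn.predict(X)
print(X.columns.tolist())
print(pred)
```

Note that the value in the traceback is ' Peru' with a leading space, so on the real file it may help to pass `skipinitialspace=True` to `pd.read_csv`. Whether one-hot distances are meaningful for a given problem is a modelling decision, and numeric columns such as AGE would usually need scaling so they do not dominate the distance.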

See also this answer at Data Science Stack Exchange.

You should change the classifier, if possible. SVM and neural networks support this type of data, but KNN does not support it.
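If you do try a different classifier, keep in mind that, as far as I know, scikit-learn's `SVC` also expects numeric input, so an encoding step is still assumed here. A minimal sketch on made-up toy data, with the preprocessing wrapped in a `Pipeline` so the classifier can be swapped; `make_model` is a hypothetical helper, not a scikit-learn function.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Toy stand-in for the adult dataset (values are made up for illustration).
df = pd.DataFrame({
    'AGE': [39, 50, 38, 53, 28, 37, 49, 52],
    'WORKCLASS': ['State-gov', 'Private', 'Private', 'Private',
                  'Private', 'State-gov', 'Private', 'State-gov'],
    'SEX': ['Male', 'Male', 'Male', 'Male',
            'Female', 'Female', 'Female', 'Male'],
    'HOURS-PER-WEEK': [40, 13, 40, 40, 40, 40, 16, 45],
    'INCOME': ['<=50K', '<=50K', '<=50K', '<=50K',
               '>50K', '>50K', '<=50K', '>50K'],
})
X = df.drop(columns='INCOME')
y = df['INCOME']


def make_model(classifier):
    """Hypothetical helper: encode and scale the features, then classify."""
    preprocess = ColumnTransformer([
        # Categorical columns become one-hot indicator columns.
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['WORKCLASS', 'SEX']),
        # Numeric columns are standardized so no single one dominates.
        ('num', StandardScaler(), ['AGE', 'HOURS-PER-WEEK']),
    ])
    return Pipeline([('preprocess', preprocess), ('clf', classifier)])


# The same preprocessing feeds either classifier.
svm_pred = make_model(SVC()).fit(X, y).predict(X)
knn_pred = make_model(KNeighborsClassifier(n_neighbors=3)).fit(X, y).predict(X)
print(svm_pred)
print(knn_pred)
```

Keeping the encoder inside the pipeline means it is fitted only on the training data when combined with `train_test_split` or cross-validation.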


 