NumPy array conversion error

Question

I have a dataset with string and float data. NumPy tries to convert everything to a float, giving the error "could not convert string to float".

import numpy as np
import scipy
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.metrics import accuracy_score  # used by the final print
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

pd.set_option('display.width', 750)

colnames = ['AGE', 'WORKCLASS', 'FNLWGT', 'EDU', 'EDU-NUM', 'MARITAL-STATUS',
            'JOB', 'RELATIONSHIP', 'RACE', 'SEX', 'CAPITAL-GAIN', 'CAPITAL-LOSS',
            'HOURS-PER-WEEK', 'NATIVE-COUNTRY', 'INCOME']
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
adults = pd.read_csv(url, names=colnames, header=None)

adults['CAPITAL-GAINS'] = (adults['CAPITAL-GAIN'] - adults['CAPITAL-LOSS'])

adults = adults.drop(['RELATIONSHIP', 'FNLWGT', 'EDU-NUM', 'MARITAL-STATUS',
                      'CAPITAL-GAIN', 'CAPITAL-LOSS'], axis=1)
# rearrange the columns to make it easier to set X
adults = adults[['AGE', 'WORKCLASS', 'EDU', 'JOB', 'RACE', 'SEX',
                 'HOURS-PER-WEEK', 'NATIVE-COUNTRY', 'CAPITAL-GAINS', 'INCOME']]
adults.replace({'?': 0}, inplace=True)
# assign the X and y arrays using numpy
X = np.array(adults.iloc[:, 0:9])  # positional slice; .ix is gone from current pandas
y = np.array(adults['INCOME'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
knn = KNeighborsClassifier()
knn.fit(X_train ,y_train)
pred = knn.predict(X_test)
print (accuracy_score(y_test, pred))

Traceback:

Traceback (most recent call last):
  File "C:/Users/nolan/OneDrive/Desktop/digits.py", line 37, in <module>
    knn.fit(X_train ,y_train)
  File "C:\Program Files\Python\lib\site-packages\sklearn\neighbors\base.py", line 765, in fit
    X, y = check_X_y(X, y, "csr", multi_output=True)
  File "C:\Program Files\Python\lib\site-packages\sklearn\utils\validation.py", line 573, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "C:\Program Files\Python\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: ' Peru'

All the data looks like this:

39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0

Is there a way to set up NumPy to hold this data without the conversion error?

Answer 1

There is no NumPy conversion error here; the issue is simply that the k-NN algorithm cannot handle categorical features. It is true that this is not explicitly mentioned in the scikit-learn documentation, but it follows directly if you have even a rough idea of what the algorithm does: it computes distances between the data points so that it can subsequently find the k nearest ones, hence the name. And since there is no (simple and general) way to compute distances between categorical features, the algorithm is simply not applicable in such cases.
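To see that NumPy itself is not the culprit, here is a minimal sketch. The rows are made up for illustration (they are not taken from the dataset). An object array holds the mixed data without complaint; the failure only shows up when a float array is requested, which is the `np.array(array, dtype=dtype, ...)` call visible in the traceback.

```python
import numpy as np

# Hypothetical rows mimicking the dataset: numbers mixed with strings.
rows = [[39, " State-gov", 40, " Peru"],
        [50, " Private", 13, " United-States"]]

# NumPy stores mixed data fine when the dtype is object.
X = np.array(rows, dtype=object)
print(X.dtype)

# The error appears only when a float array is requested.
try:
    np.array(X, dtype=float)
    conversion_error = None
except ValueError as exc:
    conversion_error = str(exc)

print(conversion_error)
```

So the array in the question is created successfully; it is the classifier's input validation that asks for floats.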

See also this answer at Data Science Stack Exchange.
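If you want to stay with k-NN anyway, one common workaround (not something the answer above proposes) is to one-hot encode the categorical columns so that every feature is numeric. The sketch below uses a tiny made-up frame with a few of the question's column names; the values and `n_neighbors=3` are assumptions for illustration only. Whether distances on such encoded columns are meaningful for your data is a judgment call.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical miniature of the dataset; values are illustrative only.
df = pd.DataFrame({
    "AGE": [39, 50, 38, 53, 28, 37],
    "WORKCLASS": ["State-gov", "Self-emp", "Private",
                  "Private", "Private", "State-gov"],
    "HOURS-PER-WEEK": [40, 13, 40, 40, 40, 45],
    "INCOME": ["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"],
})

# One-hot encode the categorical column so every feature is numeric.
X = pd.get_dummies(df.drop(columns="INCOME"),
                   columns=["WORKCLASS"], dtype=float)
y = df["INCOME"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)            # no conversion error: X holds only numbers
pred = knn.predict(X)
print(list(X.columns))
```

With the real data you would encode every string column (WORKCLASS, EDU, JOB, and so on) the same way before calling train_test_split.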

Answer 2

You should change the classifier, if possible. SVM and neural networks support this type of data, but KNN does not support it.

Answer 1
2 accepted 2018-04-05 23:46:15

Answer 2
0 2018-04-06 00:07:14
