ValueError: Input contains NaN, infinity or a value too large for dtype('float64') using fit from KNeighborsRegressor
Prior to attempting the fit I thoroughly cleaned my data frame and ensured that the entire data frame has no inf or NaN values and is composed entirely of non-null float64 values. However, I still redundantly verified this using the np.isinf(), df.isnull().sum() and df.info() methods. All my research showed that others with the same issue had NaN, inf, or object data types in their data frame. This is not so in my case. Lastly, I found a vaguely similar case which was resolved using this code:
df = df.apply(lambda x: pd.to_numeric(x,errors='ignore'))
This did not help in my situation. How can I resolve this ValueError exception?
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np
# Read csv file and assign column names
headers=['symboling','normalized_losses','make','fuel_type','aspiration','num_of_doors',
'body_style','drive_wheels','engine_location','wheel_base','length','width',
'height','curb_weight','engine_type','num_of_cylinders','engine_size','fuel_system',
'bore','stroke','compression_ratio','horsepower','peak_rpm','city_mpg','highway_mpg',
'price']
cars = pd.read_csv('imports-85.data.txt', names=headers)
# Select only the columns with continuous values from - https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names
continuous_values_cols = ['normalized_losses', 'wheel_base', 'length', 'width', 'height', 'curb_weight',
'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg', 'price']
numeric_cars = cars[continuous_values_cols].copy()
# Clean the data set: replace missing values ('?') with np.nan, then cast to float
numeric_cars.replace(to_replace='?', value=np.nan, inplace=True)
numeric_cars = numeric_cars.astype('float')
# Because the column we're trying to predict is 'price', any row where price is NaN is removed
numeric_cars.dropna(subset=['price'], inplace=True)
# All remaining NaN's will be filled with the mean of its respective column
numeric_cars = numeric_cars.fillna(numeric_cars.mean())
# Create training feature list and k value list
test_features = numeric_cars.columns.tolist()
predictive_feature = 'price'
test_features.remove(predictive_feature)
k_values = [x for x in range(10) if x/2 != round(x/2)]
# Normalize columns
numeric_cars_normalized = numeric_cars[test_features].copy()
numeric_cars_normalized = numeric_cars_normalized/ numeric_cars.max()
numeric_cars_normalized[predictive_feature] = numeric_cars[predictive_feature].copy()
def knn_train_test(df, train_columns, predict_feature, k_value):
    # Randomly resorts the DataFrame to mitigate sampling bias
    np.random.seed(1)
    df = df.loc[np.random.permutation(len(df))]
    # Split the DataFrame into ~75% train / 25% test data sets
    split_integer = round(len(df) * 0.75)
    train_df = df.iloc[0:split_integer]
    test_df = df.iloc[split_integer:]
    train_features = train_df[train_columns]
    train_target = train_df[predict_feature]
    # Trains the model
    knn = KNeighborsRegressor(n_neighbors=k_value)
    knn.fit(train_features, train_target)
    # Test the model & return the calculated mean squared error
    predictions = knn.predict(test_df[train_columns])
    print("predictions")
    mse = mean_squared_error(y_true=test_df[predict_feature], y_pred=predictions)
    return mse
# instantiate mse dict
mse_dict = {}
# test each feature and do so with a range of k values
# in an effort to determine the optimal training feature and k value
for feature in test_features:
    mse = [knn_train_test(numeric_cars_normalized,feature, predictive_feature, k) for k in k_values]
    mse_dict[feature] = mse
print(mse_dict)
Here's the full error traceback:
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)
Traceback (most recent call last):
  File "C:\DATAQUEST\06_MachineLearning\01_ML_Fundamentals\06_GuidedProject_PredictingCarPrices\PredictingCarPrices.py", line 76, in <module>
    mse = [knn_train_test(numeric_cars_normalized,feature, predictive_feature, k) for k in k_values]
  File "C:\DATAQUEST\06_MachineLearning\01_ML_Fundamentals\06_GuidedProject_PredictingCarPrices\PredictingCarPrices.py", line 76, in <listcomp>
    mse = [knn_train_test(numeric_cars_normalized,feature, predictive_feature, k) for k in k_values]
  File "C:\DATAQUEST\06_MachineLearning\01_ML_Fundamentals\06_GuidedProject_PredictingCarPrices\PredictingCarPrices.py", line 60, in knn_train_test
    knn.fit(train_features, train_target)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\neighbors\base.py", line 741, in fit
    X, y = check_X_y(X, y, "csr", multi_output=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 521, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 407, in check_array
    _assert_all_finite(array)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 58, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Here's the code and output I used to verify that there are no NaN or inf values in my DataFrame:
# Verify data for NaN and inf
print(len(numeric_cars_normalized))
# 201
print(numeric_cars_normalized.info())
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 201 entries, 0 to 204
# Data columns (total 14 columns):
# bore 201 non-null float64
# city_mpg 201 non-null float64
# compression_ratio 201 non-null float64
# curb_weight 201 non-null float64
# height 201 non-null float64
# highway_mpg 201 non-null float64
# horsepower 201 non-null float64
# length 201 non-null float64
# normalized_losses 201 non-null float64
# peak_rpm 201 non-null float64
# price 201 non-null float64
# stroke 201 non-null float64
# wheel_base 201 non-null float64
# width 201 non-null float64
# dtypes: float64(14)
# memory usage: 23.6 KB
# None
print(numeric_cars_normalized.isnull().sum())
# bore 0
# city_mpg 0
# compression_ratio 0
# curb_weight 0
# height 0
# highway_mpg 0
# horsepower 0
# length 0
# normalized_losses 0
# peak_rpm 0
# price 0
# stroke 0
# wheel_base 0
# width 0
# dtype: int64
# The loop below, essentially does the same as the above
# verification, but using different methods
# the purpose is to prove there's no nan or inf in my data set
index = []
NaN_counter = []
inf_counter = []
for col in numeric_cars_normalized.columns:
    index.append(col)
    # inf counter
    col_isinf = np.isinf(numeric_cars_normalized[col])
    if col_isinf.value_counts().index[0] == False:
        inf_counter.append(col_isinf.value_counts()[0])
    # nan counter
    col_isnan = np.isnan(numeric_cars_normalized[col])
    if col_isnan.value_counts().index[0] == False:
        NaN_counter.append(col_isnan.value_counts()[0])
data_check = {'NOT_NaN_count': NaN_counter, 'NOT_inf_count': inf_counter}
data_verification = pd.DataFrame(data=data_check, index=index)
print(data_verification)
# NOT_NaN_count NOT_inf_count
# bore 201 201
# city_mpg 201 201
# compression_ratio 201 201
# curb_weight 201 201
# height 201 201
# highway_mpg 201 201
# horsepower 201 201
# length 201 201
# normalized_losses 201 201
# peak_rpm 201 201
# price 201 201
# stroke 201 201
# wheel_base 201 201
# width 201 201
I may have found the problem, but I'm still not sure how to fix it.
# Here's another methodology for extra redundant data checking
index = []
NaN_counter = []
inf_counter = []
for col in numeric_cars_normalized.columns:
    index.append(col)
    inf_counter.append(np.any(np.isfinite(numeric_cars_normalized[col])))
    NaN_counter.append(np.any(np.isnan(numeric_cars_normalized[col])))
data_check = {'Any_NaN': NaN_counter, 'Any_inf': inf_counter}
data_verification = pd.DataFrame(data=data_check, index=index)
print(data_verification)
# Any_NaN Any_inf
# bore False True
# city_mpg False True
# compression_ratio False True
# curb_weight False True
# height False True
# highway_mpg False True
# horsepower False True
# length False True
# normalized_losses False True
# peak_rpm False True
# price False True
# stroke False True
# wheel_base False True
# width False True
So clearly I have inf in my DataSet, but I'm not sure why or how to fix it.
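One caveat about the last check: it calls np.isfinite rather than np.isinf, so a True in the Any_inf column only says that the column contains at least one finite value, not that it contains inf. A minimal sketch of a check that counts NaN and inf directly, using a small made-up frame (the name `demo` and its values are hypothetical, not the car data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the car data set.
demo = pd.DataFrame({
    "clean": [1.0, 2.0, 3.0],
    "has_nan": [1.0, np.nan, 3.0],
    "has_inf": [1.0, np.inf, 3.0],
})

# np.isfinite is True wherever a value is finite, so np.any(...) is True
# for every column that holds at least one ordinary number.
any_finite = demo.apply(lambda col: bool(np.any(np.isfinite(col))))

# Counting NaN and +/-inf per column answers the intended question.
report = pd.DataFrame({
    "nan_count": demo.isna().sum(),
    "inf_count": demo.apply(lambda col: int(np.isinf(col).sum())),
})
print(any_finite)
print(report)
```

Here any_finite is True for all three columns, including the one with no inf at all, while report singles out the one NaN and the one inf.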
The problem you are having seems to come from the permutation that you are doing. When you clean your data, you end up with only 201 rows, while the index labels still run from 0 to 204 (see the Int64Index line in your df.info() output). np.random.permutation(len(df)) returns the positions 0 to 200, and df.loc looks them up as index labels, so every label that was dropped during cleaning comes back as a row of NaN. By debugging the DataFrame that you pass to the knn function, you can see that, once numeric_cars_normalized has been permuted, three of the rows are indeed 'nan' in every column.
By commenting out these two lines:
# np.random.seed(1)
# df = df.loc[np.random.permutation(len(df))]
and rerunning the code, you will obtain results. There is an additional change that you should make: scikit-learn expects the features as a 2-D array (this is what the DeprecationWarning at the top of your traceback is about), so you should convert the DataFrames (Series) to values with the correct dimensions and then operate on them. In your particular case all of them are Series, and you can convert them with:
series.values.reshape(-1, 1)
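To make the effect of the reshape concrete, here is a small sketch of the shapes involved. It uses a made-up frame (`toy` and its values are hypothetical, standing in for numeric_cars_normalized); a single column selected as a Series is 1-D, while an estimator's feature matrix is expected to have shape (n_samples, n_features):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for numeric_cars_normalized.
toy = pd.DataFrame({"width": [0.1, 0.2, 0.3], "price": [10.0, 20.0, 30.0]})

as_series = toy["width"].values                  # 1-D array
as_column = toy["width"].values.reshape(-1, 1)   # 2-D, one feature column
as_frame = toy[["width"]].values                 # 2-D via a list of columns

print(as_series.shape, as_column.shape, as_frame.shape)
# prints: (3,) (3, 1) (3, 1)
```

Selecting with a list of column names, as in toy[["width"]], gives the same 2-D shape without an explicit reshape.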
Here is the knn function with all the changes:
def knn_train_test(df, train_columns, predict_feature, k_value):
    #print(train_columns, k_value)
    # Randomly resorts the DataFrame to mitigate sampling bias
    #np.random.seed(1)
    #df = df.loc[np.random.permutation(len(df))]
    # Split the DataFrame into ~75% train / 25% test data sets
    split_integer = round(len(df) * 0.75)
    train_df = df.iloc[0:split_integer]
    test_df = df.iloc[split_integer:]
    train_features = train_df[train_columns].values.reshape(-1, 1)
    train_target = train_df[predict_feature].values.reshape(-1, 1)
    # Trains the model
    knn = KNeighborsRegressor(n_neighbors=k_value)
    knn.fit(train_features, train_target)
    # Test the model & return the calculated mean squared error
    predictions = knn.predict(test_df[train_columns].values.reshape(-1, 1))
    print("predictions")
    mse = mean_squared_error(y_true=test_df[predict_feature], y_pred=predictions)
    return mse
With that, and assuming I have the correct input file, this is what I got:
predictions
{'normalized_losses': [100210405.34, 116919980.22444445, 88928383.280000001, 62378305.931836732, 65695537.133086421], 'wheel_base': [10942945.5, 31106845.595555563, 34758670.590399988, 29302177.901632652, 25464306.165925924], 'length': [71007156.219999999, 37635782.111111119, 33676038.287999995, 29868192.295918364, 22553474.111604933], 'width': [42519394.439999998, 25956086.771111108, 15199079.0744, 10443175.389795918, 8440465.6864197534], 'height': [117942530.56, 62910880.079999998, 41771068.588, 33511475.561224483, 31537852.588641971], 'curb_weight': [14514970.42, 6103365.4644444454, 6223489.0728000011, 7282828.3632653067, 6884187.4446913591], 'bore': [57147986.359999999, 88529631.346666679, 68063251.098399997, 58753168.154285707, 42950965.435555562], 'stroke': [145522819.16, 98024560.913333327, 61229681.429599993, 36452809.841224492, 25989788.846172832], 'compression_ratio': [93309449.939999998, 18108906.400000002, 30175663.952, 44964197.869387761, 39926111.747407407], 'horsepower': [25158775.920000002, 17656603.506666664, 13804482.193600001, 15772395.163265305, 14689078.471851852], 'peak_rpm': [169310760.66, 86360741.248888895, 51905953.367999993, 46999120.435102046, 45218343.222716056], 'city_mpg': [15467849.460000001, 12237327.542222224, 10855581.140000001, 11479257.790612245, 11047557.746419754], 'highway_mpg': [17384289.579999998, 15877936.197777782, 7720502.6856000004, 6315372.4963265313, 7118970.4081481481]}
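If you would rather keep the shuffle than comment it out, one option is to permute by position with iloc, which never looks up index labels. The sketch below uses a made-up frame with gaps in its index (`gappy` and `shuffle_rows` are hypothetical names, not part of the original code). Recent pandas versions raise a KeyError when .loc is given missing labels, so reindex is used here as a stand-in to show the NaN rows that a label lookup produces:

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index has gaps (labels 2 and 5 are missing),
# similar to a frame after dropna().
gappy = pd.DataFrame({"x": np.arange(6, dtype=float)}, index=[0, 1, 3, 4, 6, 7])

positions = np.random.RandomState(1).permutation(len(gappy))  # values 0..5

# Label lookup: 2 and 5 are not index labels, so they become NaN rows.
by_label = gappy.reindex(positions)

# Positional lookup: every position 0..5 exists, so no NaN appears.
by_position = gappy.iloc[positions]


def shuffle_rows(df, seed=1):
    """Return df with rows shuffled by position and a fresh 0..n-1 index."""
    order = np.random.RandomState(seed).permutation(len(df))
    return df.iloc[order].reset_index(drop=True)


print(by_label["x"].isna().sum(), by_position["x"].isna().sum())
```

The label lookup yields two NaN rows and the positional one yields none. In knn_train_test this would amount to replacing df.loc[...] with df.iloc[...] in the shuffle line; calling reset_index(drop=True) on the frame right after dropna() is another way to keep labels and positions aligned.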