ValueError: Input contains NaN, infinity or a value too large for dtype('float64') when fitting with KNeighborsRegressor

Question

Prior to attempting the fit, I thoroughly cleaned my data frame and ensured that it contains no inf or NaN values and consists entirely of non-null float64 values. I still redundantly verified this using np.isinf(), df.isnull().sum() and df.info(). All my research showed that others with the same issue had NaN, inf, or object data types in their data frame. This is not so in my case. Lastly, I found a vaguely similar case that was resolved using this code:

df = df.apply(lambda x: pd.to_numeric(x,errors='ignore'))

This did not help in my situation. How can I resolve this ValueError exception?

from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np

# Read csv file and assign column names
headers=['symboling','normalized_losses','make','fuel_type','aspiration','num_of_doors',
         'body_style','drive_wheels','engine_location','wheel_base','length','width',
        'height','curb_weight','engine_type','num_of_cylinders','engine_size','fuel_system',
        'bore','stroke','compression_ratio','horsepower','peak_rpm','city_mpg','highway_mpg',
        'price']
cars = pd.read_csv('imports-85.data.txt', names=headers)

# Select only the columns with continuous values from - https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names
continuous_values_cols = ['normalized_losses', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 
                          'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg', 'price']
numeric_cars = cars[continuous_values_cols].copy()

# Clean the data set by replacing missing values (?) with np.nan, then set the type to float
numeric_cars.replace(to_replace='?', value=np.nan, inplace=True)
numeric_cars = numeric_cars.astype('float')

# Because the column we're trying to predict is 'price', any row where price is NaN will be removed.
numeric_cars.dropna(subset=['price'], inplace=True)

# All remaining NaN's will be filled with the mean of its respective column
numeric_cars = numeric_cars.fillna(numeric_cars.mean())

# Create training feature list and k value list
test_features = numeric_cars.columns.tolist()
predictive_feature = 'price'
test_features.remove(predictive_feature)
k_values = list(range(1, 10, 2))  # odd k values: 1, 3, 5, 7, 9

# Normalize columns
numeric_cars_normalized = numeric_cars[test_features].copy()
numeric_cars_normalized = numeric_cars_normalized/ numeric_cars.max()
numeric_cars_normalized[predictive_feature] = numeric_cars[predictive_feature].copy()


def knn_train_test(df, train_columns, predict_feature, k_value):

    # Randomly resorts the DataFrame to mitigate sampling bias
    np.random.seed(1)
    df = df.loc[np.random.permutation(len(df))]

    # Split the DataFrame into ~75% train / 25% test data sets
    split_integer = round(len(df) * 0.75)
    train_df = df.iloc[0:split_integer]
    test_df = df.iloc[split_integer:]

    train_features = train_df[train_columns]
    train_target = train_df[predict_feature]

    # Trains the model
    knn = KNeighborsRegressor(n_neighbors=k_value)
    knn.fit(train_features, train_target)

    # Test the model and return the calculated mean squared error
    predictions = knn.predict(test_df[train_columns])
    print("predictions")
    mse = mean_squared_error(y_true=test_df[predict_feature], y_pred=predictions)
    return mse


# instantiate mse dict
mse_dict = {}

# test each feature with a range of k values
# in an effort to determine the optimal training feature and k value
for feature in test_features:

    mse = [knn_train_test(numeric_cars_normalized,feature, predictive_feature, k) for k in k_values]
    mse_dict[feature] = mse

print(mse_dict)

Here's the full error traceback:

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)
Traceback (most recent call last):
  File "C:\DATAQUEST\06_MachineLearning\01_ML_Fundamentals\06_GuidedProject_PredictingCarPrices\PredictingCarPrices.py", line 76, in <module>
    mse = [knn_train_test(numeric_cars_normalized,feature, predictive_feature, k) for k in k_values]
  File "C:\DATAQUEST\06_MachineLearning\01_ML_Fundamentals\06_GuidedProject_PredictingCarPrices\PredictingCarPrices.py", line 76, in <listcomp>
    mse = [knn_train_test(numeric_cars_normalized,feature, predictive_feature, k) for k in k_values]
  File "C:\DATAQUEST\06_MachineLearning\01_ML_Fundamentals\06_GuidedProject_PredictingCarPrices\PredictingCarPrices.py", line 60, in knn_train_test
    knn.fit(train_features, train_target)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\neighbors\base.py", line 741, in fit
    X, y = check_X_y(X, y, "csr", multi_output=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 521, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 407, in check_array
    _assert_all_finite(array)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 58, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Here's the code and output I used to verify that there are no NaN or inf values in my DataFrame:

# Verify data for NaN and inf
print(len(numeric_cars_normalized))
# 201

print(numeric_cars_normalized.info())
# <class 'pandas.core.frame.DataFrame'>
# Int64Index: 201 entries, 0 to 204
# Data columns (total 14 columns):
# bore                 201 non-null float64
# city_mpg             201 non-null float64
# compression_ratio    201 non-null float64
# curb_weight          201 non-null float64
# height               201 non-null float64
# highway_mpg          201 non-null float64
# horsepower           201 non-null float64
# length               201 non-null float64
# normalized_losses    201 non-null float64
# peak_rpm             201 non-null float64
# price                201 non-null float64
# stroke               201 non-null float64
# wheel_base           201 non-null float64
# width                201 non-null float64
# dtypes: float64(14)
# memory usage: 23.6 KB
# None

print(numeric_cars_normalized.isnull().sum())
# bore                 0
# city_mpg             0
# compression_ratio    0
# curb_weight          0
# height               0
# highway_mpg          0
# horsepower           0
# length               0
# normalized_losses    0
# peak_rpm             0
# price                0
# stroke               0
# wheel_base           0
# width                0
# dtype: int64

# The loop below, essentially does the same as the above
# verification, but using different methods
# the purpose is to prove there's no nan or inf in my data set
index = []
NaN_counter = []
inf_counter = []
for col in numeric_cars_normalized.columns:
    index.append(col)
    # inf counter
    col_isinf = np.isinf(numeric_cars_normalized[col])
    if col_isinf.value_counts().index[0] == False:
        inf_counter.append(col_isinf.value_counts()[0])

    # nan counter    
    col_isnan = np.isnan(numeric_cars_normalized[col])
    if col_isnan.value_counts().index[0] == False:
        NaN_counter.append(col_isnan.value_counts()[0])

data_check = {'NOT_NaN_count': NaN_counter, 'NOT_inf_count': inf_counter}
data_verification = pd.DataFrame(data=data_check, index=index)
print(data_verification)

#                    NOT_NaN_count  NOT_inf_count
# bore                         201            201
# city_mpg                     201            201
# compression_ratio            201            201
# curb_weight                  201            201
# height                       201            201
# highway_mpg                  201            201
# horsepower                   201            201
# length                       201            201
# normalized_losses            201            201
# peak_rpm                     201            201
# price                        201            201
# stroke                       201            201
# wheel_base                   201            201
# width                        201            201

I may have found the problem, but I'm still not sure how to fix it.

# Here's another method for extra redundant data checking
index = []
NaN_counter = []
inf_counter = []

for col in numeric_cars_normalized.columns:
    index.append(col)
    inf_counter.append(np.any(np.isfinite(numeric_cars_normalized[col])))
    NaN_counter.append(np.any(np.isnan(numeric_cars_normalized[col])))

data_check = {'Any_NaN': NaN_counter, 'Any_inf': inf_counter}
data_verification = pd.DataFrame(data=data_check, index=index)
print(data_verification)

#                    Any_NaN  Any_inf
# bore                 False     True
# city_mpg             False     True
# compression_ratio    False     True
# curb_weight          False     True
# height               False     True
# highway_mpg          False     True
# horsepower           False     True
# length               False     True
# normalized_losses    False     True
# peak_rpm             False     True
# price                False     True
# stroke               False     True
# wheel_base           False     True
# width                False     True

So clearly I have inf in my DataSet, but I'm not sure why or how to fix it.
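
Note on the check above: np.isfinite returns True for every ordinary number, so np.any(np.isfinite(col)) being True does not indicate an inf. It only says that at least one value in the column is finite. The test for infinities is np.isinf. A minimal sketch on made-up values (the arrays below are illustrative and not taken from the cars data set):

```python
import numpy as np

# Illustrative values only: two ordinary numbers, one inf, one NaN
mixed = np.array([1.0, 2.5, np.inf, np.nan])
clean = np.array([1.0, 2.5])

# isfinite is True for ordinary numbers, False for inf and NaN
finite_mask = np.isfinite(mixed)

# isinf is True only for +inf / -inf
inf_mask = np.isinf(mixed)

# A clean column still reports "any finite" as True ...
clean_any_finite = bool(np.any(np.isfinite(clean)))
# ... while the actual inf check reports False
clean_any_inf = bool(np.any(np.isinf(clean)))

# A stricter single check: every value must be finite
mixed_all_finite = bool(np.all(np.isfinite(mixed)))
```

On a column without problems, clean_any_finite is True and clean_any_inf is False, which matches the "True" values in the Any_inf column above.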

Answer 1

The problem you are having seems to come from the permutation you are doing. It goes away if you comment out these two lines:

# np.random.seed(1)
# df = df.loc[np.random.permutation(len(df))]

This is because when you clean your data, you end up with only 201 rows, while the index labels still run from 0 to 204 (see the info() output in the question). df.loc selects by label, so df.loc[np.random.permutation(len(df))] looks up labels 0 to 200, and some of those labels were dropped. By debugging the dataframe that you pass to the knn function, you can see that three of the rows are indeed 'nan' in all columns once numeric_cars_normalized has been permuted.
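
The effect can be reproduced on a toy frame. This is only a sketch under assumptions: the frame and its values are made up, and reindex is used to stand in for the label lookup, because older pandas versions treated .loc with missing labels like a reindex while newer versions raise a KeyError instead.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) whose index has gaps, as after dropna()
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df = df.drop(index=[1, 3])  # remaining labels: 0, 2, 4

np.random.seed(1)
order = np.random.permutation(len(df))  # some ordering of 0, 1, 2

# Label-based lookup: label 1 no longer exists, so that row becomes NaN
by_label = df.reindex(order)

# Position-based lookup: always refers to existing rows
by_position = df.iloc[order]

nan_rows_by_label = int(by_label["x"].isna().sum())
nan_rows_by_position = int(by_position["x"].isna().sum())
```

Shuffling with df.iloc (or calling df.reset_index(drop=True) before the permutation) keeps the random ordering without introducing NaN rows.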

After commenting out those lines and rerunning the code, you will obtain results. There is one more change you should make: knn works better with arrays, so you should convert the dataframes (series) to values with the correct dimensions and then operate on them. In your particular case, all of them are series, so you can convert them with:

series.values.reshape(-1, 1)
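
As a small illustration of what that reshape does, here is a sketch on a made-up series (the name "width" and the values are only placeholders). It also shows Series.to_frame() as an alternative way to get the same two-dimensional shape:

```python
import pandas as pd

# Hypothetical single-feature column
series = pd.Series([0.5, 0.7, 0.9], name="width")

# 1-D values: shape (3,)
flat = series.values

# 2-D column vector: shape (3, 1), i.e. 3 samples with 1 feature each
column = series.values.reshape(-1, 1)

# Alternative: keep it as a one-column DataFrame, also shape (3, 1)
frame = series.to_frame()
```

The -1 lets NumPy infer the number of rows, so the same expression works for any series length.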

Here is the knn function with all the changes:

def knn_train_test(df, train_columns, predict_feature, k_value):

    #print(train_columns, k_value)
    # Randomly resorts the DataFrame to mitigate sampling bias
    #np.random.seed(1)
    #df = df.loc[np.random.permutation(len(df))]

    # Split the DataFrame into ~75% train / 25% test data sets
    split_integer = round(len(df) * 0.75)
    train_df = df.iloc[0:split_integer]
    test_df = df.iloc[split_integer:]

    train_features = train_df[train_columns].values.reshape(-1, 1)
    train_target = train_df[predict_feature].values.reshape(-1, 1)

    # Trains the model
    knn = KNeighborsRegressor(n_neighbors=k_value)
    knn.fit(train_features, train_target)

    # Test the model and return the calculated mean squared error
    predictions = knn.predict(test_df[train_columns].values.reshape(-1, 1))
    print("predictions")
    mse = mean_squared_error(y_true=test_df[predict_feature], y_pred=predictions)
    return mse

With that, and assuming I have the correct input file, this is what I got:

predictions
{'normalized_losses': [100210405.34, 116919980.22444445, 88928383.280000001, 62378305.931836732, 65695537.133086421], 'wheel_base': [10942945.5, 31106845.595555563, 34758670.590399988, 29302177.901632652, 25464306.165925924], 'length': [71007156.219999999, 37635782.111111119, 33676038.287999995, 29868192.295918364, 22553474.111604933], 'width': [42519394.439999998, 25956086.771111108, 15199079.0744, 10443175.389795918, 8440465.6864197534], 'height': [117942530.56, 62910880.079999998, 41771068.588, 33511475.561224483, 31537852.588641971], 'curb_weight': [14514970.42, 6103365.4644444454, 6223489.0728000011, 7282828.3632653067, 6884187.4446913591], 'bore': [57147986.359999999, 88529631.346666679, 68063251.098399997, 58753168.154285707, 42950965.435555562], 'stroke': [145522819.16, 98024560.913333327, 61229681.429599993, 36452809.841224492, 25989788.846172832], 'compression_ratio': [93309449.939999998, 18108906.400000002, 30175663.952, 44964197.869387761, 39926111.747407407], 'horsepower': [25158775.920000002, 17656603.506666664, 13804482.193600001, 15772395.163265305, 14689078.471851852], 'peak_rpm': [169310760.66, 86360741.248888895, 51905953.367999993, 46999120.435102046, 45218343.222716056], 'city_mpg': [15467849.460000001, 12237327.542222224, 10855581.140000001, 11479257.790612245, 11047557.746419754], 'highway_mpg': [17384289.579999998, 15877936.197777782, 7720502.6856000004, 6315372.4963265313, 7118970.4081481481]}
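
The function above avoids the error by skipping the shuffle altogether. If the random ordering is still wanted, one possible variant is to shuffle by position instead of by label. This is a sketch and not part of the original answer: the function name knn_train_test_shuffled, the seed parameter, and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor


def knn_train_test_shuffled(df, train_column, predict_feature, k_value, seed=1):
    # Shuffle by position, so gaps in the index labels cannot create NaN rows
    rng = np.random.RandomState(seed)
    shuffled = df.iloc[rng.permutation(len(df))]

    # Split into ~75% train / 25% test
    split_integer = round(len(shuffled) * 0.75)
    train_df = shuffled.iloc[:split_integer]
    test_df = shuffled.iloc[split_integer:]

    # Double brackets keep the single feature two-dimensional: (n_samples, 1)
    knn = KNeighborsRegressor(n_neighbors=k_value)
    knn.fit(train_df[[train_column]], train_df[predict_feature])

    predictions = knn.predict(test_df[[train_column]])
    return mean_squared_error(test_df[predict_feature], predictions)


# Toy data with gaps in the index (hypothetical values, not the cars data set)
toy = pd.DataFrame({"width": np.linspace(0.1, 1.0, 20),
                    "price": np.linspace(1000.0, 20000.0, 20)})
toy = toy.drop(index=[3, 7, 11])

mse = knn_train_test_shuffled(toy, "width", "price", k_value=3)
```

With this variant the MSE values will differ from the ones listed above, because the train/test split is no longer the first 75% of the rows in file order.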

Answer 1 (score 2, accepted, 2018-03-01 15:32:56)