
Scikit-learn: Input contains NaN, infinity or a value too large for dtype('float64')

I'm using Python scikit-learn for simple linear regression on data read from a CSV file.

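import numpy as np
import pandas
from sklearn import linear_model
from sklearn.model_selection import train_test_split

reader = pandas.io.parsers.read_csv("data/all-stocks-cleaned.csv")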
stock = np.array(reader)

openingPrice = stock[:, 1]
closingPrice = stock[:, 5]

print((np.min(openingPrice)))
print((np.min(closingPrice)))
print((np.max(openingPrice)))
print((np.max(closingPrice)))

openingPriceTrain, openingPriceTest, closingPriceTrain, closingPriceTest = \
    train_test_split(openingPrice, closingPrice, test_size=0.25, random_state=42)


openingPriceTrain = np.reshape(openingPriceTrain,(openingPriceTrain.size,1))

openingPriceTrain = openingPriceTrain.astype(np.float64, copy=False)
# openingPriceTrain = np.arange(openingPriceTrain, dtype=np.float64)

closingPriceTrain = np.reshape(closingPriceTrain,(closingPriceTrain.size,1))
closingPriceTrain = closingPriceTrain.astype(np.float64, copy=False)

openingPriceTest = np.reshape(openingPriceTest,(openingPriceTest.size,1))
closingPriceTest = np.reshape(closingPriceTest,(closingPriceTest.size,1))

regression = linear_model.LinearRegression()

regression.fit(openingPriceTrain, closingPriceTrain)

predicted = regression.predict(openingPriceTest)

The min and max values are shown as 0.0 0.6 41998.0 2593.9

Yet I'm getting this error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

How can I get rid of this error? Judging by the output above, the data doesn't seem to contain infinities or NaN values.

What's the solution for this?

Edit: all-stocks-cleaned.csv is available at http://www.sharecsv.com/s/cb31790afc9b9e33c5919cdc562630f3/all-stocks-cleaned.csv

The problem with your regression is that NaNs have somehow sneaked into your data. This can be easily checked with the following code snippet:

import pandas as pd
import numpy as np
from sklearn import linear_model
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

reader = pd.io.parsers.read_csv("./data/all-stocks-cleaned.csv")
stock = np.array(reader)

openingPrice = stock[:, 1]
closingPrice = stock[:, 5]

openingPriceTrain, openingPriceTest, closingPriceTrain, closingPriceTest = \
    train_test_split(openingPrice, closingPrice, test_size=0.25, random_state=42)

openingPriceTrain = openingPriceTrain.reshape(openingPriceTrain.size,1)
openingPriceTrain = openingPriceTrain.astype(np.float64, copy=False)

closingPriceTrain = closingPriceTrain.reshape(closingPriceTrain.size,1)
closingPriceTrain = closingPriceTrain.astype(np.float64, copy=False)

openingPriceTest = openingPriceTest.reshape(openingPriceTest.size,1)
openingPriceTest = openingPriceTest.astype(np.float64, copy=False)

np.isnan(openingPriceTrain).any(), np.isnan(closingPriceTrain).any(), np.isnan(openingPriceTest).any()

(True, True, True)
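
The error message covers infinities as well as NaN, and np.isnan only detects the latter. As a minimal sketch on a made-up array (the values below are hypothetical, not taken from the stock file), np.isfinite can flag both cases at once and show where they occur:

```python
import numpy as np

# Hypothetical sample values standing in for a price column.
prices = np.array([10.5, np.nan, 12.0, np.inf, 11.25], dtype=np.float64)

nan_mask = np.isnan(prices)       # True where the value is NaN
inf_mask = np.isinf(prices)       # True where the value is +inf or -inf
bad_mask = ~np.isfinite(prices)   # True for NaN or +/-inf

# Counts of each kind of problem value.
print(nan_mask.sum(), inf_mask.sum(), bad_mask.sum())

# Positions of the offending entries, useful for tracing them back to rows.
print(np.flatnonzero(bad_mask))
```

Running the same masks on openingPriceTrain and the other arrays should point to the rows that trigger the ValueError.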

If you impute the missing values, for example with the median of each array, as below:

openingPriceTrain[np.isnan(openingPriceTrain)] = np.median(openingPriceTrain[~np.isnan(openingPriceTrain)])
closingPriceTrain[np.isnan(closingPriceTrain)] = np.median(closingPriceTrain[~np.isnan(closingPriceTrain)])
openingPriceTest[np.isnan(openingPriceTest)] = np.median(openingPriceTest[~np.isnan(openingPriceTest)])

your regression will run without a problem:

regression = linear_model.LinearRegression()

regression.fit(openingPriceTrain, closingPriceTrain)

predicted = regression.predict(openingPriceTest)

predicted[:5]

array([[ 13598.74748173],
       [ 53281.04442146],
       [ 18305.4272186 ],
       [ 50753.50958453],
       [ 14937.65782778]])
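
The manual imputation above fills each array with its own median. A different option, not used in the original answer, is scikit-learn's SimpleImputer inside a pipeline, so that the median is learned from the training features only and reused at prediction time. The sketch below uses small hypothetical arrays rather than the stock data, and it only imputes the features; rows with a missing target would still need to be dropped or filled separately.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical data following y = 2 * x, with one missing feature per split.
X_train = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
X_test = np.array([[3.0], [np.nan]])

# The imputer learns the median of X_train during fit and applies
# that same value to any NaN it meets in X_test during predict.
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
model.fit(X_train, y_train)

predicted = model.predict(X_test)
print(predicted.shape)
print(np.isnan(predicted).any())
```

Keeping the imputation inside the pipeline avoids computing statistics on the test set, which the per-array median fill above does.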

In short: you have missing values in your data, as the error message said.

EDIT:

Perhaps an easier and more straightforward approach would be to check for missing data right after you read it with pandas:

data = pd.read_csv('./data/all-stocks-cleaned.csv')
data.isnull().any()
Date                    False
Open                     True
High                     True
Low                      True
Last                     True
Close                    True
Total Trade Quantity     True
Turnover (Lacs)          True

and then impute the data with either of the two lines below:

data = data.fillna(data.median(numeric_only=True))

or

data = data.ffill()
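
To see how a median fill and a forward fill differ, here is a minimal sketch on a small hypothetical frame. It reuses the Date, Open and Close column names from the output above, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values are made up, not read from the CSV.
data = pd.DataFrame({
    "Date": ["2015-01-01", "2015-01-02", "2015-01-03", "2015-01-04"],
    "Open": [10.0, np.nan, 14.0, 12.0],
    "Close": [11.0, 13.0, np.nan, 12.0],
})

# Median fill: each NaN in a numeric column becomes that column's median.
median_filled = data.fillna(data.median(numeric_only=True))

# Forward fill: each NaN takes the last valid value above it.
forward_filled = data.ffill()

print(median_filled)
print(forward_filled)

# Confirm that no missing values remain in either result.
print(median_filled.isnull().any().any(), forward_filled.isnull().any().any())
```

Note that a forward fill cannot fill a NaN in the first row, since there is no earlier value to copy, so it is worth re-running data.isnull().any() afterwards.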
