Pandas + sklearn linear regression fails

I am trying to implement a linear regression model in Python. The code I used is below.

import pandas
salesPandas = pandas.DataFrame.from_csv('home_data.csv')

# check the shape of the DataFrame (rows, columns)
salesPandas.shape
(21613, 20)

from sklearn.cross_validation import train_test_split

train_dataPandas, test_dataPandas = train_test_split(salesPandas, train_size=0.8, random_state=1)

from sklearn.linear_model import LinearRegression

reg_model_Pandas = LinearRegression()

print type(train_dataPandas)
print train_dataPandas.shape
<class 'pandas.core.frame.DataFrame'>
(17290, 20)

print type(train_dataPandas['price'])
print train_dataPandas['price'].shape
<class 'pandas.core.series.Series'>
(17290L,)

X = train_dataPandas
y = train_dataPandas['price']
reg_model_Pandas.fit(X, y)

After executing the Python code above, the following error appears:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-dc363e199032> in <module>()
      3 X = train_dataPandas
      4 y = train_dataPandas['price']
----> 5 reg_model_Pandas.fit(X, y)

C:\Users\...\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\linear_model\base.py in fit(self, X, y, n_jobs)
    374             n_jobs_ = self.n_jobs
    375         X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 376                          y_numeric=True, multi_output=True)
    377 
    378         X, y, X_mean, y_mean, X_std = self._center_data(

C:\Users\...\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric)
    442     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    443                     ensure_2d, allow_nd, ensure_min_samples,
--> 444                     ensure_min_features)
    445     if multi_output:
    446         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

C:\Users\...\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features)
    342             else:
    343                 dtype = None
--> 344         array = np.array(array, dtype=dtype, order=order, copy=copy)
    345         # make sure we actually converted to numeric:
    346         if dtype_numeric and array.dtype.kind == "O":

ValueError: invalid literal for float(): 20140610T000000

Output from train_dataPandas.info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17290 entries, 4058200630 to 1762600320
Data columns (total 20 columns):
date             17290 non-null object
price            17290 non-null int64
bedrooms         17290 non-null int64
bathrooms        17290 non-null float64
sqft_living      17290 non-null int64
sqft_lot         17290 non-null int64
floors           17290 non-null float64
waterfront       17290 non-null int64
view             17290 non-null int64
condition        17290 non-null int64
grade            17290 non-null int64
sqft_above       17290 non-null int64
sqft_basement    17290 non-null int64
yr_built         17290 non-null int64
yr_renovated     17290 non-null int64
zipcode          17290 non-null int64
lat              17290 non-null float64
long             17290 non-null float64
sqft_living15    17290 non-null int64
sqft_lot15       17290 non-null int64
dtypes: float64(4), int64(15), object(1)
memory usage: 2.8+ MB
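
The info() output above shows one column with dtype object, which is what the float conversion in the traceback trips over. As a minimal sketch, such columns can be listed programmatically with select_dtypes. The small DataFrame below is a made-up stand-in for home_data.csv (the price and bathrooms values are illustrative only), and the code assumes Python 3 with a recent pandas.

```python
import pandas as pd

# Hypothetical miniature of the DataFrame from the question;
# only the date strings are taken from the post, the rest is illustrative.
df = pd.DataFrame({
    "date": ["20141013T000000", "20140610T000000"],
    "price": [221900, 538000],
    "bathrooms": [1.0, 2.25],
})

# Every column that is not numeric would have to be converted or dropped
# before the frame can be cast to a float array.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

On the real data this check would be expected to report only the date column, matching the single object entry in the dtypes summary.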

So, thanks to EdChum, the solution so far is the following:

  1. First I loaded the data.
  2. salesPandas.info() shows that the date column has dtype object:

 Int64Index: 21613 entries, 7129300520 to 1523300157
 Data columns (total 20 columns):
 date 21613 non-null object

     This is a problem because sklearn cannot use a date stored as an object (string) column.

  3. With salesPandas.head(), the date of the first row is

20141013T000000

     Note the T: the value is a string, not a number, so it cannot be converted to float.

  4. sklearn.linear_model.LinearRegression().fit() expects numeric NumPy arrays. pandas is built on NumPy, so the DataFrame is converted to an array when it is passed in, and that conversion fails on the string column.
  5. So first convert the object column to datetime, and then convert it to numeric:

salesPandas['date'] = pandas.to_datetime(salesPandas['date'], format='%Y%m%dT%H%M%S')

salesPandas['date'] = pandas.to_numeric(salesPandas['date'])

  6. If you then redo the split, rebuild X and y from the converted data, and run

    reg_model_Pandas.fit(X, y)

     it works.
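
The steps above can be sketched end to end as follows. This is a minimal illustration, not the original notebook: the four rows are made up to stand in for home_data.csv, the code assumes Python 3 with recent pandas and scikit-learn, and it deliberately deviates from the question in one place by dropping price from X, since leaving the target among the features would let the model simply read off the answer. The conversion is applied before X and y are built, because a split made earlier would still hold the string column.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up rows standing in for home_data.csv (values are illustrative only).
sales = pd.DataFrame({
    "date": ["20141013T000000", "20140610T000000",
             "20150225T000000", "20141209T000000"],
    "sqft_living": [1180, 2570, 770, 1960],
    "price": [221900, 538000, 180000, 604000],
})

# Step 1: parse the string column into datetime64 values.
sales["date"] = pd.to_datetime(sales["date"], format="%Y%m%dT%H%M%S")

# Step 2: turn the datetimes into integers (elapsed time since the epoch).
sales["date"] = pd.to_numeric(sales["date"])

# Assumption: the target column is excluded from the features.
X = sales.drop(columns=["price"])
y = sales["price"]

model = LinearRegression().fit(X, y)
print(model.coef_.shape)
```

The integer timestamps are very large compared with the other features, so scaling them, or replacing them with date components, may give coefficients that are easier to interpret.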

Another possible solution, depending on your data, is to specify parse_dates when reading the data from the file:

import pandas
salesPandas = pandas.read_csv('home_data.csv', parse_dates=['date'])

This is helpful because, before passing the data to be fitted, you can break the date up into components such as month, day, and hour. This assumes that most of the variation in your data lies in those components and not in the year (i.e. there are only about 3-4 unique years in total).

From here you can use the datetime-like properties of the column and get the month with salesPandas['date'].dt.month; for day and hour, replace the attribute accordingly.
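
A minimal sketch of that decomposition is shown below. It assumes the date column is already datetime64 (here it is parsed with an explicit format rather than read from a file), uses two made-up rows in place of home_data.csv, and introduces the column names month, day, and hour purely for illustration.

```python
import pandas as pd

# Two made-up rows; the date strings follow the format shown in the question.
sales = pd.DataFrame({
    "date": pd.to_datetime(["20141013T000000", "20140610T000000"],
                           format="%Y%m%dT%H%M%S"),
    "price": [221900, 538000],
})

# Split the datetime into numeric components via the .dt accessor.
sales["month"] = sales["date"].dt.month
sales["day"] = sales["date"].dt.day
sales["hour"] = sales["date"].dt.hour

# Drop the raw datetime so only numeric columns remain for fitting.
features = sales.drop(columns=["date"])
print(features)
```

In this sample every timestamp is at midnight, so the hour column is constant and would add nothing to a model; it is included only to mirror the text.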
