Pandas + sklearn Linear regression fails
I am trying to implement some linear regression models in Python. See the code below, which I used to run a linear regression.
import pandas
salesPandas = pandas.DataFrame.from_csv('home_data.csv')
# check the shape of the DataFrame (rows, columns)
salesPandas.shape
(21613, 20)
from sklearn.cross_validation import train_test_split
train_dataPandas, test_dataPandas = train_test_split(salesPandas, train_size=0.8, random_state=1)
from sklearn.linear_model import LinearRegression
reg_model_Pandas = LinearRegression()
print type(train_dataPandas)
print train_dataPandas.shape
<class 'pandas.core.frame.DataFrame'>
(17290, 20)
print type(train_dataPandas['price'])
print train_dataPandas['price'].shape
<class 'pandas.core.series.Series'>
(17290L,)
X = train_dataPandas
y = train_dataPandas['price']
reg_model_Pandas.fit(X, y)
After running the Python code above, I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-dc363e199032> in <module>()
3 X = train_dataPandas
4 y = train_dataPandas['price']
----> 5 reg_model_Pandas.fit(X, y)
C:\Users\...\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\linear_model\base.py in fit(self, X, y, n_jobs)
374 n_jobs_ = self.n_jobs
375 X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 376 y_numeric=True, multi_output=True)
377
378 X, y, X_mean, y_mean, X_std = self._center_data(
C:\Users\...\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric)
442 X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
443 ensure_2d, allow_nd, ensure_min_samples,
--> 444 ensure_min_features)
445 if multi_output:
446 y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,
C:\Users\...\AppData\Local\Continuum\Anaconda2\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features)
342 else:
343 dtype = None
--> 344 array = np.array(array, dtype=dtype, order=order, copy=copy)
345 # make sure we actually converted to numeric:
346 if dtype_numeric and array.dtype.kind == "O":
ValueError: invalid literal for float(): 20140610T000000
Output of train_dataPandas.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 17290 entries, 4058200630 to 1762600320
Data columns (total 20 columns):
date 17290 non-null object
price 17290 non-null int64
bedrooms 17290 non-null int64
bathrooms 17290 non-null float64
sqft_living 17290 non-null int64
sqft_lot 17290 non-null int64
floors 17290 non-null float64
waterfront 17290 non-null int64
view 17290 non-null int64
condition 17290 non-null int64
grade 17290 non-null int64
sqft_above 17290 non-null int64
sqft_basement 17290 non-null int64
yr_built 17290 non-null int64
yr_renovated 17290 non-null int64
zipcode 17290 non-null int64
lat 17290 non-null float64
long 17290 non-null float64
sqft_living15 17290 non-null int64
sqft_lot15 17290 non-null int64
dtypes: float64(4), int64(15), object(1)
memory usage: 2.8+ MB
So, thanks to EdChum, the solution so far is:
Int64Index: 21613 entries, 7129300520 to 1523300157
Data columns (total 20 columns):
date 21613 non-null object
This is the problem: sklearn cannot use the date while it is stored as an object (string) column.
20141013T000000
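Do you see the T? That is what breaks it: the string cannot be parsed as a float.
sklearn.linear_model.LinearRegression().fit() expects numeric, array-like input. A DataFrame is built on top of NumPy and is converted to a NumPy array internally, so every column has to be convertible to a number.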
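To see the failure in isolation, here is a minimal sketch that uses a tiny made-up frame in place of home_data.csv (the two rows and the sqft_living values are illustrative assumptions, not data from the question). It only shows that a frame containing the date strings cannot be converted to a float array, which is roughly what the validation step in the traceback attempts.

```python
import pandas as pd

# Made-up stand-in for the real data; only the 'date' format matches the question.
df = pd.DataFrame({
    "date": ["20141013T000000", "20140610T000000"],
    "sqft_living": [1180, 2570],
})

try:
    # Ask for a float array, similar in spirit to what sklearn's input check does.
    df.to_numpy(dtype=float)
    conversion_failed = False
except ValueError as exc:
    # The string column cannot be turned into floats.
    conversion_failed = True
    print("conversion failed:", exc)
```

The exact wording of the error message depends on the Python and NumPy versions, so it may not match the traceback above word for word.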
So first convert the object column to datetime, and then convert that to a number:
salesPandas['date'] = pandas.to_datetime(salesPandas['date'], format='%Y%m%dT%H%M%S')
salesPandas['date'] = pandas.to_numeric(salesPandas['date'])
If you then redo the split, so that X and y come from the converted frame, and run
reg_model_Pandas.fit(X, y)
it works.
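Put together, the conversion and the fit might look like the sketch below. It assumes a recent pandas and scikit-learn, where train_test_split lives in sklearn.model_selection rather than sklearn.cross_validation, and it uses six made-up rows instead of home_data.csv. One deliberate difference from the question: the sketch drops price from X, because leaving the target among the features lets the model simply read the answer off its input.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up rows standing in for home_data.csv (values are illustrative only).
sales = pd.DataFrame({
    "date": ["20141013T000000", "20140610T000000", "20150225T000000",
             "20141209T000000", "20150218T000000", "20140512T000000"],
    "sqft_living": [1180, 2570, 770, 1960, 1680, 5420],
    "price": [221900, 538000, 180000, 604000, 510000, 1225000],
})

# Step 1: parse the strings into real datetimes.
sales["date"] = pd.to_datetime(sales["date"], format="%Y%m%dT%H%M%S")
# Step 2: turn the datetimes into plain integers (an epoch-based timestamp).
sales["date"] = pd.to_numeric(sales["date"])

# Split AFTER the conversion so the training frame has the numeric column.
train, test = train_test_split(sales, train_size=0.8, random_state=1)

# Assumption of this sketch: 'price' is the target, so it is removed from X.
X = train.drop(columns=["price"])
y = train["price"]

model = LinearRegression().fit(X, y)
print(model.coef_.shape)  # one coefficient per feature column
```

The integer timestamp is very large compared with the other features, so in practice it may be worth rescaling it or using the date parts instead.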
Another possible solution, depending on your data, is to specify parse_dates when reading the file, for example:
import pandas
salesPandas = pandas.read_csv('home_data.csv', parse_dates=['date'])
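This is useful because, before passing the data to fit, you can break the date down into month, day, and hour. It assumes that most of the variation in your data lies in those components rather than in the year (i.e., you only have about 3-4 unique years).
Here you can use the datetime-like properties of the column through the .dt accessor: salesPandas['date'].dt.month gives the month, and you can do the same for the day and the hour.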
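As a rough sketch of that idea, the snippet below builds month, day, and hour columns from an already parsed date column. It uses two made-up rows and parses them with pandas.to_datetime rather than read_csv, only so that it runs without the CSV file; the new column names are my own choice.

```python
import pandas as pd

# Made-up rows; in the real case 'date' would come from read_csv(..., parse_dates=['date']).
sales = pd.DataFrame({
    "date": pd.to_datetime(["20141013T000000", "20140610T000000"],
                           format="%Y%m%dT%H%M%S"),
    "price": [221900, 538000],
})

# Pull the individual components out through the .dt accessor.
sales["month"] = sales["date"].dt.month
sales["day"] = sales["date"].dt.day
sales["hour"] = sales["date"].dt.hour

# Drop the raw datetime so only numeric columns remain for the model.
features = sales.drop(columns=["date", "price"])
print(features)
```

In this dataset every timestamp appears to end in T000000, so the hour column would likely be constant and could be left out.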