
Sklearn multiple training sets

I'm experimenting with sklearn and the diabetes dataset in order to fit a linear regression. So far I've done:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

Then I chose 3 columns (indexes 0, 2 and 3): age, bmi and bp.

diabetes_Xage = diabetes_X[:, np.newaxis, 0] #age
diabetes_Xbmi = diabetes_X[:, np.newaxis, 2] #bmi
diabetes_Xbp = diabetes_X[:, np.newaxis, 3] #bp

Then I split the data 80/20, but I want to combine 4 data sets. I've done it like this:

diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
    diabetes_Xage, diabetes_y, test_size=0.8, random_state=0)

diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
    diabetes_Xbmi, diabetes_y, test_size=0.8, random_state=0)

diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
    diabetes_Xbp, diabetes_y, test_size=0.8, random_state=0)

Now I'm trying to fit the linear regression and print its coefficients:

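# Create the linear regression object and train it on the training set
# (this step is not shown in the original snippet; regr must be fitted before predict)
regr = linear_model.LinearRegression()
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# Coefficients
print("Coefficients: \n", regr.coef_)
# Mean squared error
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# Coefficient of determination
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))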

And the outcome is:

Coefficients: 
 [815.11490401]
Mean squared error: 4695.76
Coefficient of determination: 0.18

My problem is that I have 3 datasets, and the code I currently have takes into account only the last one entered (diabetes_Xbp). How should I correct the code so that the result shows the outcome of all 4 data sets combined?

Every time you call train_test_split() you overwrite the previous assignments to diabetes_X_train, diabetes_X_test, diabetes_y_train and diabetes_y_test, so only the result of the last call survives.
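A minimal sketch of the overwriting, assuming the same columns and split settings as in the question. The loop and the names X_train, X_single and bp_train are introduced here for illustration only; they are not from the original code.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Column indexes used in the question: age, bmi, bp.
columns = {"age": 0, "bmi": 2, "bp": 3}

for name, idx in columns.items():
    X_single = diabetes_X[:, np.newaxis, idx]
    # Each pass rebinds the same four names, discarding the previous split.
    X_train, X_test, y_train, y_test = train_test_split(
        X_single, diabetes_y, test_size=0.8, random_state=0)

# Split the bp column on its own, with the same settings, for comparison.
bp_train, _, _, _ = train_test_split(
    diabetes_X[:, [3]], diabetes_y, test_size=0.8, random_state=0)

print(X_train.shape[1])                    # number of feature columns left
print(np.array_equal(X_train, bp_train))   # is X_train just the bp split?
```

After the loop, X_train holds a single column that matches the bp split, which is why the regression in the question reports only one coefficient.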

I would first store the 3 diabetes variables in a single NumPy array: diabetes = diabetes_X[:, [0, 2, 3]]
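As a quick check, the sketch below compares that indexing with stacking the three single-column arrays from the question side by side. The name stacked is illustrative, and np.hstack is just one alternative way to build the same array.

```python
import numpy as np
from sklearn import datasets

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# The three single-column arrays from the question.
diabetes_Xage = diabetes_X[:, np.newaxis, 0]
diabetes_Xbmi = diabetes_X[:, np.newaxis, 2]
diabetes_Xbp = diabetes_X[:, np.newaxis, 3]

# Selecting the columns with a list of indexes gives one (n_samples, 3) array.
diabetes = diabetes_X[:, [0, 2, 3]]

# Stacking the single-column arrays horizontally yields the same result.
stacked = np.hstack([diabetes_Xage, diabetes_Xbmi, diabetes_Xbp])

print(diabetes.shape)
print(np.array_equal(diabetes, stacked))
```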

Then you can make a single call to the data splitter:

diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(
    diabetes, diabetes_y, test_size=0.8, random_state=0)

Additionally, setting test_size=0.8 means you are training on 20% of the data and evaluating on 80%. I think you want it the other way around.
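Putting this together, a sketch of the whole pipeline might look like the following, assuming test_size=0.2 is the intended 80/20 split. The shortened variable names are illustrative, and the printed values depend on the split, so none are shown here.

```python
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

# Keep age, bmi and bp together in one feature matrix.
diabetes = diabetes_X[:, [0, 2, 3]]

# One split: 80% of the rows for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    diabetes, diabetes_y, test_size=0.2, random_state=0)

# Fit a single model on all three features at once.
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)

# One coefficient per feature, in the order age, bmi, bp.
print("Coefficients: \n", regr.coef_)
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
print("Coefficient of determination: %.2f" % r2_score(y_test, y_pred))
```

With the three columns in one matrix, regr.coef_ contains three values instead of one, which is the combined result the question asks for.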

Regarding your final question, it is hard to say whether performance will go up with additional data. Most likely some additional features will improve performance, but they can also lead to overfitting. Try taking a look at sklearn's feature selection methods.
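As one possible starting point, the sketch below uses SelectKBest with f_regression to rank all ten diabetes features by a univariate score and keep the top three. The choice of this particular selector and of k=3 is an assumption made for illustration, not something the answer prescribes.

```python
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split

data = datasets.load_diabetes()

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=0)

# Score each feature against the target using the training rows only,
# so that the test set does not influence which features are kept.
selector = SelectKBest(score_func=f_regression, k=3)
selector.fit(X_train, y_train)

# Map the boolean support mask back to the feature names.
selected = [name for name, keep in zip(data.feature_names, selector.get_support()) if keep]
print("Selected features:", selected)

# Reduce both sets to the selected columns before fitting a model.
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
```

Comparing the test-set scores of models fitted on different feature subsets is a reasonable way to see whether extra features help or only add noise.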
