Trying to understand an example script on ML

Question

I'm trying to work through an example script on machine learning, "Common pitfalls in interpretation of coefficients of linear models", but I'm having trouble understanding some of the steps. The beginning of the script looks like this:

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_openml

survey = fetch_openml(data_id=534, as_frame=True)

# We identify features `X` and targets `y`: the column WAGE is our
# target variable (i.e., the variable which we want to predict).
X = survey.data[survey.feature_names]
X.describe(include="all")

X.head()

# Our target for prediction is the wage.
y = survey.target.values.ravel()
survey.target.head()

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
_ = sns.pairplot(train_dataset, kind='reg', diag_kind='kde')

My problem is with the lines

y = survey.target.values.ravel()
survey.target.head()

If we examine survey.target.head() immediately after these lines, the output is

Out[36]: 
0    5.10
1    4.95
2    6.67
3    4.00
4    7.50
Name: WAGE, dtype: float64

How does the model know that WAGE is the target variable? Does it not have to be explicitly declared?

Answer 1

The line survey.target.values.ravel() is meant to flatten the array, but in this example it is not necessary. survey.target is a pandas Series (comparable to a one-column DataFrame) and survey.target.values is a numpy array. You can use either one for the train/test split, since survey.target holds only one column.

type(survey.target)
pandas.core.series.Series

type(survey.target.values)
numpy.ndarray
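
To make the types above concrete, here is a small sketch that uses a hand-built Series as a stand-in for survey.target (the values are copied from the output in the question, and the variable name `target` is only illustrative). It suggests that `.values` already gives a one-dimensional array, so `ravel()` has nothing left to flatten.

```python
import numpy as np
import pandas as pd

# Stand-in for survey.target: a Series named WAGE (values taken from the question's output).
target = pd.Series([5.10, 4.95, 6.67, 4.00, 7.50], name="WAGE")

values = target.values        # 1-D numpy array
flattened = values.ravel()    # ravel() on a 1-D array leaves shape and contents unchanged

print(type(target).__name__, type(values).__name__)
print(values.shape, flattened.shape)
print(np.array_equal(values, flattened))
```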

If you use just survey.target, you can see that the regression plot still works:

y = survey.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
sns.pairplot(train_dataset, kind='reg', diag_kind='kde')
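
As for how WAGE becomes the target: the model itself never knows. In the original script the separation happens when the data is loaded, because fetch_openml has a target_column parameter whose default, 'default-target', uses the target recorded in the OpenML dataset metadata. After that, the target is whatever you pass as y. The sketch below shows the same idea done by hand on toy data; the values are made up and the frame is a hypothetical stand-in, not the real survey.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with made-up values (not the real survey).
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "EDUCATION": rng.integers(8, 18, size=50),
    "EXPERIENCE": rng.integers(0, 40, size=50),
})
frame["WAGE"] = (0.8 * frame["EDUCATION"]
                 + 0.1 * frame["EXPERIENCE"]
                 + rng.normal(0, 1, size=50))

# The target is "declared" simply by choosing which column becomes y.
y = frame["WAGE"]
X = frame.drop(columns="WAGE")

# fit(X, y): the second argument is what the estimator treats as the target.
model = LinearRegression().fit(X, y)
print(dict(zip(X.columns, model.coef_)))
```

Swapping in a different column for y would make that column the target, with no other declaration needed.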

Suppose you have another dataset, for example iris, and want to regress petal width against the other measurements. You would select the columns of the DataFrame using square brackets []:

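import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load iris as a single DataFrame (features plus target column).
dat = load_iris(as_frame=True).frame

# Features: three of the measurements; target: petal width.
X = dat[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']]
y = dat[['petal width (cm)']]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit on the training split, then compare predictions with the held-out values.
LR = LinearRegression()
LR.fit(X_train, y_train)
plt.scatter(x=y_test, y=LR.predict(X_test))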
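
One detail of the bracket syntax is worth a quick check. Double brackets return a one-column DataFrame, while single brackets return a Series, and the shape of the predictions follows the shape of y. The sketch below fits on the full iris frame purely to compare shapes; the names `y_frame`, `y_series`, `pred_2d` and `pred_1d` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression

dat = load_iris(as_frame=True).frame
X = dat[["sepal length (cm)", "sepal width (cm)", "petal length (cm)"]]

y_frame = dat[["petal width (cm)"]]   # double brackets: one-column DataFrame
y_series = dat["petal width (cm)"]    # single brackets: Series

# Same model, same data; only the dimensionality of y differs.
pred_2d = LinearRegression().fit(X, y_frame).predict(X)
pred_1d = LinearRegression().fit(X, y_series).predict(X)

print(y_frame.shape, pred_2d.shape)
print(y_series.shape, pred_1d.shape)
```

Both forms fit the same model; the Series form gives one-dimensional predictions, which is often more convenient for plotting and metrics.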

Answer 1: accepted, 2020-10-05 16:13:05
