Trying to understand an example script on ML

I'm trying to work through an example script on machine learning, "Common pitfalls in interpretation of coefficients of linear models", but I'm having trouble understanding some of the steps. The beginning of the script looks like this:

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import fetch_openml

survey = fetch_openml(data_id=534, as_frame=True)

# We identify features `X` and targets `y`: the column WAGE is our
# target variable (i.e., the variable which we want to predict).
X = survey.data[survey.feature_names]
X.describe(include="all")

X.head()

# Our target for prediction is the wage.
y = survey.target.values.ravel()
survey.target.head()

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
_ = sns.pairplot(train_dataset, kind='reg', diag_kind='kde')

My problem is with these lines:

y = survey.target.values.ravel()
survey.target.head()

If we examine survey.target.head() immediately after these lines, the output is:

Out[36]: 
0    5.10
1    4.95
2    6.67
3    4.00
4    7.50
Name: WAGE, dtype: float64

How does the model know that WAGE is the target variable? Does it not have to be explicitly declared?

Nothing in the script has to declare the target, because the loader has already done so: fetch_openml returns the column that the OpenML dataset marks as its default target as survey.target, and for this dataset that column is WAGE. The line survey.target.values.ravel() is meant to flatten the array, but in this example it is not necessary. survey.target is a pandas Series (a single one-dimensional column, not a DataFrame) and survey.target.values is a one-dimensional numpy array. You can use either one for the train/test split, since survey.target holds only one column.
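
On the question of how WAGE becomes the target: that choice is made when the data is loaded, not by any model. The following is a minimal sketch, assuming a reasonably recent scikit-learn, that reads the default value of fetch_openml's target_column parameter from its signature, so nothing is downloaded. The helper load_wages is hypothetical and only illustrates how the target could be named explicitly instead.

```python
import inspect

from sklearn.datasets import fetch_openml

# Read the default of the `target_column` parameter without downloading anything.
default_target = inspect.signature(fetch_openml).parameters["target_column"].default
print(default_target)


def load_wages():
    """Hypothetical helper: name the target explicitly instead of using the default.

    It needs network access the first time it runs, so it is not called here.
    """
    return fetch_openml(data_id=534, as_frame=True, target_column="WAGE")
```

With the default setting, the loader uses whichever column the OpenML metadata marks as the target, which is why survey.target already holds WAGE in the script.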

type(survey.target)
pandas.core.series.Series

type(survey.target.values)
numpy.ndarray
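
To see why ravel() changes nothing here, the sketch below uses a small stand-in Series built from the five wages printed in the question (an assumption made so the example runs without downloading the dataset). The names target and column_frame are illustrative only.

```python
import numpy as np
import pandas as pd

# Stand-in for survey.target: the first five wages shown in the question.
target = pd.Series([5.10, 4.95, 6.67, 4.00, 7.50], name="WAGE")

values = target.values      # one-dimensional NumPy array
flattened = values.ravel()  # still one-dimensional, same contents

print(type(target).__name__, target.shape)
print(type(values).__name__, values.shape)
print(np.array_equal(values, flattened))

# ravel() only matters when the array has more than one dimension,
# for example the values of a one-column DataFrame.
column_frame = target.to_frame()
print(column_frame.values.shape, column_frame.values.ravel().shape)
```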

If we use just survey.target, you can see that the regression still works:

y = survey.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

train_dataset = X_train.copy()
train_dataset.insert(0, "WAGE", y_train)
sns.pairplot(train_dataset, kind='reg', diag_kind='kde')
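
One detail worth checking when y stays a Series: train_test_split keeps the index, and DataFrame.insert aligns a Series on that index, whereas a NumPy array is inserted by position. The sketch below uses a small synthetic table (values made up for illustration, not taken from the survey) to show that both routes give the same WAGE column.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data; the numbers are invented.
X = pd.DataFrame({
    "EDUCATION": [8, 9, 12, 12, 12, 13, 14, 16],
    "AGE": [35, 57, 19, 22, 35, 28, 43, 27],
})
y = pd.Series([5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0], name="WAGE")

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# y_train is still a Series and carries the same shuffled index as X_train.
print(y_train.index.equals(X_train.index))

# Series: insert() aligns on the index. Array: insert() goes by position.
from_series = X_train.copy()
from_series.insert(0, "WAGE", y_train)

from_array = X_train.copy()
from_array.insert(0, "WAGE", y_train.values)

print(from_series["WAGE"].equals(from_array["WAGE"]))
```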


The same applies to other datasets. For example, to regress petal width against the remaining measurements in iris, you would select the columns of the DataFrame using square brackets []:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

dat = load_iris(as_frame=True).frame

X = dat[['sepal length (cm)','sepal width (cm)','petal length (cm)']]
y = dat[['petal width (cm)']]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

LR = LinearRegression()
LR.fit(X_train,y_train)
plt.scatter(x=y_test,y=LR.predict(X_test))
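
Note that the double brackets in dat[['petal width (cm)']] produce a one-column DataFrame, while single brackets would produce a Series. A minimal sketch of the difference, assuming scikit-learn's bundled iris data; the variable names y_frame and y_series are illustrative:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

dat = load_iris(as_frame=True).frame
features = ["sepal length (cm)", "sepal width (cm)", "petal length (cm)"]

X = dat[features]
y_frame = dat[["petal width (cm)"]]  # double brackets: one-column DataFrame
y_series = dat["petal width (cm)"]   # single brackets: Series

# Split all three together so both targets share the same rows.
X_train, X_test, yf_train, yf_test, ys_train, ys_test = train_test_split(
    X, y_frame, y_series, random_state=42
)

pred_2d = LinearRegression().fit(X_train, yf_train).predict(X_test)
pred_1d = LinearRegression().fit(X_train, ys_train).predict(X_test)

# A 2-D target gives 2-D predictions; a 1-D target gives 1-D predictions.
print(pred_2d.shape, pred_1d.shape)
print(np.allclose(pred_2d.ravel(), pred_1d))
```

Both forms fit the same model; only the shape of the predictions differs, which is the situation where ravel() becomes useful.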
