Linear regression with Python using sklearn
I am trying to do linear regression with Python.
For example:
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
x = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]
y = [[7], [9], [13], [17.5], [18]]
model = LinearRegression()
model.fit(x, y)
x_test = [[8, 2]]
Now my example_data looks like this:
inches city Pizza_Price
5 A 10
6 B 12
inches is clearly a number, but city is not.
How can I convert a categorical parameter like city into a number for the calculation?
Code:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
cities = ['A','B','C','B'] # for example
le.fit(cities)
cities = le.transform(cities)
print(cities)
Output:
[0 1 2 1]
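To see which integer each city received, the fitted encoder can be inspected. This is a small sketch using the same example list; `mapping` and `restored` are names chosen here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

cities = ['A', 'B', 'C', 'B']  # same example list as above

le = LabelEncoder()
codes = le.fit_transform(cities)  # fit and transform in one step

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels
restored = [str(label) for label in le.inverse_transform(codes)]
print(restored)
```

Keeping the fitted `le` object around is what makes it possible to encode new rows consistently, or to translate codes back into city names later.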
You'll want to use pd.get_dummies() to convert each city into a binary value. A label encoder assigns an integer to each city, which implies an ordering (A < B < C) that the cities do not have. That makes the regression formula hard to interpret and can skew the fit. Remember to drop one of the dummy variables to avoid multicollinearity.
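A minimal sketch of that suggestion, assuming a small made-up DataFrame in the same format as example_data (the rows and prices below are illustrative, not taken from the question):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up rows in the same format as example_data (for illustration only)
df = pd.DataFrame({
    'inches': [5, 6, 7, 8, 9, 10],
    'city': ['A', 'B', 'C', 'A', 'B', 'C'],
    'Pizza_Price': [10, 12, 15, 13, 15, 18],
})

# One 0/1 column per city; drop_first=True drops the first category (A),
# which becomes the baseline and avoids perfectly collinear dummy columns
X = pd.get_dummies(df[['inches', 'city']], columns=['city'],
                   drop_first=True, dtype=int)
y = df['Pizza_Price']

model = LinearRegression().fit(X, y)

# Each city coefficient is the price shift relative to the baseline city A
for name, coef in zip(X.columns, model.coef_):
    print(name, round(float(coef), 3))
```

With this encoding every city gets its own coefficient, so the model does not have to assume that the price step from A to B equals the step from B to C.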
From your data shown in example_data, it looks like you are working with data in a Pandas DataFrame. So, I would suggest another possible approach to answering your question.
Here is some data I generated in the same format as yours but with extra rows:
import pandas as pd

d = [
    ['inches', 'city', 'Pizza_Price'],
    [5, 'A', 10],
    [6, 'B', 12],
    [7, 'C', 15],
    [8, 'D', 11],
    [9, 'B', 12],
    [10, 'C', 17],
    [11, 'D', 16]
]
# First row holds the column names; the rest are data rows
df = pd.DataFrame(d[1:], columns=d[0])
print(df)
inches city Pizza_Price
0 5 A 10
1 6 B 12
2 7 C 15
3 8 D 11
4 9 B 12
5 10 C 17
6 11 D 16
The conversion of the city column into integers can be done using LabelEncoder (as shown in this SO post), per @Wen-Ben's suggestion:
from sklearn.preprocessing import LabelEncoder

# Replace each city label with its integer code (A -> 0, B -> 1, ...)
df['city'] = LabelEncoder().fit_transform(df['city'])
print(df)
inches city Pizza_Price
0 5 0 10
1 6 1 12
2 7 2 15
3 8 3 11
4 9 1 12
5 10 2 17
6 11 3 16
Step 1. Perform the train-test split to get the training and testing data X_train, y_train, etc.
from sklearn.model_selection import train_test_split

features = ['inches', 'city']
target = 'Pizza_Price'
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)
# (OPTIONAL) Check number of rows in X and y of each split
print(len(X_train), len(y_train))
print(len(X_test), len(y_test))
4 4
3 3
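The 4/3 split follows from how a fractional test_size is turned into whole rows: 7 × 0.33 = 2.31, and the test share is rounded up. A quick check, using a plain list of 7 items as a stand-in for the rows of df:

```python
import math
from sklearn.model_selection import train_test_split

rows = list(range(7))  # stand-in for the 7 rows of df
train, test = train_test_split(rows, test_size=0.33, random_state=42)

# 7 * 0.33 = 2.31, and the test share is rounded up to whole rows
expected_test = math.ceil(len(rows) * 0.33)
print(len(train), len(test), expected_test)
```

With so few rows the metrics computed on the test set are only a rough indication, so they should be read with that in mind.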
Step 2. (Optional) Append a column to your source DataFrame (example_data) that shows which rows are used in training and testing
df['Type'] = 'Test'
df.loc[X_train.index, 'Type'] = 'Train'
Step 3. Instantiate the LinearRegression model and train the model using the training dataset - see this link from the sklearn docs
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
Step 4. Now, make out-of-sample predictions on the testing data and (optionally) append the predicted values as a separate column to the example_data
import numpy as np

# Rows used for training keep NaN in the prediction column
df['Predicted_Pizza_Price'] = np.nan
df.loc[X_test.index, 'Predicted_Pizza_Price'] = model.predict(X_test)
print(df)
inches city Pizza_Price Type Predicted_Pizza_Price
0 5 0 10 Test 11.0
1 6 1 12 Test 11.8
2 7 2 15 Train NaN
3 8 3 11 Train NaN
4 9 1 12 Train NaN
5 10 2 17 Test 14.0
6 11 3 16 Train NaN
Step 5. Generate model evaluation metrics (see point number 15. from here) and build a DataFrame showing both (a) the model evaluation metrics and (b) the model properties - the linear regression coefficients and the intercept
Create a blank dictionary to hold the model properties (coefficient, intercept) and evaluation metrics
dict_summary = {}
Add the coefficients and the intercept to the dictionary
# model.coef_ is ordered like the features list
for m, feature in enumerate(features):
    dict_summary['Coefficient ({})'.format(feature)] = model.coef_[m]
dict_summary['Intercept'] = model.intercept_
Add the evaluation metrics to the dictionary
from sklearn import metrics

# Actual and predicted prices for the test rows only
y_test = df.loc[X_test.index, 'Pizza_Price'].values
y_pred = df.loc[X_test.index, 'Predicted_Pizza_Price'].values
dict_summary['Mean Absolute Error (MAE)'] = metrics.mean_absolute_error(
    y_test, y_pred)
dict_summary['Mean Squared Error (MSE)'] = metrics.mean_squared_error(
    y_test, y_pred)
dict_summary['Root Mean Squared Error (RMSE)'] = np.sqrt(
    metrics.mean_squared_error(y_test, y_pred)
)
Convert the dictionary into a summary DataFrame showing the regression model properties and evaluation metrics
df_metrics = pd.DataFrame.from_dict(dict_summary, orient='index', columns=['value'])
df_metrics.index.name = 'metric'
df_metrics.reset_index(drop=False, inplace=True)
Output of the model evaluation DataFrame
print(df_metrics)
metric value
0 Coefficient (inches) 0.466667
1 Coefficient (city) 0.333333
2 Intercept 8.666667
3 Mean Absolute Error (MAE) 1.400000
4 Mean Squared Error (MSE) 3.346667
5 Root Mean Squared Error (RMSE) 1.829390
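The numbers in this summary can be checked by hand. The sketch below plugs the rounded coefficients into the fitted equation for the three test rows from Step 4 and recomputes the metrics with plain NumPy; the variable names are chosen here for illustration.

```python
import numpy as np

# Values copied from the summary table above (rounded to 6 decimals)
coef_inches, coef_city, intercept = 0.466667, 0.333333, 8.666667

# Test rows from the table in Step 4: (inches, encoded city, actual price)
test_rows = np.array([[5, 0, 10], [6, 1, 12], [10, 2, 17]], dtype=float)

# Fitted equation: price = intercept + coef_inches * inches + coef_city * city
y_true = test_rows[:, 2]
y_pred = intercept + coef_inches * test_rows[:, 0] + coef_city * test_rows[:, 1]

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error
print(np.round(y_pred, 1), round(mae, 2), round(mse, 2), round(rmse, 2))
```

Up to rounding, this reproduces the predictions 11.0, 11.8 and 14.0 and the MAE, MSE and RMSE rows of the table.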
Using this approach, since you have the results in two Pandas DataFrames, Pandas tools can be used to visualize the results of the regression analysis.
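As one possible illustration, here is a sketch that plots actual against predicted prices for the three test rows from Step 4. It assumes matplotlib is installed (pandas plotting depends on it); the Agg backend and the output file name pizza_predictions.png are assumptions made so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no display needed
import pandas as pd

# The three test rows from the Step 4 table
df_test = pd.DataFrame({
    'inches': [5, 6, 10],
    'Pizza_Price': [10, 12, 17],
    'Predicted_Pizza_Price': [11.0, 11.8, 14.0],
})

# Actual vs. predicted price against pizza size, drawn on the same axes
ax = df_test.plot(x='inches', y='Pizza_Price', style='o', label='Actual')
df_test.plot(x='inches', y='Predicted_Pizza_Price', style='x-',
             label='Predicted', ax=ax)
ax.set_ylabel('Pizza_Price')
ax.figure.savefig('pizza_predictions.png')  # hypothetical output file name
```

In an interactive session the backend line and savefig call can be dropped in favour of matplotlib.pyplot.show().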