Pulling example data point from linear regression model
I recently created a linear regression model in RStudio, as shown below:
> model1 = lm(price~sqft_living,train)
> pred_train = predict(model1)
> rmse_train = sqrt(mean((pred_train - train$price)^2))
> rmse_train
[1] 261068.9
> pred_test = predict(model1,newdata=test)
> rmse_test = sqrt(mean((pred_test - test$price)^2))
> rmse_test
[1] 262334.4
> sse = sum((pred_train - train$price)^2)
> sst = sum((mean(train$price)-train$price)^2)
> r2 = 1 - sse/sst
> r2
[1] 0.4967993
> summary(model1)
Call:
lm(formula = price ~ sqft_living, data = train)
Residuals:
     Min       1Q   Median       3Q      Max
-1491759  -146386   -24131   106578  4348558

Coefficients:
              Estimate Std. Error t value            Pr(>|t|)
(Intercept) -47764.278   5250.938  -9.096 <0.0000000000000002 ***
sqft_living    282.092      2.305 122.381 <0.0000000000000002 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 261100 on 15170 degrees of freedom
Multiple R-squared: 0.4968, Adjusted R-squared: 0.4968
F-statistic: 1.498e+04 on 1 and 15170 DF, p-value: < 0.00000000000000022
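My problem is that I need to answer: "Based on model1, on average, how much does a 1,400-square-foot house cost?"
This may sound a bit silly, but I don't know how to get this from my model, and I haven't found it by searching online. Any help would be greatly appreciated.
Below is some code showing what the data set looks like: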
> dput(head(houses))
structure(list(id = c(7129300520, 6414100192, 5631500400, 2487200875,
1954400510, 7237550310), price = c(221900, 538000, 180000, 604000,
510000, 1225000), bedrooms = c(3, 3, 2, 4, 3, 4), bathrooms = c(1,
2.25, 1, 3, 2, 4.5), sqft_living = c(1180, 2570, 770, 1960, 1680,
5420), sqft_lot = c(5650, 7242, 10000, 5000, 8080, 101930), floors = c(1,
2, 1, 1, 1, 1), waterfront = c(0, 0, 0, 0, 0, 0), view = c(0,
0, 0, 0, 0, 0), condition = c(3, 3, 3, 5, 3, 3), grade = c(7,
7, 6, 7, 8, 11), sqft_above = c(1180, 2170, 770, 1050, 1680,
3890), sqft_basement = c(0, 400, 0, 910, 0, 1530), yr_built = c(1955,
1951, 1933, 1965, 1987, 2001), yr_renovated = c(0, 1991, 0, 0,
0, 0), age = c(59, 63, 82, 49, 28, 13)), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
> glimpse(houses)
Rows: 21,613
Columns: 16
$ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 195440051…
$ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, …
$ bedrooms <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, …
$ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.5…
$ sqft_living <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890…
$ sqft_lot <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, …
$ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.…
$ waterfront <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ view <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, …
$ condition <dbl> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4, …
$ grade <dbl> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7,…
$ sqft_above <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 1890…
$ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, 0…
$ yr_built <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 200…
$ yr_renovated <dbl> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ age <dbl> 59, 63, 82, 49, 28, 13, 19, 52, 55, 12, 50, 72, 87, 37, 1…
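To predict the response for new values of the regressor, just create a new data set and pass it to predict. The objects returned by R's modelling functions are S3 class objects, so there are most likely methods written for them; in this case, predict.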
model <- lm(price ~ sqft_living, houses)
new <- data.frame(sqft_living = 1400)
predict(model, newdata = new)
# 1
#357469.5
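Because the question asks for the price "on average", it may also help to attach an interval to the estimated mean. The sketch below is an illustration, not part of the original answer. It assumes only the six rows shown in the question's dput(head(houses)) output, which appear to be the data behind the numbers in this answer, so the results will differ from a fit on the full data set. The names houses6 and fit6 are made up for this example. It uses the interval = "confidence" argument of predict for lm objects and checks the point estimate against the intercept-plus-slope formula.

```r
# Six rows copied from the question's dput(head(houses)) output.
# `houses6` and `fit6` are hypothetical names used only for this sketch.
houses6 <- data.frame(
  price       = c(221900, 538000, 180000, 604000, 510000, 1225000),
  sqft_living = c(1180, 2570, 770, 1960, 1680, 5420)
)

fit6 <- lm(price ~ sqft_living, data = houses6)
new  <- data.frame(sqft_living = 1400)

# Point estimate plus a 95% confidence interval for the mean price.
# The result is a matrix with columns fit, lwr and upr.
ci <- predict(fit6, newdata = new, interval = "confidence", level = 0.95)
ci

# The same point estimate by hand: intercept + slope * 1400
by_hand <- unname(coef(fit6)[1] + coef(fit6)[2] * 1400)
by_hand
```

Passing interval = "prediction" instead would give a range for a single new house rather than for the mean, and that range is wider.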
As for the RMSE in the question, the following is somewhat simpler.
rmse <- function(object){
e <- resid(object)
sqrt(mean(e^2, na.rm = TRUE))
}
rmse(model)
#[1] 80374.95
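To connect this with the question's code, the following sketch checks that the residual-based RMSE equals the "predictions minus observed prices" calculation on the training data. It again assumes only the six rows from the question's dput output. The helper rmse_new is a hypothetical addition for held-out data such as the question's test set, where the model object stores no residuals.

```r
# Six rows copied from the question's dput(head(houses)) output.
# `houses6`, `fit6` and `rmse_new` are hypothetical names for this sketch.
houses6 <- data.frame(
  price       = c(221900, 538000, 180000, 604000, 510000, 1225000),
  sqft_living = c(1180, 2570, 770, 1960, 1680, 5420)
)
fit6 <- lm(price ~ sqft_living, data = houses6)

# RMSE from the stored residuals (the approach in this answer)
rmse_resid <- sqrt(mean(resid(fit6)^2))

# RMSE from predictions minus observed prices (the approach in the question)
rmse_manual <- sqrt(mean((predict(fit6) - houses6$price)^2))

# Both routes give the same number on the training data
all.equal(rmse_resid, rmse_manual)

# For new data, compute the errors explicitly from the predictions
rmse_new <- function(object, newdata, response) {
  e <- newdata[[response]] - predict(object, newdata = newdata)
  sqrt(mean(e^2, na.rm = TRUE))
}
rmse_new(fit6, houses6, "price")
```

Note that this RMSE divides by n, whereas the "Residual standard error" reported by summary divides by the residual degrees of freedom, so the two values differ slightly.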
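As for the follow-up question in the comments,
"According to model1, if a homeowner adds 200 square feet to the house, how much is the price expected to increase?"
the answer is simple: the coefficient of the model's sqft_living term is the expected change in price, on average, for a one-unit increase in the regressor.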
coef(model)
#(Intercept) sqft_living
# 50960.6653 218.9349
coef(model)[2] * 200
#sqft_living
# 43786.98
The same result can also be obtained by computing the predicted prices for two values of sqft_living that are 200 units apart.
new2 <- data.frame(sqft_living = c(1400, 1400 + 200))
ypred <- predict(model, newdata = new2)
diff(ypred)
# 2
#43786.98
This is the same value as above.
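Since the slope is itself an estimate, the expected increase for 200 extra square feet could also be reported with a confidence interval. This is a minimal sketch, not part of the original answer, assuming only the six rows from the question's dput output (houses6 and fit6 are made-up names). It scales the interval returned by confint for the sqft_living coefficient by 200. Because the model is linear, the expected increase is the same whatever the starting size of the house.

```r
# Six rows copied from the question's dput(head(houses)) output.
# `houses6` and `fit6` are hypothetical names used only for this sketch.
houses6 <- data.frame(
  price       = c(221900, 538000, 180000, 604000, 510000, 1225000),
  sqft_living = c(1180, 2570, 770, 1960, 1680, 5420)
)
fit6 <- lm(price ~ sqft_living, data = houses6)

# Expected price change for 200 extra square feet
increase <- unname(coef(fit6)["sqft_living"] * 200)
increase

# 95% confidence interval for the slope, scaled to 200 square feet
ci_200 <- confint(fit6, "sqft_living", level = 0.95) * 200
ci_200
```

With only six observations the interval is wide; a fit on the full data set would be expected to give a much tighter one.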