I am currently working on a dataset that looks like this
Streamflow | x1 | x2 | x3 | x4 |
---|---|---|---|---|
145 | 24 | 25 | 43 | 44 |
198 | 30 | 45 | 66 | 54 |
285 | 32 | 39 | 54 | 47 |
... | ... | ... | ... | ... |
all the way down, 4,408 rows in total. What I want to do is run a linear regression of streamflow ~ x1 + x2 + x3 + x4
on rows 1 to 20, then rows 2 to 21, then rows 3 to 22, and so on to the end, so that I get a set of coefficients from every run. I know that I probably need a for loop,
but I just cannot figure out how to make it work on every 20-row window. Any suggestion would be really appreciated. Thank you in advance.
This is what I wrote, but it doesn't give me the results I want:
for (i in 1:nrow(CFbasin)) {
  y  <- CFbasin[i:(i+20), 2]
  x1 <- CFbasin[i:(i+20), 3]
  x2 <- CFbasin[i:(i+20), 4]
  mod_coef[i] <- coef(lm(y ~ x1 + x2))
}
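For what it's worth, the loop above has two problems: i:(i+20) selects 21 rows and runs past the end of the data, and mod_coef[i] can hold only a single number, not a whole vector of coefficients. A minimal corrected sketch (on made-up data standing in for CFbasin, and assuming the columns are named as in the table above) could look like:

```r
# Fake data standing in for CFbasin (an assumption; substitute your real data).
set.seed(1)
CFbasin <- data.frame(streamflow = rnorm(100, 200, 50),
                      x1 = rnorm(100), x2 = rnorm(100),
                      x3 = rnorm(100), x4 = rnorm(100))

win <- 20
starts <- 1:(nrow(CFbasin) - win + 1)   # start row of each full 20-row window

# One row of coefficients per window.
mod_coef <- matrix(NA_real_, nrow = length(starts), ncol = 5)
colnames(mod_coef) <- c("(Intercept)", "x1", "x2", "x3", "x4")

for (i in starts) {
  fit <- lm(streamflow ~ x1 + x2 + x3 + x4,
            data = CFbasin[i:(i + win - 1), ])  # rows i .. i+19
  mod_coef[i, ] <- coef(fit)
}
head(mod_coef)
```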
You could do something like this, if I understand correctly what you mean: mutate(group20rows = as.integer(gl(n(), 20, n())), .before = 1) is only preparation of the grouping on the fake data; group_split() then gives you these groups of 20 rows each; map_dfr() iterates your regression over each group; and glance() from the broom package shows everything in one data frame in a tidy form.
library(tidyverse)
library(broom)
colnames1 <- c("streamflow", "x1", "x2", "x3", "x4")
iris %>%
  select(-5) %>%
  mutate(Sepal.Length1 = Sepal.Length) %>%
  rename_with(~colnames1) %>%
  mutate(streamflow = streamflow * 100) %>%
  mutate(group20rows = as.integer(gl(n(), 20, n())), .before = 1) %>%
  mutate(group20rows = as_factor(group20rows)) %>%
  group_by(group20rows) %>%
  group_split() %>%
  map_dfr(.f = function(df){
    lm(streamflow ~ x1 + x2 + x3, data = df) %>%
      glance() %>%
      add_column(group20rows = unique(df$group20rows), .before = 1)
  })
output:
group20rows r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1 0.808 0.772 20.4 22.4 0.00000567 3 -86.5 183. 188. 6662. 16 20
2 2 0.335 0.210 26.2 2.69 0.0815 3 -91.4 193. 198. 10961. 16 20
3 3 0.873 0.850 32.0 36.8 0.000000207 3 -95.5 201. 206. 16420. 16 20
4 4 0.511 0.419 34.4 5.57 0.00819 3 -96.9 204. 209. 18919. 16 20
5 5 0.530 0.442 30.0 6.01 0.00609 3 -94.2 198. 203. 14417. 16 20
6 6 0.877 0.854 28.0 38.0 0.000000165 3 -92.8 196. 200. 12502. 16 20
7 7 0.787 0.747 32.5 19.7 0.0000128 3 -95.8 202. 207. 16895. 16 20
8 8 0.644 0.466 28.0 3.62 0.0845 3 -45.0 99.9 101. 4719. 6 10
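One caveat: gl(n(), 20, n()) splits the data into consecutive, non-overlapping blocks of 20, whereas the question asks for windows that advance one row at a time. A sketch of an overlapping-window variant, mapping over the window start rows of the same iris-based fake data (here with tidy() instead of glance(), since the question asked for the coefficients), might be:

```r
library(tidyverse)
library(broom)

# Fake data as in the answer above: iris with Species dropped and columns
# renamed to mimic the streamflow data (an approximation, not the real data).
df <- iris %>%
  select(-Species) %>%
  rename(streamflow = Sepal.Length, x1 = Sepal.Width,
         x2 = Petal.Length, x3 = Petal.Width) %>%
  mutate(streamflow = streamflow * 100)

win <- 20
starts <- 1:(nrow(df) - win + 1)   # 131 overlapping windows

rolling <- map_dfr(starts, function(s) {
  lm(streamflow ~ x1 + x2 + x3, data = df[s:(s + win - 1), ]) %>%
    tidy() %>%                      # one row per coefficient
    add_column(start_row = s, .before = 1)
})
head(rolling)
```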
Here is a base R approach, also using the iris data set:
data(iris)
str(iris)
# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Since there are 150 rows, we compute the number of rolling groups of 20 and use that to create a matrix with 20 rows and 131 columns listing the row numbers to be used in each regression:
rows <- nrow(iris)
last <- rows + 1 - 20
idx <- sapply(1:last, seq, length.out=20)
str(idx)
# num [1:20, 1:131] 1 2 3 4 5 6 7 8 9 10 ...
So we have 131 columns and each column identifies a group of 20 rows for a regression. Now compute the 131 regressions and save the coefficients:
results <- lapply(1:131, function(x) lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, iris[idx[, x], ]))
coeffs <- t(sapply(results, coef))
head(coeffs)
# (Intercept) Sepal.Width Petal.Length Petal.Width
# [1,] 0.88165253 1.1027541 0.4335847 -1.3039612
# [2,] 0.64094220 1.1111668 0.6186075 -1.4860753
# [3,] 0.28030724 1.2120241 0.6477881 -1.7022181
# [4,] -0.01943516 1.1879500 0.8971728 -1.6773764
# [5,] 0.46106345 0.9888293 0.9230228 -0.8457783
# [6,] 0.92206667 0.9734378 0.5716684 -0.5058189
Each regression is stored as an element of the list results, so the first regression is results[[1]]:
summary(results[[1]])
#
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
# data = iris[idx[, x], ])
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.26396 -0.17137 -0.00562 0.13582 0.36386
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.8817 0.6730 1.310 0.209
# Sepal.Width 1.1028 0.1748 6.309 1.04e-05 ***
# Petal.Length 0.4336 0.3448 1.257 0.227
# Petal.Width -1.3040 0.7924 -1.646 0.119
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.204 on 16 degrees of freedom
# Multiple R-squared: 0.8078, Adjusted R-squared: 0.7717
# F-statistic: 22.41 on 3 and 16 DF, p-value: 5.666e-06
Getting a statistic computed by summary is slightly more involved:
Rsq <- sapply(results, function(x) summary(x)$adj.r.squared)
quantile(Rsq)
#        0%       25%       50%       75%      100%
# 0.1635166 0.4471409 0.6298927 0.8417655 0.9278258
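The same pattern pulls out any column of the coefficient table across all windows, for example the per-coefficient p-values (the idx and results objects are rebuilt here so the snippet runs on its own):

```r
data(iris)

# Rebuild the 131 rolling regressions from the answer above.
idx <- sapply(1:(nrow(iris) - 20 + 1), seq, length.out = 20)
results <- lapply(seq_len(ncol(idx)), function(x)
  lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
     iris[idx[, x], ]))

# summary()$coefficients is a matrix (one row per term);
# take its p-value column for every model.
pvals <- t(sapply(results, function(m) summary(m)$coefficients[, "Pr(>|t|)"]))
dim(pvals)   # 131 windows x 4 terms
```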