
How to Loop linear regression repeatedly for n rows in R?

I am currently working on a dataset that looks like this:

Streamflow  x1  x2  x3  x4
       145  24  25  43  44
       198  30  45  66  54
       285  32  39  54  47
       ...  ..  ..  ..  ..

all the way down to 4408 rows in total. What I want to do is run a linear regression of streamflow ~ x1 + x2 + x3 + x4 on rows 1 to 20, then on rows 2 to 21, then on rows 3 to 22, and so on until the end, so that I get a set of coefficients for every run. I know that I probably need a for loop, but I just cannot figure out how to make it work on each 20-row window. Any suggestion would be really appreciated. Thank you in advance.

for(i in 1:nrow(CFbasin)) {
  y <- CFbasin[i:(i+20), 2]
  x1 <- CFbasin[i:(i+20), 3]
  x2 <- CFbasin[i:(i+20), 4]
  mod_coef[i] <- coef(lm(y ~ x1 + x2))
}

So this is what I wrote, and it doesn't give me the results I want.
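
A minimal sketch of what a working rolling-window loop could look like, assuming CFbasin uses the column names shown in the data preview (Streamflow and x1 to x4); the key fixes are taking rows i to i+19 rather than i to i+20, stopping once a full window no longer fits, and storing each coefficient vector as a row of a matrix:

win <- 20                                   # window length
n_win <- nrow(CFbasin) - win + 1            # number of complete 20-row windows
mod_coef <- matrix(NA_real_, nrow = n_win, ncol = 5,
                   dimnames = list(NULL, c("(Intercept)", "x1", "x2", "x3", "x4")))

for (i in seq_len(n_win)) {
  dat <- CFbasin[i:(i + win - 1), ]         # rows i .. i + 19, not i .. i + 20
  fit <- lm(Streamflow ~ x1 + x2 + x3 + x4, data = dat)
  mod_coef[i, ] <- coef(fit)                # store the whole coefficient vector
}

head(mod_coef)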

You could do something like this, if I understand correctly what you mean:

  1. Everything before the line mutate(group20rows = as.integer(gl(n(), 20, n())), .before=1) %>% is only preparation of fake data.
  2. Create a grouping column that labels each consecutive block of 20 rows (here with gl()).
  3. Group by that column and apply group_split(); you then have a list of data sets of 20 rows each.
  4. Apply map_dfr() to iterate the regression over each group.
  5. Use glance() from the broom package to collect everything in one data frame in a tidy form.
library(tidyverse)
library(broom)

colnames1 <- c("streamflow", "x1", "x2", "x3", "x4")

iris %>% 
  select(-5) %>% 
  mutate(Sepal.Length1 = Sepal.Length) %>% 
  rename_with(~colnames1) %>% 
  mutate(streamflow=streamflow*100) %>% 
  mutate(group20rows = as.integer(gl(n(), 20, n())), .before=1) %>% 
  mutate(group20rows = as_factor(group20rows)) %>% 
  group_by(group20rows) %>% 
  group_split() %>% 
  map_dfr(.f = function(df){
    lm(streamflow ~ x1+x2+x3, data = df) %>% 
      glance() %>% 
      add_column(group20rows = unique(df$group20rows), .before=1)
  })

output:

  group20rows r.squared adj.r.squared sigma statistic     p.value    df logLik   AIC   BIC deviance df.residual  nobs
  <fct>           <dbl>         <dbl> <dbl>     <dbl>       <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1 1               0.808         0.772  20.4     22.4  0.00000567      3  -86.5 183.   188.    6662.          16    20
2 2               0.335         0.210  26.2      2.69 0.0815          3  -91.4 193.   198.   10961.          16    20
3 3               0.873         0.850  32.0     36.8  0.000000207     3  -95.5 201.   206.   16420.          16    20
4 4               0.511         0.419  34.4      5.57 0.00819         3  -96.9 204.   209.   18919.          16    20
5 5               0.530         0.442  30.0      6.01 0.00609         3  -94.2 198.   203.   14417.          16    20
6 6               0.877         0.854  28.0     38.0  0.000000165     3  -92.8 196.   200.   12502.          16    20
7 7               0.787         0.747  32.5     19.7  0.0000128       3  -95.8 202.   207.   16895.          16    20
8 8               0.644         0.466  28.0      3.62 0.0845          3  -45.0  99.9  101.    4719.           6    10
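
glance() reports model-level statistics; since the question asks for the coefficients of every run, tidy() from broom can be swapped in for glance() in the same pipeline. A minimal sketch, reusing colnames1 and the fake-data preparation from above:

iris %>% 
  select(-5) %>% 
  mutate(Sepal.Length1 = Sepal.Length) %>% 
  rename_with(~colnames1) %>%                    # same fake-data preparation as above
  mutate(streamflow = streamflow * 100) %>% 
  mutate(group20rows = as.integer(gl(n(), 20, n())), .before = 1) %>% 
  group_by(group20rows) %>% 
  group_split() %>% 
  map_dfr(.f = function(df){
    lm(streamflow ~ x1 + x2 + x3, data = df) %>% 
      tidy() %>%                                 # term, estimate, std.error, statistic, p.value
      add_column(group20rows = unique(df$group20rows), .before = 1)
  })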

Here is a base R approach, also using the iris data set:

data(iris)
str(iris)
# 'data.frame': 150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Since there are 150 rows, we compute the number of rolling groups of 20 and use that to create a matrix with 20 rows and 131 columns listing the row numbers to be used in each regression:

rows <- nrow(iris)
last <- rows + 1 - 20
idx <- sapply(1:last, seq, length.out=20)
str(idx)
#  num [1:20, 1:131] 1 2 3 4 5 6 7 8 9 10 ...

So we have 131 columns and each column identifies a group of 20 rows for a regression. Now compute the 131 regressions and save the coefficients:

results <- lapply(1:131, function(x) lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, iris[idx[, x], ]))
coeffs <- t(sapply(results, coef))
head(coeffs)
#      (Intercept) Sepal.Width Petal.Length Petal.Width
# [1,]  0.88165253   1.1027541    0.4335847  -1.3039612
# [2,]  0.64094220   1.1111668    0.6186075  -1.4860753
# [3,]  0.28030724   1.2120241    0.6477881  -1.7022181
# [4,] -0.01943516   1.1879500    0.8971728  -1.6773764
# [5,]  0.46106345   0.9888293    0.9230228  -0.8457783
# [6,]  0.92206667   0.9734378    0.5716684  -0.5058189

Each regression is stored as an element of the list results, so the first regression is results[[1]].

summary(results[[1]])
# 
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
#     data = iris[idx[, x], ])
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -0.26396 -0.17137 -0.00562  0.13582  0.36386 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    0.8817     0.6730   1.310    0.209    
# Sepal.Width    1.1028     0.1748   6.309 1.04e-05 ***
# Petal.Length   0.4336     0.3448   1.257    0.227    
# Petal.Width   -1.3040     0.7924  -1.646    0.119    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.204 on 16 degrees of freedom
# Multiple R-squared:  0.8078,  Adjusted R-squared:  0.7717 
# F-statistic: 22.41 on 3 and 16 DF,  p-value: 5.666e-06

Getting a statistic computed by summary is slightly more involved:

Rsq <- sapply(results, function(x) summary(x)$adj.r.squared)
quantile(Rsq)
#        0%       25%       50%       75%      100% 
# 0.1635166 0.4471409 0.6298927 0.8417655 0.9278258 
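
The same idea should carry over directly to the original data; a short sketch, assuming CFbasin uses the column names shown in the question (Streamflow and x1 to x4):

win <- 20
last <- nrow(CFbasin) + 1 - win
idx <- sapply(1:last, seq, length.out = win)     # one column of row numbers per window

fits <- lapply(1:last, function(j)
  lm(Streamflow ~ x1 + x2 + x3 + x4, data = CFbasin[idx[, j], ]))
coeffs <- t(sapply(fits, coef))                  # one row of coefficients per window
head(coeffs)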
