
How to Loop linear regression repeatedly for n rows in R?

I am currently working on a dataset that looks like this:

Streamflow  x1  x2  x3  x4
       145  24  25  43  44
       198  30  45  66  54
       285  32  39  54  47
       ...  ..  ..  ..  ..

all the way down to 4408 rows in total. What I want to do is run a linear regression of streamflow ~ x1 + x2 + x3 + x4 on rows 1 to 20, then on rows 2 to 21, then on rows 3 to 22, and so on until the end, so that I get a set of coefficients for every run. I know that I probably need a for loop, but I just cannot figure out how to make it work on each 20-row window. Any suggestion would be really appreciated. Thank you in advance.

for(i in 1:nrow(CFbasin)) {
  y <- CFbasin[i:(i+20), 2]
  x1 <- CFbasin[i:(i+20), 3]
  x2 <- CFbasin[i:(i+20), 4]
  mod_coef[i] <- coef(lm(y ~ x1 + x2))
}

So this is what I wrote, and it doesn't give me the results I want.
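
A minimal sketch of what a working rolling-window loop could look like, assuming CFbasin uses the column names shown in the data preview (Streamflow and x1 to x4); the key fixes are taking rows i to i+19 rather than i to i+20, stopping once a full window no longer fits, and storing each coefficient vector as a row of a matrix:

win <- 20                                   # window length
n_win <- nrow(CFbasin) - win + 1            # number of complete 20-row windows
mod_coef <- matrix(NA_real_, nrow = n_win, ncol = 5,
                   dimnames = list(NULL, c("(Intercept)", "x1", "x2", "x3", "x4")))

for (i in seq_len(n_win)) {
  dat <- CFbasin[i:(i + win - 1), ]         # rows i .. i + 19, not i .. i + 20
  fit <- lm(Streamflow ~ x1 + x2 + x3 + x4, data = dat)
  mod_coef[i, ] <- coef(fit)                # store the whole coefficient vector
}

head(mod_coef)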

You could do something like this, if I understand correctly what you mean:

  1. Everything before the line mutate(group20rows = as.integer(gl(n(), 20, n())), .before=1) %>% is only preparation of fake data.
  2. Create a grouping column that labels each consecutive block of 20 rows (here with gl()).
  3. Group by that column and apply group_split(); you then have a list of data sets of 20 rows each.
  4. Apply map_dfr() to iterate the regression over each group.
  5. Use glance() from the broom package to collect everything in one data frame in a tidy form.
library(tidyverse)
library(broom)

colnames1 <- c("streamflow", "x1", "x2", "x3", "x4")

iris %>% 
  select(-5) %>% 
  mutate(Sepal.Length1 = Sepal.Length) %>% 
  rename_with(~colnames1) %>% 
  mutate(streamflow=streamflow*100) %>% 
  mutate(group20rows = as.integer(gl(n(), 20, n())), .before=1) %>% 
  mutate(group20rows = as_factor(group20rows)) %>% 
  group_by(group20rows) %>% 
  group_split() %>% 
  map_dfr(.f = function(df){
    lm(streamflow ~ x1+x2+x3, data = df) %>% 
      glance() %>% 
      add_column(group20rows = unique(df$group20rows), .before=1)
  })

output:

  group20rows r.squared adj.r.squared sigma statistic     p.value    df logLik   AIC   BIC deviance df.residual  nobs
  <fct>           <dbl>         <dbl> <dbl>     <dbl>       <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1 1               0.808         0.772  20.4     22.4  0.00000567      3  -86.5 183.   188.    6662.          16    20
2 2               0.335         0.210  26.2      2.69 0.0815          3  -91.4 193.   198.   10961.          16    20
3 3               0.873         0.850  32.0     36.8  0.000000207     3  -95.5 201.   206.   16420.          16    20
4 4               0.511         0.419  34.4      5.57 0.00819         3  -96.9 204.   209.   18919.          16    20
5 5               0.530         0.442  30.0      6.01 0.00609         3  -94.2 198.   203.   14417.          16    20
6 6               0.877         0.854  28.0     38.0  0.000000165     3  -92.8 196.   200.   12502.          16    20
7 7               0.787         0.747  32.5     19.7  0.0000128       3  -95.8 202.   207.   16895.          16    20
8 8               0.644         0.466  28.0      3.62 0.0845          3  -45.0  99.9  101.    4719.           6    10
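
glance() reports model-level statistics; since the question asks for the coefficients of every run, tidy() from broom can be swapped in for glance() in the same pipeline. A minimal sketch, reusing colnames1 and the fake-data preparation from above:

iris %>% 
  select(-5) %>% 
  mutate(Sepal.Length1 = Sepal.Length) %>% 
  rename_with(~colnames1) %>%                    # same fake-data preparation as above
  mutate(streamflow = streamflow * 100) %>% 
  mutate(group20rows = as.integer(gl(n(), 20, n())), .before = 1) %>% 
  group_by(group20rows) %>% 
  group_split() %>% 
  map_dfr(.f = function(df){
    lm(streamflow ~ x1 + x2 + x3, data = df) %>% 
      tidy() %>%                                 # term, estimate, std.error, statistic, p.value
      add_column(group20rows = unique(df$group20rows), .before = 1)
  })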

Here is a base R approach, also using the iris data set:

data(iris)
str(iris)
# 'data.frame': 150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Since there are 150 rows, we compute the number of rolling groups of 20 and use that to create a matrix with 20 rows and 131 columns listing the row numbers to be used in each regression:

rows <- nrow(iris)
last <- rows + 1 - 20
idx <- sapply(1:last, seq, length.out=20)
str(idx)
#  num [1:20, 1:131] 1 2 3 4 5 6 7 8 9 10 ...

So we have 131 columns and each column identifies a group of 20 rows for a regression. Now compute the 131 regressions and save the coefficients:

results <- lapply(1:131, function(x) lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, iris[idx[, x], ]))
coeffs <- t(sapply(results, coef))
head(coeffs)
#      (Intercept) Sepal.Width Petal.Length Petal.Width
# [1,]  0.88165253   1.1027541    0.4335847  -1.3039612
# [2,]  0.64094220   1.1111668    0.6186075  -1.4860753
# [3,]  0.28030724   1.2120241    0.6477881  -1.7022181
# [4,] -0.01943516   1.1879500    0.8971728  -1.6773764
# [5,]  0.46106345   0.9888293    0.9230228  -0.8457783
# [6,]  0.92206667   0.9734378    0.5716684  -0.5058189

Each regression is stored as an element of the list results, so the first regression is results[[1]].

summary(results[[1]])
# 
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
#     data = iris[idx[, x], ])
# 
# Residuals:
#      Min       1Q   Median       3Q      Max 
# -0.26396 -0.17137 -0.00562  0.13582  0.36386 
# 
# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)    
# (Intercept)    0.8817     0.6730   1.310    0.209    
# Sepal.Width    1.1028     0.1748   6.309 1.04e-05 ***
# Petal.Length   0.4336     0.3448   1.257    0.227    
# Petal.Width   -1.3040     0.7924  -1.646    0.119    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# Residual standard error: 0.204 on 16 degrees of freedom
# Multiple R-squared:  0.8078,  Adjusted R-squared:  0.7717 
# F-statistic: 22.41 on 3 and 16 DF,  p-value: 5.666e-06

Getting a statistic computed by summary is slightly more involved:

Rsq <- sapply(results, function(x) summary(x)$adj.r.squared)
quantile(Rsq)
#        0%       25%       50%       75%      100% 
# 0.1635166 0.4471409 0.6298927 0.8417655 0.9278258 
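
The same idea should carry over directly to the original data; a short sketch, assuming CFbasin uses the column names shown in the question (Streamflow and x1 to x4):

win <- 20
last <- nrow(CFbasin) + 1 - win
idx <- sapply(1:last, seq, length.out = win)     # one column of row numbers per window

fits <- lapply(1:last, function(j)
  lm(Streamflow ~ x1 + x2 + x3 + x4, data = CFbasin[idx[, j], ]))
coeffs <- t(sapply(fits, coef))                  # one row of coefficients per window
head(coeffs)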
