简体   繁体   English

R中的线性回归和分组依据

[英]Linear Regression and group by in R

I want to do a linear regression in R using the lm() function.我想使用lm()函数在 R 中进行线性回归。 My data is an annual time series with one field for year (22 years) and another for state (50 states).我的数据是一个年度时间序列,其中一个字段为一年(22 年),另一个字段为州(50 个州)。 I want to fit a regression for each state so that at the end I have a vector of lm responses.我想为每个状态拟合一个回归,以便最后我有一个 lm 响应向量。 I can imagine doing for loop for each state then doing the regression inside the loop and adding the results of each regression to a vector.我可以想象为每个状态执行 for 循环,然后在循环内执行回归并将每个回归的结果添加到向量中。 That does not seem very R-like, however.然而,这似乎不太像 R。 In SAS I would do a 'by' statement and in SQL I would do a 'group by'.在 SAS 中我会做一个“by”语句,而在 SQL 中我会做一个“group by”。 What's the R way of doing this?这样做的 R 方式是什么?

Here's an approach using the plyr package:这是使用plyr包的方法:

d <- data.frame(
  state = rep(c('NY', 'CA'), 10),
  year = rep(1:10, 2),
  response= rnorm(20)
)

library(plyr)
# Break up d by state, then fit the specified model to each piece and
# return a list
models <- dlply(d, "state", function(df) 
  lm(response ~ year, data = df))

# Apply coef to each model and return a data frame
ldply(models, coef)

# Print the summary of each model
l_ply(models, summary, .print = TRUE)

Since 2009, dplyr has been released which actually provides a very nice way to do this kind of grouping, closely resembling what SAS does.自 2009 年以来, dplyr已经发布,它实际上提供了一种非常好的方法来进行这种分组,与 SAS 的做法非常相似。

library(dplyr)

d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                year=rep(1:10, 2),
                response=c(rnorm(10), rnorm(10)))
fitted_models = d %>% group_by(state) %>% do(model = lm(response ~ year, data = .))
# Source: local data frame [2 x 2]
# Groups: <by row>
#
#    state   model
#   (fctr)   (chr)
# 1     CA <S3:lm>
# 2     NY <S3:lm>
fitted_models$model
# [[1]]
# 
# Call:
# lm(formula = response ~ year, data = .)
# 
# Coefficients:
# (Intercept)         year  
#    -0.06354      0.02677  
#
#
# [[2]]
# 
# Call:
# lm(formula = response ~ year, data = .)
# 
# Coefficients:
# (Intercept)         year  
#    -0.35136      0.09385  

To retrieve the coefficients and Rsquared/p.value, one can use the broom package.要检索系数和 Rsquared/p.value,可以使用broom包。 This package provides:该软件包提供:

three S3 generics: tidy, which summarizes a model's statistical findings such as coefficients of a regression;三个 S3 泛型:tidy,它总结了模型的统计结果,例如回归系数; augment, which adds columns to the original data such as predictions, residuals and cluster assignments;增加,将列添加到原始数据中,例如预测、残差和聚类分配; and glance, which provides a one-row summary of model-level statistics.和一目了然,它提供了模型级统计信息的一行摘要。

library(broom)
fitted_models %>% tidy(model)
# Source: local data frame [4 x 6]
# Groups: state [2]
# 
#    state        term    estimate  std.error  statistic   p.value
#   (fctr)       (chr)       (dbl)      (dbl)      (dbl)     (dbl)
# 1     CA (Intercept) -0.06354035 0.83863054 -0.0757668 0.9414651
# 2     CA        year  0.02677048 0.13515755  0.1980687 0.8479318
# 3     NY (Intercept) -0.35135766 0.60100314 -0.5846187 0.5749166
# 4     NY        year  0.09385309 0.09686043  0.9689519 0.3609470
fitted_models %>% glance(model)
# Source: local data frame [2 x 12]
# Groups: state [2]
# 
#    state   r.squared adj.r.squared     sigma statistic   p.value    df
#   (fctr)       (dbl)         (dbl)     (dbl)     (dbl)     (dbl) (int)
# 1     CA 0.004879969  -0.119510035 1.2276294 0.0392312 0.8479318     2
# 2     NY 0.105032068  -0.006838924 0.8797785 0.9388678 0.3609470     2
# Variables not shown: logLik (dbl), AIC (dbl), BIC (dbl), deviance (dbl),
#   df.residual (int)
fitted_models %>% augment(model)
# Source: local data frame [20 x 10]
# Groups: state [2]
# 
#     state   response  year      .fitted   .se.fit     .resid      .hat
#    (fctr)      (dbl) (int)        (dbl)     (dbl)      (dbl)     (dbl)
# 1      CA  0.4547765     1 -0.036769875 0.7215439  0.4915464 0.3454545
# 2      CA  0.1217003     2 -0.009999399 0.6119518  0.1316997 0.2484848
# 3      CA -0.6153836     3  0.016771076 0.5146646 -0.6321546 0.1757576
# 4      CA -0.9978060     4  0.043541551 0.4379605 -1.0413476 0.1272727
# 5      CA  2.1385614     5  0.070312027 0.3940486  2.0682494 0.1030303
# 6      CA -0.3924598     6  0.097082502 0.3940486 -0.4895423 0.1030303
# 7      CA -0.5918738     7  0.123852977 0.4379605 -0.7157268 0.1272727
# 8      CA  0.4671346     8  0.150623453 0.5146646  0.3165112 0.1757576
# 9      CA -1.4958726     9  0.177393928 0.6119518 -1.6732666 0.2484848
# 10     CA  1.7481956    10  0.204164404 0.7215439  1.5440312 0.3454545
# 11     NY -0.6285230     1 -0.257504572 0.5170932 -0.3710185 0.3454545
# 12     NY  1.0566099     2 -0.163651479 0.4385542  1.2202614 0.2484848
# 13     NY -0.5274693     3 -0.069798386 0.3688335 -0.4576709 0.1757576
# 14     NY  0.6097983     4  0.024054706 0.3138637  0.5857436 0.1272727
# 15     NY -1.5511940     5  0.117907799 0.2823942 -1.6691018 0.1030303
# 16     NY  0.7440243     6  0.211760892 0.2823942  0.5322634 0.1030303
# 17     NY  0.1054719     7  0.305613984 0.3138637 -0.2001421 0.1272727
# 18     NY  0.7513057     8  0.399467077 0.3688335  0.3518387 0.1757576
# 19     NY -0.1271655     9  0.493320170 0.4385542 -0.6204857 0.2484848
# 20     NY  1.2154852    10  0.587173262 0.5170932  0.6283119 0.3454545
# Variables not shown: .sigma (dbl), .cooksd (dbl), .std.resid (dbl)

Here's one way using the lme4 package.这是使用lme4包的一种方法。

 library(lme4)
 d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                 year=rep(1:10, 2),
                 response=c(rnorm(10), rnorm(10)))

 xyplot(response ~ year, groups=state, data=d, type='l')

 fits <- lmList(response ~ year | state, data=d)
 fits
#------------
Call: lmList(formula = response ~ year | state, data = d)
Coefficients:
   (Intercept)        year
CA -1.34420990  0.17139963
NY  0.00196176 -0.01852429

Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316

In my opinion is a mixed linear model a better approach for this kind of data.在我看来,混合线性模型是处理此类数据的更好方法。 The code below given in the fixed effect the overall trend.下面的代码中给出了固定效应的整体趋势。 The random effects indicate how the trend for each individual state differ from the global trend.随机效应表明每个状态的趋势与全球趋势有何不同。 The correlation structure takes the temporal autocorrelation into account.相关结构考虑了时间自相关。 Have a look at Pinheiro & Bates (Mixed Effects Models in S and S-Plus).看看 Pinheiro & Bates(S 和 S-Plus 中的混合效应模型)。

library(nlme)
lme(response ~ year, random = ~year|state, correlation = corAR1(~year))

A nice solution using data.table was posted here in CrossValidated by @Zach. data.table在 CrossValidated 中发布一个使用data.table的不错的解决方案。 I'd just add that it is possible to obtain iteratively also the regression coefficient r^2:我只想补充一点,也可以迭代地获得回归系数 r^2:

## make fake data
    library(data.table)
    set.seed(1)
    dat <- data.table(x=runif(100), y=runif(100), grp=rep(1:2,50))

##calculate the regression coefficient r^2
    dat[,summary(lm(y~x))$r.squared,by=grp]
       grp         V1
    1:   1 0.01465726
    2:   2 0.02256595

as well as all the other output from summary(lm) :以及summary(lm)所有其他输出:

dat[,list(r2=summary(lm(y~x))$r.squared , f=summary(lm(y~x))$fstatistic[1] ),by=grp]
   grp         r2        f
1:   1 0.01465726 0.714014
2:   2 0.02256595 1.108173

I think it's worthwhile to add the purrr::map approach to this problem.我认为为这个问题添加purrr::map方法是值得的。

library(tidyverse)

d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
                                 year=rep(1:10, 2),
                                 response=c(rnorm(10), rnorm(10)))

d %>% 
  group_by(state) %>% 
  nest() %>% 
  mutate(model = map(data, ~lm(response ~ year, data = .)))

See @Paul Hiemstra's answer for further ideas on using the broom package with these results.有关使用带有这些结果的broom包的更多想法,请参阅@Paul Hiemstra 的回答。

I now my answer comes a bit late, but I was looking for a similar functionality.我现在的答案来得有点晚,但我一直在寻找类似的功能。 It would seem the built-in function 'by' in R can also do the grouping easily: R 中的内置函数 'by' 似乎也可以轻松地进行分组:

?by contains the following example, which fits per group and extracts the coefficients with sapply: ?by 包含以下示例,它适合每个组并使用 sapply 提取系数:

require(stats)
## now suppose we want to extract the coefficients by group 
tmp <- with(warpbreaks,
            by(warpbreaks, tension,
               function(x) lm(breaks ~ wool, data = x)))
sapply(tmp, coef)
## make fake data
 ngroups <- 2
 group <- 1:ngroups
 nobs <- 100
 dta <- data.frame(group=rep(group,each=nobs),y=rnorm(nobs*ngroups),x=runif(nobs*ngroups))
 head(dta)
#--------------------
  group          y         x
1     1  0.6482007 0.5429575
2     1 -0.4637118 0.7052843
3     1 -0.5129840 0.7312955
4     1 -0.6612649 0.9028034
5     1 -0.5197448 0.1661308
6     1  0.4240346 0.8944253
#------------ 
## function to extract the results of one model
 foo <- function(z) {
   ## coef and se in a data frame
   mr <- data.frame(coef(summary(lm(y~x,data=z))))
   ## put row names (predictors/indep variables)
   mr$predictor <- rownames(mr)
   mr
 }
 ## see that it works
 foo(subset(dta,group==1))
#=========
              Estimate Std..Error   t.value  Pr...t..   predictor
(Intercept)  0.2176477  0.1919140  1.134090 0.2595235 (Intercept)
x           -0.3669890  0.3321875 -1.104765 0.2719666           x
#----------
## one option: use command by
 res <- by(dta,dta$group,foo)
 res
#=========
dta$group: 1
              Estimate Std..Error   t.value  Pr...t..   predictor
(Intercept)  0.2176477  0.1919140  1.134090 0.2595235 (Intercept)
x           -0.3669890  0.3321875 -1.104765 0.2719666           x
------------------------------------------------------------ 
dta$group: 2
               Estimate Std..Error    t.value  Pr...t..   predictor
(Intercept) -0.04039422  0.1682335 -0.2401081 0.8107480 (Intercept)
x            0.06286456  0.3020321  0.2081387 0.8355526           x

## using package plyr is better
 library(plyr)
 res <- ddply(dta,"group",foo)
 res
#----------
  group    Estimate Std..Error    t.value  Pr...t..   predictor
1     1  0.21764767  0.1919140  1.1340897 0.2595235 (Intercept)
2     1 -0.36698898  0.3321875 -1.1047647 0.2719666           x
3     2 -0.04039422  0.1682335 -0.2401081 0.8107480 (Intercept)
4     2  0.06286456  0.3020321  0.2081387 0.8355526           x

The lm() function above is an simple example.上面的lm()函数是一个简单的例子。 By the way, I imagine that your database has the columns as in the following form:顺便说一下,我想象您的数据库具有以下形式的列:

year state var1 var2 y...年份状态 var1 var2 y...

In my point of view, you can to use the following code:在我看来,您可以使用以下代码:

require(base) 
library(base) 
attach(data) # data = your data base
             #state is your label for the states column
modell<-by(data, data$state, function(data) lm(y~I(1/var1)+I(1/var2)))
summary(modell)

The question seems to be about how to call regression functions with formulas which are modified inside a loop.问题似乎是关于如何使用在循环内修改的公式调用回归函数。

Here is how you can do it in (using diamonds dataset):以下是您可以这样做的方法(使用钻石数据集):

attach(ggplot2::diamonds)
strCols = names(ggplot2::diamonds)

formula <- list(); model <- list()
for (i in 1:1) {
  formula[[i]] = paste0(strCols[7], " ~ ", strCols[7+i])
  model[[i]] = glm(formula[[i]]) 

  #then you can plot the results or anything else ...
  png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7+i]))
  par(mfrow = c(2, 2))      
  plot(model[[i]])
  dev.off()
  }

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM