Linear Regression and group by in R
I want to do a linear regression in R using the lm() function. My data is an annual time series with one field for year (22 years) and another for state (50 states). I want to fit a regression for each state so that at the end I have a vector of lm responses. I can imagine doing a for loop over the states, running the regression inside the loop and adding the results of each regression to a vector. That does not seem very R-like, however. In SAS I would use a 'by' statement and in SQL I would use a 'group by'. What's the R way of doing this?
Here's an approach using the plyr package:
d <- data.frame(
  state    = rep(c('NY', 'CA'), each = 10),  # each = 10 so every state gets years 1 to 10
  year     = rep(1:10, 2),
  response = rnorm(20)
)

library(plyr)

# Break up d by state, then fit the specified model to each piece and
# return a list
models <- dlply(d, "state", function(df)
  lm(response ~ year, data = df))

# Apply coef to each model and return a data frame
ldply(models, coef)

# Print the summary of each model
l_ply(models, summary, .print = TRUE)
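For comparison, the same split-apply-combine pattern can be sketched in base R without any extra packages. This is only an illustration, assuming data shaped like d above (columns state, year, response); the set.seed() call and the name coefs are my additions for reproducibility and readability.

```r
# Simulated data in the same shape as above (assumed layout: state, year, response)
set.seed(42)
d <- data.frame(
  state    = rep(c("NY", "CA"), each = 10),
  year     = rep(1:10, 2),
  response = rnorm(20)
)

# split() breaks d into one data frame per state; lapply() fits a model to each piece
models <- lapply(split(d, d$state), function(df) lm(response ~ year, data = df))

# Collect the coefficients into a matrix with one row per state
coefs <- t(sapply(models, coef))
coefs
```

The result is a named list of lm objects, so anything that works on a single model (summary, predict, residuals) can be applied to each element with lapply().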
Since 2009, dplyr has been released, which provides a very nice way to do this kind of grouping, closely resembling what SAS does.
library(dplyr)
d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
year=rep(1:10, 2),
response=c(rnorm(10), rnorm(10)))
fitted_models = d %>% group_by(state) %>% do(model = lm(response ~ year, data = .))
# Source: local data frame [2 x 2]
# Groups: <by row>
#
# state model
# (fctr) (chr)
# 1 CA <S3:lm>
# 2 NY <S3:lm>
fitted_models$model
# [[1]]
#
# Call:
# lm(formula = response ~ year, data = .)
#
# Coefficients:
# (Intercept) year
# -0.06354 0.02677
#
#
# [[2]]
#
# Call:
# lm(formula = response ~ year, data = .)
#
# Coefficients:
# (Intercept) year
# -0.35136 0.09385
To retrieve the coefficients and R-squared/p-value, one can use the broom package. This package provides:
three S3 generics: tidy, which summarizes a model's statistical findings such as coefficients of a regression; augment, which adds columns to the original data such as predictions, residuals and cluster assignments; and glance, which provides a one-row summary of model-level statistics.
library(broom)
fitted_models %>% tidy(model)
# Source: local data frame [4 x 6]
# Groups: state [2]
#
# state term estimate std.error statistic p.value
# (fctr) (chr) (dbl) (dbl) (dbl) (dbl)
# 1 CA (Intercept) -0.06354035 0.83863054 -0.0757668 0.9414651
# 2 CA year 0.02677048 0.13515755 0.1980687 0.8479318
# 3 NY (Intercept) -0.35135766 0.60100314 -0.5846187 0.5749166
# 4 NY year 0.09385309 0.09686043 0.9689519 0.3609470
fitted_models %>% glance(model)
# Source: local data frame [2 x 12]
# Groups: state [2]
#
# state r.squared adj.r.squared sigma statistic p.value df
# (fctr) (dbl) (dbl) (dbl) (dbl) (dbl) (int)
# 1 CA 0.004879969 -0.119510035 1.2276294 0.0392312 0.8479318 2
# 2 NY 0.105032068 -0.006838924 0.8797785 0.9388678 0.3609470 2
# Variables not shown: logLik (dbl), AIC (dbl), BIC (dbl), deviance (dbl),
# df.residual (int)
fitted_models %>% augment(model)
# Source: local data frame [20 x 10]
# Groups: state [2]
#
# state response year .fitted .se.fit .resid .hat
# (fctr) (dbl) (int) (dbl) (dbl) (dbl) (dbl)
# 1 CA 0.4547765 1 -0.036769875 0.7215439 0.4915464 0.3454545
# 2 CA 0.1217003 2 -0.009999399 0.6119518 0.1316997 0.2484848
# 3 CA -0.6153836 3 0.016771076 0.5146646 -0.6321546 0.1757576
# 4 CA -0.9978060 4 0.043541551 0.4379605 -1.0413476 0.1272727
# 5 CA 2.1385614 5 0.070312027 0.3940486 2.0682494 0.1030303
# 6 CA -0.3924598 6 0.097082502 0.3940486 -0.4895423 0.1030303
# 7 CA -0.5918738 7 0.123852977 0.4379605 -0.7157268 0.1272727
# 8 CA 0.4671346 8 0.150623453 0.5146646 0.3165112 0.1757576
# 9 CA -1.4958726 9 0.177393928 0.6119518 -1.6732666 0.2484848
# 10 CA 1.7481956 10 0.204164404 0.7215439 1.5440312 0.3454545
# 11 NY -0.6285230 1 -0.257504572 0.5170932 -0.3710185 0.3454545
# 12 NY 1.0566099 2 -0.163651479 0.4385542 1.2202614 0.2484848
# 13 NY -0.5274693 3 -0.069798386 0.3688335 -0.4576709 0.1757576
# 14 NY 0.6097983 4 0.024054706 0.3138637 0.5857436 0.1272727
# 15 NY -1.5511940 5 0.117907799 0.2823942 -1.6691018 0.1030303
# 16 NY 0.7440243 6 0.211760892 0.2823942 0.5322634 0.1030303
# 17 NY 0.1054719 7 0.305613984 0.3138637 -0.2001421 0.1272727
# 18 NY 0.7513057 8 0.399467077 0.3688335 0.3518387 0.1757576
# 19 NY -0.1271655 9 0.493320170 0.4385542 -0.6204857 0.2484848
# 20 NY 1.2154852 10 0.587173262 0.5170932 0.6283119 0.3454545
# Variables not shown: .sigma (dbl), .cooksd (dbl), .std.resid (dbl)
Here's one way using the lme4 package.
library(lme4)
library(lattice)  # provides xyplot()
d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
year=rep(1:10, 2),
response=c(rnorm(10), rnorm(10)))
xyplot(response ~ year, groups=state, data=d, type='l')
fits <- lmList(response ~ year | state, data=d)
fits
#------------
Call: lmList(formula = response ~ year | state, data = d)
Coefficients:
(Intercept) year
CA -1.34420990 0.17139963
NY 0.00196176 -0.01852429
Degrees of freedom: 20 total; 16 residual
Residual standard error: 0.8201316
In my opinion, a mixed linear model is a better approach for this kind of data. In the code below, the fixed effect gives the overall trend. The random effects indicate how the trend for each individual state differs from the global trend. The correlation structure takes the temporal autocorrelation into account. Have a look at Pinheiro & Bates (Mixed-Effects Models in S and S-PLUS).
library(nlme)
# d is the state/year/response data frame used in the examples above
lme(response ~ year, random = ~ year | state,
    correlation = corAR1(form = ~ year | state), data = d)
A nice solution using data.table was posted here in CrossValidated by @Zach. I'd just add that it is also possible to obtain the coefficient of determination R^2 for each group:
## make fake data
library(data.table)
set.seed(1)
dat <- data.table(x=runif(100), y=runif(100), grp=rep(1:2,50))
## calculate R^2 for each group
dat[,summary(lm(y~x))$r.squared,by=grp]
grp V1
1: 1 0.01465726
2: 2 0.02256595
as well as the other output from summary(lm):
dat[,list(r2=summary(lm(y~x))$r.squared , f=summary(lm(y~x))$fstatistic[1] ),by=grp]
grp r2 f
1: 1 0.01465726 0.714014
2: 2 0.02256595 1.108173
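The call above fits the same model twice per group, once for each statistic. A possible variation, sketched here under the same fake-data assumptions, is to wrap the j expression in braces so the summary is computed once and reused; the local name s and the extra slope column are mine, added only for illustration.

```r
library(data.table)

set.seed(1)
dat <- data.table(x = runif(100), y = runif(100), grp = rep(1:2, 50))

# Fit once per group, keep the summary in a local variable,
# and return several statistics from that single fit
res <- dat[, {
  s <- summary(lm(y ~ x))
  list(r2    = s$r.squared,
       f     = s$fstatistic[1],
       slope = s$coefficients["x", "Estimate"])
}, by = grp]
res
```

With many groups or large groups this avoids repeating the fit for every column requested.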
I think it's worthwhile to add the purrr::map approach to this problem.
library(tidyverse)
d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
year=rep(1:10, 2),
response=c(rnorm(10), rnorm(10)))
d %>%
group_by(state) %>%
nest() %>%
mutate(model = map(data, ~lm(response ~ year, data = .)))
See @Paul Hiemstra's answer for further ideas on using the broom package with these results.
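As a rough sketch of how the nested models might be combined with broom: map tidy() and glance() over the model column, then unnest the results. The column names tidied and glanced, the ungroup() step and the set.seed() call are my own choices for this illustration. This route does not rely on passing a grouped data frame of models directly to tidy(), which, as far as I know, is no longer supported in later broom releases.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(1)
d <- data.frame(state = rep(c("NY", "CA"), c(10, 10)),
                year = rep(1:10, 2),
                response = rnorm(20))

nested <- d %>%
  group_by(state) %>%
  nest() %>%
  ungroup() %>%
  mutate(model   = map(data, ~ lm(response ~ year, data = .x)),
         tidied  = map(model, tidy),     # coefficient table per state
         glanced = map(model, glance))   # one-row fit summary per state

# One row per state and term (estimate, std.error, statistic, p.value)
coef_table <- nested %>% select(state, tidied) %>% unnest(tidied)

# One row per state (r.squared, adj.r.squared, sigma, ...)
fit_table <- nested %>% select(state, glanced) %>% unnest(glanced)
```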
I know my answer comes a bit late, but I was looking for similar functionality. It would seem the built-in function 'by' in R can also do the grouping easily:
?by contains the following example, which fits a model per group and extracts the coefficients with sapply:
require(stats)
## now suppose we want to extract the coefficients by group
tmp <- with(warpbreaks,
by(warpbreaks, tension,
function(x) lm(breaks ~ wool, data = x)))
sapply(tmp, coef)
## make fake data
ngroups <- 2
group <- 1:ngroups
nobs <- 100
dta <- data.frame(group=rep(group,each=nobs),y=rnorm(nobs*ngroups),x=runif(nobs*ngroups))
head(dta)
#--------------------
group y x
1 1 0.6482007 0.5429575
2 1 -0.4637118 0.7052843
3 1 -0.5129840 0.7312955
4 1 -0.6612649 0.9028034
5 1 -0.5197448 0.1661308
6 1 0.4240346 0.8944253
#------------
## function to extract the results of one model
foo <- function(z) {
## coef and se in a data frame
mr <- data.frame(coef(summary(lm(y~x,data=z))))
## put row names (predictors/indep variables)
mr$predictor <- rownames(mr)
mr
}
## see that it works
foo(subset(dta,group==1))
#=========
Estimate Std..Error t.value Pr...t.. predictor
(Intercept) 0.2176477 0.1919140 1.134090 0.2595235 (Intercept)
x -0.3669890 0.3321875 -1.104765 0.2719666 x
#----------
## one option: use command by
res <- by(dta,dta$group,foo)
res
#=========
dta$group: 1
Estimate Std..Error t.value Pr...t.. predictor
(Intercept) 0.2176477 0.1919140 1.134090 0.2595235 (Intercept)
x -0.3669890 0.3321875 -1.104765 0.2719666 x
------------------------------------------------------------
dta$group: 2
Estimate Std..Error t.value Pr...t.. predictor
(Intercept) -0.04039422 0.1682335 -0.2401081 0.8107480 (Intercept)
x 0.06286456 0.3020321 0.2081387 0.8355526 x
## using package plyr is better
library(plyr)
res <- ddply(dta,"group",foo)
res
#----------
group Estimate Std..Error t.value Pr...t.. predictor
1 1 0.21764767 0.1919140 1.1340897 0.2595235 (Intercept)
2 1 -0.36698898 0.3321875 -1.1047647 0.2719666 x
3 2 -0.04039422 0.1682335 -0.2401081 0.8107480 (Intercept)
4 2 0.06286456 0.3020321 0.2081387 0.8355526 x
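If plyr is not available, a similar stacked data frame can be assembled in base R. The following is a sketch only: it regenerates fake data of the same shape, reuses foo as defined above, and the helper name pieces is mine.

```r
## fake data in the same shape as above
set.seed(1)
dta <- data.frame(group = rep(1:2, each = 100),
                  y = rnorm(200), x = runif(200))

## same extractor as above: coefficient table of one model as a data frame
foo <- function(z) {
  mr <- data.frame(coef(summary(lm(y ~ x, data = z))))
  mr$predictor <- rownames(mr)
  mr
}

## apply foo to each group, then add the group label and stack the pieces
pieces <- lapply(split(dta, dta$group), foo)
res <- do.call(rbind, Map(function(g, df) cbind(group = g, df),
                          names(pieces), pieces))
rownames(res) <- NULL
res
```

Note that the group column comes back as character here because it is taken from the list names.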
The lm() function above is a simple example. By the way, I imagine that your database has columns of the following form:
year state var1 var2 y...
In my view, you can use the following code:
# data = your data frame; state is the label of the states column
# pass each piece to lm() through the data argument instead of using attach()
modell <- by(data, data$state, function(df) lm(y ~ I(1/var1) + I(1/var2), data = df))
# summarise each fitted model
lapply(modell, summary)
The question seems to be about how to call regression functions with formulas that are modified inside a loop.
Here is how you can do it (using the diamonds dataset):
dat <- ggplot2::diamonds
strCols <- names(dat)
formula <- list(); model <- list()
# column 7 is price; widen the range (e.g. 1:3) to loop over more predictors
for (i in 1:1) {
  # build the formula "price ~ <next column>" from the column names
  formula[[i]] <- as.formula(paste0(strCols[7], " ~ ", strCols[7 + i]))
  model[[i]] <- glm(formula[[i]], data = dat)
  # then you can plot the results or anything else ...
  png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7 + i]))
  par(mfrow = c(2, 2))
  plot(model[[i]])
  dev.off()
}
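A related sketch, using only base R: reformulate() builds a formula from character names, which avoids pasting strings together. The mtcars dataset and the choice of mpg as the response are assumptions made purely for this illustration and are not part of the answer above.

```r
# mtcars ships with R; mpg is used as the response purely for illustration
predictors <- c("wt", "hp", "disp")

models <- lapply(predictors, function(p) {
  f <- reformulate(p, response = "mpg")  # builds mpg ~ wt, mpg ~ hp, ...
  lm(f, data = mtcars)
})
names(models) <- predictors

# R^2 of each single-predictor model
sapply(models, function(m) summary(m)$r.squared)
```

Because the data set is passed through the data argument, nothing needs to be attached to the search path.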