Linear regression on subsets with dependent variable per column using dlply() in R
I would like to automatically produce linear regressions for a data frame, separately for each category.
My data frame includes one column with time categories, one column (slope$Abs) as the dependent variable, and several columns that should be used as the independent variables.
head(slope)
timepoint Abs In1 In2 In3 Out1 Out2 Out3 ...
1: t0 275.0 2.169214 2.169214 2.169214 2.069684 2.069684 2.069684
2: t0 275.5 2.163937 2.163937 2.163937 2.063853 2.063853 2.063853
3: t0 276.0 2.153298 2.158632 2.153298 2.052088 2.052088 2.057988
4: ...
In total I have 40 variables for each timepoint, and I want to end up with a linear regression for each combination, such as In1~Abs[t0], In1~Abs[t1], and so on for each column. Of course I can do this manually, but I guess there must be a more elegant way to do the work.
I did some research and found that dlply() might be the function I'm looking for. However, my attempt results in an error.
I tried to combine the answers to two earlier questions I found, one on individual variables per column and one on subsets per category, and came up with a function like this:
lm.fun <- function(x) {summary(lm(x ~ slope$Abs, data=slope))}
lm.list <- dlply(.data=slope, .variables=slope$timepoint, .fun=lm.fun )
But I get the following error:
Error in eval.quoted(.variables, data) :
envir must be either NULL, a list, or an environment.
I hope someone can help me out.
Thanks a lot in advance!
As far as I can tell from my research, the functions in R's dplyr package do not readily accept formulas of the form y~x. The alternative is to calculate the regression somewhat manually. First, note that slope = cor(x,y)*sd(y)/sd(x) (reference: http://faculty.cas.usf.edu/mbrannick/regression/regbas.html) and that intercept = mean(y) - slope*mean(x).
Simple linear regression uses the centroid (mean(x), mean(y)) as the point of reference when finding the intercept, because the least-squares line always passes through it. Using a single point will only get you the intercept of a line through that individual point, not the overall intercept.
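As a quick sanity check of the two formulas above, here is a minimal sketch that compares them with lm() on mtcars. The choice of wt as x and mpg as y, and the names b0 and b1, are assumptions made only for illustration.

```r
# Illustration only: x = wt and y = mpg are arbitrary choices from mtcars
x <- mtcars$wt
y <- mtcars$mpg

# Closed-form estimates for simple linear regression
b1 <- cor(x, y) * sd(y) / sd(x)   # slope
b0 <- mean(y) - b1 * mean(x)      # intercept

# Reference fit with lm(); coef() returns (intercept, slope)
fit <- lm(y ~ x)

# Compare the two results up to floating-point tolerance
formulas_match <- isTRUE(all.equal(unname(coef(fit)), c(b0, b1)))
formulas_match
```

If the formulas are applied correctly, formulas_match should be TRUE.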
For this explanation I will use the mtcars data set. I only wanted a subset of the data, so I am using the variables c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec') to mimic your dataset. In my example the grouping variable is 'cyl', which is the equivalent of your 'timepoint' variable. The variable 'mpg' is the y-variable in this case, which is equivalent to 'Abs' in your data.
Based on the explanation of slope and intercept above, we need three tables: a table of the correlations between y and each x for each group, a table of standard deviations for each variable and group, and a table of means for each variable and group.
To get the correlation table, group by 'cyl' and calculate the correlation coefficients within each group:
library(dplyr)
df <- mtcars[c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec')]
# Keep the first row of each group's correlation matrix: cor(mpg, x) for every x
corrs <- data.frame(df %>% group_by(cyl) %>% do(head(data.frame(cor(.[,c(1,3:7)])), n = 1)))
Because of the way my dataset is structured, the second variable (df[ ,2]) is 'cyl', so I skip it when selecting columns. For your data, you should use
do(head(data.frame(cor(.[,c(2:40)])), n = 1))
since your first column is the grouping variable and it is not numeric. Essentially, you want to go across all numeric variables. Leaving out head() will produce the full correlation matrix, but since you are interested in the slope for each x-variable independently of the others, you only need the row for your y-variable, that is, the row in which the correlation of y with itself is 1 (r_yy = 1).
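To make the head() step concrete, here is a small sketch for a single group. It assumes the mtcars example with cyl == 4 and uses only base R. The names g4, cm, and first_row are made up for illustration.

```r
# One group of the example data (cyl == 4), without the grouping column;
# mpg comes first because it is the y-variable
g4 <- subset(mtcars, cyl == 4, select = c(mpg, disp, hp, drat, wt, qsec))

cm <- cor(g4)          # full 6 x 6 correlation matrix for this group
first_row <- cm[1, ]   # the row that head(data.frame(cm), n = 1) keeps

# Correlation of y with itself, followed by cor(mpg, x) for every x
first_row
```

The first entry of first_row is the correlation of mpg with itself, and the remaining entries are the correlations that feed into the slope formula.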
To get the standard deviations and means for each group and each variable, use
# summarise_all() applies the function to every non-grouping column
sds <- data.frame(df %>% group_by(cyl) %>% summarise_all(sd))
means <- data.frame(df %>% group_by(cyl) %>% summarise_all(mean))
The group names will be in the first column of each of corrs, sds, and means, so use them as row names and then delete column 1.
rownames(corrs) <- rownames(means) <- rownames(sds) <- corrs[ ,1]
corrs <- corrs[ ,-1]; sds <- sds[ ,-1]; means <- means[ ,-1]
Now we need to calculate sd(y)/sd(x). The best way I have found, and seen used, is a function from the apply family.
sdst <- data.frame(t(apply(sds, 1, function(X) X[1]/X)))
I use X[1] because the first variable in sds is my y-variable. In your data, the first variable after you have deleted timepoint is Abs, which is your y-variable, so use that.
The rest is pretty straightforward. Since everything is saved as a data frame, all you need to do to find the slopes and intercepts is
slopes <- sdst*corrs
inter <- slopes*means
intercept <- data.frame(t(apply(inter, 1, function(x) x[1]-x)))
Again, since our y-variable is in the first column, we use x[1]. To check that all is well, the slope for your y-variable should be 1 and its intercept should be 0.
I have solved the issue with a simpler approach, so I wanted to update the answer.
To make life easier, I restructured the data frame so that all columns other than Abs and timepoint are converted into rows, using the melt() function of the reshape package.
library(reshape)
slope <- melt(slope, id = c("Abs", "timepoint"), variable_name = "Sites")
The column holding the measured values is named "value" by default.
Then create one column that combines both grouping variables with paste().
slope$FullTreat <- paste(slope$Sites,slope$timepoint, sep="_")
Run a function over the dataset to create a separate model for each treatment combination.
models <- dlply(slope, ~ FullTreat, function(df) {
  lm(value ~ Abs, data = df)
})
To extract the coefficients, simply run
coefs <- ldply(models, coef)
Then split the FullTreat column into separate columns again with colsplit(), also from reshape, and add the intercept and slope to the new data frame:
coefs <- cbind(colsplit(coefs$FullTreat, split="_",
c("Sites","Timepoint")), coefs[,2:3])
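For readers without the original data, the steps above can be sketched end to end on mtcars. This is only an illustration under assumptions: mpg stands in for Abs, cyl for timepoint, and disp, hp, and wt for the In*/Out* columns. The names wide and long are made up, and the plyr and reshape packages must be installed.

```r
library(plyr)
library(reshape)

# Stand-in data: mpg plays the role of Abs, cyl the role of timepoint
wide <- mtcars[c("mpg", "cyl", "disp", "hp", "wt")]

# Wide to long: one row per (observation, site), measurements in "value"
long <- melt(wide, id.vars = c("mpg", "cyl"), variable_name = "Sites")
long$FullTreat <- paste(long$Sites, long$cyl, sep = "_")

# One model per site/group combination
models <- dlply(long, ~ FullTreat, function(df) lm(value ~ mpg, data = df))

# Coefficients as a data frame, then split the combined key again
coefs <- ldply(models, coef)
coefs <- cbind(colsplit(coefs$FullTreat, split = "_",
                        names = c("Sites", "Timepoint")),
               coefs[, 2:3])
coefs
```

With three stand-in site columns and three cyl groups, coefs should hold nine rows, one per combination, each with its intercept and slope.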
I haven't worked on a function that plots all the regressions from the models, but I guess this is feasible with the ldply() function.