Linear regression on subsets with dependent variable per column using dlply() in R

I would like to automatically fit a separate linear regression for each category in a data frame.

My data frame includes one column with time categories (timepoint), one column (slope$Abs) as the dependent variable, and several columns that should be used as the independent variables.

head(slope)
   timepoint   Abs      In1      In2      In3     Out1     Out2     Out3 ...
1:        t0 275.0 2.169214 2.169214 2.169214 2.069684 2.069684 2.069684
2:        t0 275.5 2.163937 2.163937 2.163937 2.063853 2.063853 2.063853
3:        t0 276.0 2.153298 2.158632 2.153298 2.052088 2.052088 2.057988
4: ...

All in all, for each timepoint I have 40 variables, and I want to end up with a linear regression for each combination, such as In1~Abs[t0], In1~Abs[t1], and so on for each column. Of course I can do this manually, but I guess there must be a more elegant way to do the work.

I did my research and found out that dlply() might be the function I'm looking for. However, my attempt results in an error.

I tried to combine the answers from two previous questions I found: one on individual variables per column, and one on subsets per category.

I came up with a function like this:

lm.fun <- function(x) {summary(lm(x ~ slope$Abs, data=slope))}
lm.list <- dlply(.data=slope, .variables=slope$timepoint, .fun=lm.fun )

But I get the following error:

Error in eval.quoted(.variables, data) : 
   envir must be either NULL, a list, or an environment.

Hope someone can help me out.

Thanks a lot in advance!

Based on my research, the dplyr package in R does not do well at accepting formulas of the form y~x in its functions. So the alternative is to calculate the coefficients somewhat manually. First, note that slope = cor(x,y)*sd(y)/sd(x) (reference found here: http://faculty.cas.usf.edu/mbrannick/regression/regbas.html ) and that intercept = mean(y) - slope*mean(x) . The least-squares line of a simple linear regression always passes through the centroid (mean(x), mean(y)), which is why the intercept is computed from the means. Plugging in a single observation instead gives the intercept of a line through that one point, not the intercept of the fitted line.
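
As a quick sanity check of the two formulas above, here is a minimal sketch that compares them with lm() on R's built-in mtcars data. The choice of wt as x and mpg as y is only an illustrative assumption.

```r
# Sketch: check the closed-form slope and intercept against lm() on mtcars.
x <- mtcars$wt    # predictor (illustrative choice)
y <- mtcars$mpg   # response  (illustrative choice)

# Closed-form estimates for simple linear regression of y on x
slope_manual     <- cor(x, y) * sd(y) / sd(x)
intercept_manual <- mean(y) - slope_manual * mean(x)

# Reference fit
fit <- lm(y ~ x)

# Both comparisons should hold up to floating-point tolerance
slope_ok     <- isTRUE(all.equal(unname(coef(fit)["x"]), slope_manual))
intercept_ok <- isTRUE(all.equal(unname(coef(fit)["(Intercept)"]), intercept_manual))
```

If both flags are TRUE, the manual route and lm() agree, so the table-based computation that follows is only a vectorised version of the same arithmetic.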

For this explanation, I will be using the mtcars data set. I only wanted a subset of the data, so I am using the variables c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec') to mimic your dataset. In my example, the grouping variable is 'cyl' , which is the equivalent of your 'timepoint' variable. The variable 'mpg' is the y -variable in this case, which is equivalent to 'Abs' in your data.

Based on the formulas for slope and intercept above, we need three tables: the correlations of y with each x for each group, the standard deviations of each variable for each group, and the means of each variable for each group.

To get the correlation table, group by 'cyl' and calculate the correlation coefficients of y with every other variable:

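library(dplyr)  # provides %>%, group_by(), do()
df <- mtcars[c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec')]
# keep only the first row of each group's correlation matrix: cor(mpg, everything)
corrs <- data.frame(df %>% group_by(cyl) %>% do(head(data.frame(cor(.[,c(1,3:7)])), n = 1)))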

Because of the way my dataset is structured, the second variable (df[ ,2]) is 'cyl' , so I skip it. For your data, you should use

do(head(data.frame(cor(.[,c(2:40)])), n = 1))

since your first column is the grouping variable and it is not numeric. Essentially, you want to go across all numeric variables. Without head you get the full correlation matrix, but since you want the slope for each x -variable separately, you only need the row for your y -variable, that is, the row in which the correlation of y with itself is 1 ( r_yy = 1 ).

To get the standard deviation and mean of each variable within each group, use

# summarise_all() replaces the deprecated summarise_each(funs(...))
sds     <- data.frame(df %>% group_by(cyl) %>% summarise_all(sd))
means   <- data.frame(df %>% group_by(cyl) %>% summarise_all(mean))

The group names end up in the first column, so use them as row names for corrs , sds , and means , and then delete column 1.

rownames(corrs) <- rownames(means) <- rownames(sds) <- corrs[ ,1]
corrs <- corrs[ ,-1]; sds <- sds[ ,-1]; means <- means[ ,-1]

Now we need to calculate sd(y)/sd(x) . The best way I have found to do this is with a function from the apply family.

sdst <- data.frame(t(apply(sds, 1, function(X) X[1]/X)))

I use X[1] because the first variable in sds is my y -variable. In your data, the first variable after you have deleted timepoint is Abs , which is your y -variable, so the same index works.

The rest is pretty straightforward. Since everything is saved as a data frame, all you need to do to find the slopes and intercepts is

slopes    <- sdst*corrs
inter     <- slopes*means
intercept <- data.frame(t(apply(inter, 1, function(x) x[1]-x)))

Again, since our y -variable is in the first column, we use x[1] . To check that all is well, the slope in the y -variable's own column should be 1 and its intercept should be 0.
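
The whole table-based computation can be cross-checked against lm() . The following is a minimal base-R sketch under the same mtcars setup. It deliberately swaps the dplyr calls for split() and sapply() so that it runs without extra packages, and the names y_name , x_names , manual , and reference are hypothetical helpers introduced only for this illustration.

```r
# Base-R sketch of the grouped closed-form computation (no dplyr needed).
df <- mtcars[c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")]
y_name  <- "mpg"                                  # response, plays the role of Abs
x_names <- setdiff(names(df), c(y_name, "cyl"))   # predictors
groups  <- split(df, df$cyl)                      # one data frame per group

# For each group: a 2 x p matrix with rows "slope" and "intercept"
manual <- lapply(groups, function(g) {
  sapply(x_names, function(v) {
    b <- cor(g[[v]], g[[y_name]]) * sd(g[[y_name]]) / sd(g[[v]])
    c(slope = b, intercept = mean(g[[y_name]]) - b * mean(g[[v]]))
  })
})

# Reference: the same coefficients from lm(), fitted per group and predictor
reference <- lapply(groups, function(g) {
  sapply(x_names, function(v) {
    cf <- coef(lm(reformulate(v, response = y_name), data = g))
    c(slope = unname(cf[2]), intercept = unname(cf[1]))
  })
})

# Should be TRUE up to floating-point tolerance
agree <- isTRUE(all.equal(manual, reference))
```

Each element of manual corresponds to one value of cyl , and each column to one predictor, so manual[["4"]]["slope", "wt"] is the slope of mpg on wt among the four-cylinder cars.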

I have solved the issue with a simpler approach, so I wanted to update the answer.

To make life easier, I restructured the data frame so that all measurement columns are stacked into rows, using the melt() function of the reshape package.

slope <- melt(slope, id = c("Abs", "timepoint"), variable_name = "Sites")

The stacked measurements end up in a column that is named "value" by default.

Then create one column that combines both grouping variables with paste() .

slope$FullTreat <- paste(slope$Sites,slope$timepoint, sep="_")

Run a function through the dataset to create a separate model for each treatment combination.

models <- dlply(slope, ~ FullTreat, function(df) { 
          lm(value ~ Abs, data = df)
          })

To extract the coefficients, simply run

coefs <- ldply(models, coef)

Then split the FullTreat column into separate columns again with colsplit() , also from reshape , and add the intercept and slope to the new data frame:

coefs <- cbind(colsplit(coefs$FullTreat, split="_",
         c("Sites","Timepoint")), coefs[,2:3])
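
For readers without the original slope data, the steps above can be sketched on mtcars . This is a base-R stand-in, not the plyr/reshape code itself: split() and lapply() take the place of dlply() , and strsplit() takes the place of colsplit() . Here mpg stands in for Abs , cyl for timepoint , and the columns disp , hp , and wt for the measurement columns; the names wide , long , measures , cf , and key are hypothetical.

```r
# Base-R sketch of the reshape -> fit -> collect pipeline on mtcars.
wide     <- mtcars[c("mpg", "cyl", "disp", "hp", "wt")]
measures <- c("disp", "hp", "wt")

# Wide to long: one row per (observation, measured column)
long <- data.frame(
  mpg   = rep(wide$mpg, times = length(measures)),
  cyl   = rep(wide$cyl, times = length(measures)),
  Sites = rep(measures, each = nrow(wide)),
  value = unlist(wide[measures], use.names = FALSE)
)

# One model per Sites x cyl combination
long$FullTreat <- paste(long$Sites, long$cyl, sep = "_")
models <- lapply(split(long, long$FullTreat),
                 function(d) lm(value ~ mpg, data = d))

# Collect intercepts and slopes, then split the key back into its two parts
cf  <- do.call(rbind, lapply(models, coef))
key <- do.call(rbind, strsplit(rownames(cf), "_", fixed = TRUE))
coefs <- data.frame(Sites = key[, 1], Timepoint = key[, 2],
                    Intercept = cf[, 1], Slope = cf[, 2])
rownames(coefs) <- NULL
coefs
```

Note that splitting on "_" assumes that neither the site names nor the timepoint labels contain an underscore themselves; if they do, a different separator in paste() avoids ambiguous keys.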

I haven't worked on a function that plots all the regressions from the models, but I guess this is feasible with the ldply() function.
