
R: Storing regression coefficients in data frame column by group

I have a data frame with results from a survey. The results are stored in a verticalized (long) format. The data frame looks like this:

set.seed(1000)

df = data.frame(RESP_ID=c(rep(1,6),rep(2,8),rep(3,9),rep(4,5),rep(5,4),rep(6,10),rep(7,4),rep(8,8),rep(9,8),rep(10,10)),
                CLIENT=c(rep("A",6),rep("A",8),rep("A",9),rep("A",5),rep("A",4),rep("B",10),rep("B",4),rep("B",8),rep("B",8),rep("B",10)),
                QST=c(paste0("Q",c(1:6)),paste0("Q",c(1:8)),paste0("Q",c(1:9)),paste0("Q",c(1:5)),paste0("Q",c(1:4)),paste0("Q",c(1:10)),paste0("Q",c(1:4)),paste0("Q",c(1:8)),paste0("Q",c(1:8)),paste0("Q",c(1:10))),
                VALUE=round(runif(72,1,4),0))

Description of dataframe

RESP_ID = Respondent ID. Each ID corresponds to a single respondent. In this sample data frame, we have 10 respondents.

CLIENT = Corresponds to the name of the client whose respondents were surveyed. In this sample data frame, we have two clients (A & B).

QST = Corresponds to the question number in the survey.

VALUE = Corresponds to the answer option for the question. All questions have 4 answer options (1 to 4).
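As a side note, the same long-format layout can be built a little more compactly, which also makes the "different number of questions per respondent" structure easier to see. This is only a sketch in base R; the helper vector `n_q` (questions answered per respondent) is a name introduced here for illustration and is not part of the original post.

```r
set.seed(1000)

# Hypothetical helper: number of questions answered by respondents 1..10
n_q <- c(6, 8, 9, 5, 4, 10, 4, 8, 8, 10)

df <- data.frame(
  RESP_ID = rep(1:10, times = n_q),
  # respondents 1-5 belong to client A, respondents 6-10 to client B
  CLIENT  = rep(c("A", "B"), times = c(sum(n_q[1:5]), sum(n_q[6:10]))),
  # sequence(n_q) yields 1:6, 1:8, 1:9, ... one run per respondent
  QST     = paste0("Q", sequence(n_q)),
  VALUE   = round(runif(sum(n_q), 1, 4), 0)
)

# One row per answered question; counts differ by respondent
table(df$RESP_ID)
```

The tabulation at the end shows that not every respondent answered every question, which matters later when the data is spread to wide format.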

Objective

For each client and question combination, I'd like to create a separate column that stores the regression coefficient for that question when Q2 (from the QST column) is regressed on it.

So in the regression model, Q2 is the dependent variable, and all other questions are the independent variables.

My attempt

My attempt is not giving me the result I want.

slopesdf = df %>%
  spread(QST, VALUE, fill = 0) %>%
  group_by(CLIENT) %>%
  mutate(COEFFICIENT=lm(Q2 ~ .))

I am trying to first group by CLIENT & QST and then find the slopes for each question regressed with Q2. I'm sure there's a better way of doing this.

Currently, my attempt gives me the following error message (translated from my French locale):

Error in mutate_impl(.data, dots) : Evaluation error: '.' in formula and no 'data' argument
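The message itself comes from `lm()`: a `.` on the right-hand side of a formula can only be expanded when a `data` argument is supplied. A minimal base R sketch of that behaviour, using a small made-up wide data frame (`wide`, with hypothetical values) rather than the survey data:

```r
# Hypothetical wide-format data: one row per respondent, one column per question
wide <- data.frame(
  Q2 = c(1, 2, 3, 4, 3),
  Q1 = c(2, 1, 4, 3, 4),
  Q3 = c(1, 3, 2, 4, 1)
)

# Without `data`, lm() cannot expand the `.`, so it raises an error;
# tryCatch() captures the message instead of stopping the script
err <- tryCatch(lm(Q2 ~ .), error = function(e) conditionMessage(e))
print(err)

# With `data`, the `.` expands to every column other than the response
fit <- lm(Q2 ~ ., data = wide)
coefs <- coef(fit)
print(coefs)
```

Even with `data` supplied, `mutate()` would still need a value it can store in a column, so the model (or its coefficients) has to be extracted or wrapped in a list before it can be kept per group.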

Desired output

I'd like the final data frame to contain three columns: one for CLIENT, one for QST and a third called COEFFICIENT with the coefficients for each combination of CLIENT and QST regressed with Q2 as response variable.

Any help on this would be greatly appreciated.

I'm not 100% sure that this output is what you're after, but is this on the right track?

df2 <- df %>%
  spread(QST, VALUE, fill = 0) %>%
  split(.$CLIENT) %>%
  lapply(., function(x) { lm(Q2 ~ ., x[, -c(1,2)])$coefficients }) %>%
  do.call(rbind, .) %>%
  data.frame(.) %>%
  mutate(CLIENT = rownames(.)) %>%
  gather(QST, COEFFICIENT, -CLIENT) %>%
  arrange(CLIENT)


> df2
   CLIENT          QST   COEFFICIENT
1       A X.Intercept. -1.200000e+01
2       A           Q1  1.000000e+00
3       A          Q10            NA
4       A           Q3  2.000000e+00
5       A           Q4  3.000000e+00
6       A           Q5  5.000000e-01
7       A           Q6            NA
8       A           Q7            NA
9       A           Q8            NA
10      A           Q9            NA
11      B X.Intercept.  5.000000e+00
12      B           Q1 -1.326970e-16
13      B          Q10  1.666667e+00
14      B           Q3  3.726559e-15
15      B           Q4 -2.000000e+00
16      B           Q5            NA
17      B           Q6            NA
18      B           Q7            NA
19      B           Q8            NA
20      B           Q9            NA

Edit:

Running only the splitting component generates a list of wide-format data frames, one for each client:

df %>%
  spread(QST, VALUE, fill = 0) %>%
  split(.$CLIENT) 

$A
  RESP_ID CLIENT Q1 Q10 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
1       1      A  4   0  1  4  3  3  2  0  0  0
2       2      A  2   0  2  2  3  2  4  4  3  0
3       3      A  2   0  2  3  3  1  2  4  2  3
4       4      A  3   0  3  4  2  1  0  0  0  0
5       5      A  3   0  4  4  3  0  0  0  0  0

$B
   RESP_ID CLIENT Q1 Q10 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
6        6      B  3   2  3  2  3  2  2  1  3  3
7        7      B  2   0  3  2  2  0  0  0  0  0
8        8      B  3   0  2  4  1  3  3  2  3  0
9        9      B  2   0  1  4  2  1  3  1  2  0
10      10      B  3   2  3  3  3  3  4  2  3  3

Note that all the zeroes are filling in for questions where your original data had no values, i.e. where a question wasn't answered. See Ben Bolker's answer on that point.
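To see how many cells are being filled in this way, one option is to cross-tabulate respondents against questions before spreading. The sketch below uses a tiny made-up long-format data frame (hypothetical values) so it runs on its own; the same two lines can be applied to the survey `df`.

```r
# Hypothetical long-format data: respondent 2 never answered Q3
df <- data.frame(
  RESP_ID = c(1, 1, 1, 2, 2),
  QST     = c("Q1", "Q2", "Q3", "Q1", "Q2"),
  VALUE   = c(4, 1, 3, 2, 2)
)

# TRUE where a respondent/question pair exists in the data
answered <- table(df$RESP_ID, df$QST) > 0

# Number of respondents with no answer for each question;
# these are the cells that fill = 0 would turn into zeroes
missing_per_question <- colSums(!answered)
print(missing_per_question)
```

A question with many missing answers ends up as a column that is mostly zeroes, and the regression treats those zeroes as real responses.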

If you now include the code to run lm on each of those, you get the coefficient values directly, which include NA values like those seen above:

> df %>%
+   spread(QST, VALUE, fill = 0) %>%
+   split(.$CLIENT) %>%
+   lapply(., function(x) { lm(Q2 ~ ., x[, -c(1,2)])$coefficients })
$A
(Intercept)          Q1         Q10          Q3          Q4          Q5          Q6          Q7          Q8          Q9 
  6.6666667   2.0000000          NA  -1.6666667  -0.6666667  -1.6666667          NA          NA          NA          NA 

$B
(Intercept)          Q1         Q10          Q3          Q4          Q5          Q6          Q7          Q8          Q9 
       13.0        -3.0        -0.5        -2.0          NA         2.0          NA          NA          NA          NA 

Edit 2:

Just to explore with a more complete dataset, suppose we use this df:

set.seed(42)
df <-
  expand.grid(RESP_ID = 1:10,
              CLIENT = c("A", "B"),
              QST = paste("Q", 1:10, sep = "")) %>%
  mutate(VALUE = round(runif(200, 1, 4), 0))

and run the same code, we get coefficients without NA values:

> df %>%
+   spread(QST, VALUE, fill = 0) %>%
+   split(.$CLIENT) %>%
+   lapply(., function(x) { lm(Q2 ~ ., x[, -c(1,2)])$coefficients }) %>%
+   do.call(rbind, .) %>%
+   data.frame(.) %>%
+   mutate(CLIENT = rownames(.)) %>%
+   gather(QST, COEFFICIENT, -CLIENT) %>%
+   arrange(CLIENT)
   CLIENT          QST COEFFICIENT
1       A X.Intercept.  6.50000000
2       A           Q1 -4.14285714
3       A           Q3  2.50000000
4       A           Q4  0.85714286
5       A           Q5  1.00000000
6       A           Q6 -0.64285714
7       A           Q7 -1.21428571
8       A           Q8 -1.85714286
9       A           Q9  2.50000000
10      A          Q10 -0.07142857
11      B X.Intercept. -4.69924812
12      B           Q1 -0.86466165
13      B           Q3  1.56390977
14      B           Q4  1.10150376
15      B           Q5 -0.86842105
16      B           Q6  0.87593985
17      B           Q7  0.57142857
18      B           Q8  0.25187970
19      B           Q9  0.79699248
20      B          Q10 -0.12781955
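One likely reason for the difference: in the first data set each client has only 5 respondents but the model has 10 coefficients (intercept plus 9 questions), so at most 5 of them can be estimated and `lm()` reports the rest as NA. With 10 respondents per client there are as many rows as coefficients. A small base R sketch with made-up random data (not the survey data) shows the mechanism:

```r
set.seed(1)

# Hypothetical wide data: 5 respondents, Q2 plus 9 other questions
d <- data.frame(matrix(round(runif(5 * 10, 1, 4)), nrow = 5))
names(d) <- c("Q2", paste0("Q", c(1, 3:10)))

fit <- lm(Q2 ~ ., data = d)

# The rank of the fit cannot exceed the number of rows
print(fit$rank)

# Coefficients beyond the rank are aliased and come back as NA
n_na <- sum(is.na(coef(fit)))
print(n_na)
```

Note that a fit with exactly as many rows as coefficients reproduces the data perfectly, so the coefficients in that case carry no information about uncertainty.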

A solution that follows the logic in my brain: we need to have Q2 available as a separate variable, and once we rearrange the data in that way, we can run the regression by group. (The NA values are definitely due to deficiencies in this tiny data set: cases where there's no variation in the predictor, so the slope can't be estimated.)

(df
    %>% group_by(RESP_ID,CLIENT)
    ## add a new variable for Q2
    %>% mutate(Q2=VALUE[QST=="Q2"])
    ## drop the old one
    %>% filter(QST!="Q2")
    %>% group_by(CLIENT,QST)
    ## run the regression by group; return a data frame
    %>% do(as.data.frame(rbind(coef(lm(Q2~VALUE,data=.)))))
    ## convert wide coefficients to long
    %>% tidyr::gather(coef,value,-c(CLIENT,QST))
    %>% arrange(CLIENT)
)
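For comparison, here is a rough base R sketch of the same idea (attach Q2 to every other answer from the same respondent, then fit one simple regression per client and question). It swaps the dplyr pipeline for `merge()`, `split()` and `lapply()`, and it uses a small made-up data set with hypothetical values so that it is self-contained.

```r
# Hypothetical long-format data: 6 respondents, 2 clients, 3 questions
df <- data.frame(
  RESP_ID = rep(1:6, times = 3),
  CLIENT  = rep(rep(c("A", "B"), each = 3), times = 3),
  QST     = rep(c("Q1", "Q2", "Q3"), each = 6),
  VALUE   = c(2, 4, 6, 2, 4, 8,   # Q1
              1, 2, 3, 1, 2, 4,   # Q2
              3, 2, 1, 4, 3, 1)   # Q3
)

# Pull Q2 out as its own column, keyed by respondent
q2 <- df[df$QST == "Q2", c("RESP_ID", "VALUE")]
names(q2)[2] <- "Q2"

# Attach each respondent's Q2 answer to their other answers
long <- merge(df[df$QST != "Q2", ], q2, by = "RESP_ID")

# One simple regression per CLIENT x QST group; keep only the slope
groups <- split(long, list(long$CLIENT, long$QST), drop = TRUE)
slopes <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    CLIENT      = g$CLIENT[1],
    QST         = g$QST[1],
    COEFFICIENT = unname(coef(lm(Q2 ~ VALUE, data = g))["VALUE"])
  )
}))
print(slopes)
```

Like the code above, this fits a separate one-predictor model for each question, which is a different model from the single multiple regression `Q2 ~ .` used in the other answers, so the coefficients are not expected to match.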

For tasks like this, I like the "many models" approach from R for Data Science. It fits in with the tidyverse style, using nested data frames and purrr::map to create a list-column of models. Then broom::tidy provides utilities for extracting the information you need about the models.

I dropped the ID column just to get it out of the way after the data was spread, and grouped and nested by CLIENT:

library(tidyverse)

df %>%
  spread(key = QST, value = VALUE, fill = 0) %>%
  select(-RESP_ID) %>%
  group_by(CLIENT) %>%
  nest()
#> # A tibble: 2 x 2
#>   CLIENT data             
#>   <fct>  <list>           
#> 1 A      <tibble [5 × 10]>
#> 2 B      <tibble [5 × 10]>

After that, create a column of linear models. Passing quick = T to broom::tidy returns a simplified version of the coefficient table; without setting that, you'd also get the standard error, test statistic, and p-value for each variable in the model.

df %>%
  spread(key = QST, value = VALUE, fill = 0) %>%
  select(-RESP_ID) %>%
  group_by(CLIENT) %>%
  nest() %>%
  mutate(lm_mod = map(data, function(d) lm(Q2 ~ ., data = d))) %>%
  mutate(mod_tidy = map(lm_mod, broom::tidy, quick = T)) %>%
  unnest(mod_tidy) %>%
  head()
#> # A tibble: 6 x 3
#>   CLIENT term        estimate
#>   <fct>  <chr>          <dbl>
#> 1 A      (Intercept)    2.67 
#> 2 A      Q1             0.333
#> 3 A      Q10           NA    
#> 4 A      Q3            -0.333
#> 5 A      Q4            -1.   
#> 6 A      Q5             1.
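To get from here to the three-column CLIENT / QST / COEFFICIENT layout the question asks for, something along the following lines may work. This is a sketch under several assumptions: it uses `tidyr::pivot_wider()` in place of `spread()`, and it swaps `broom::tidy()` for `coef()` plus `tibble::enframe()` so that only the estimates are kept. The data is a made-up complete data set (every respondent answers every question), not the survey data.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)

# Hypothetical complete data: 20 respondents, 2 clients, 4 questions
df <- expand.grid(RESP_ID = 1:20,
                  QST = paste0("Q", 1:4),
                  stringsAsFactors = FALSE) %>%
  mutate(CLIENT = ifelse(RESP_ID <= 10, "A", "B"),
         VALUE  = sample(1:4, n(), replace = TRUE))

coefs <- df %>%
  # long -> wide: one column per question
  pivot_wider(names_from = QST, values_from = VALUE) %>%
  select(-RESP_ID) %>%
  # one nested data frame per client
  nest(data = -CLIENT) %>%
  # fit Q2 ~ all other questions, keep the coefficients as a two-column tibble
  mutate(coef_tbl = map(data, function(d) {
    tibble::enframe(coef(lm(Q2 ~ ., data = d)),
                    name = "QST", value = "COEFFICIENT")
  })) %>%
  select(CLIENT, coef_tbl) %>%
  unnest(coef_tbl)

print(coefs)
```

The result has one row per client and model term, including the intercept; filtering out `QST == "(Intercept)"` would leave only the question slopes.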
