简体   繁体   中英

R Storing regression coefficients in data frame column by group

I have a data frame with results from a survey. The results are stored in a verticalized format. The data frame looks like this -

set.seed(1000)

df = data.frame(RESP_ID=c(rep(1,6),rep(2,8),rep(3,9),rep(4,5),rep(5,4),rep(6,10),rep(7,4),rep(8,8),rep(9,8),rep(10,10)),
                CLIENT=c(rep("A",6),rep("A",8),rep("A",9),rep("A",5),rep("A",4),rep("B",10),rep("B",4),rep("B",8),rep("B",8),rep("B",10)),
                QST=c(paste0("Q",c(1:6)),paste0("Q",c(1:8)),paste0("Q",c(1:9)),paste0("Q",c(1:5)),paste0("Q",c(1:4)),paste0("Q",c(1:10)),paste0("Q",c(1:4)),paste0("Q",c(1:8)),paste0("Q",c(1:8)),paste0("Q",c(1:10))),
                VALUE=round(runif(72,1,4),0))

Description of dataframe

RESP_ID = Respondent ID. Each ID correspondents to a single respondent. In this sample data frame, we have 10 respondents.

CLIENT = Correspondents to the name of the client whose respondents were surveyed. In this sample data frame, we have two clients (A & B).

QST = Corresponds to the question number in the survey.

VALUE = Corresponds to the answer option for the question. All questions have 4 answer options (1 to 4).

Objective

For each client and question combination, I'd like to create a separate column that stores the regression coefficient for that question regressed to Q2 in the QST column.

So in the regression model, Q2 is the dependent variable, and all other questions are the independent variables.

My attempt

My attempt is not giving me the result I want.

slopesdf = df %>%
  spread(QST, VALUE, fill = 0) %>%
  group_by(CLIENT) %>%
  mutate(COEFFICIENT=lm(Q2 ~ .))

I am trying to first group by CLIENT & QST and then find the slopes for each question regressed with Q2. I'm sure there's a better way of doing this.

Currently, my attempt gives me the following error message -

Error in mutate_impl(.data, dots) : Evaluation error: '.' dans la formule et pas d'argument 'data'

Desired output

I'd like the final data frame to contain three columns: one for CLIENT , one for QST and a third called COEFFICIENT with the coefficients for each combination of CLIENT and QST regressed with Q2 as response variable.

Any help on this would be greatly appreciated.

I'm not 100% sure that this output is what you're after, but, is this on the right track?

df2 <- df %>%
  spread(QST, VALUE, fill = 0) %>%
  split(.$CLIENT) %>%
  lapply(., function(x) { lm(Q2 ~ ., x[, -c(1,2)])$coefficients }) %>%
  do.call(rbind, .) %>%
  data.frame(.) %>%
  mutate(CLIENT = rownames(.)) %>%
  gather(QST, COEFFICIENT, -CLIENT) %>%
  arrange(CLIENT)


> df2
   CLIENT          QST   COEFFICIENT
1       A X.Intercept. -1.200000e+01
2       A           Q1  1.000000e+00
3       A          Q10            NA
4       A           Q3  2.000000e+00
5       A           Q4  3.000000e+00
6       A           Q5  5.000000e-01
7       A           Q6            NA
8       A           Q7            NA
9       A           Q8            NA
10      A           Q9            NA
11      B X.Intercept.  5.000000e+00
12      B           Q1 -1.326970e-16
13      B          Q10  1.666667e+00
14      B           Q3  3.726559e-15
15      B           Q4 -2.000000e+00
16      B           Q5            NA
17      B           Q6            NA
18      B           Q7            NA
19      B           Q8            NA
20      B           Q9            NA

Edit:

Running the splitting component only generates a list of wide-format dataframes for each client:

df %>%
  spread(QST, VALUE, fill = 0) %>%
  split(.$CLIENT) 

$A
  RESP_ID CLIENT Q1 Q10 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
1       1      A  4   0  1  4  3  3  2  0  0  0
2       2      A  2   0  2  2  3  2  4  4  3  0
3       3      A  2   0  2  3  3  1  2  4  2  3
4       4      A  3   0  3  4  2  1  0  0  0  0
5       5      A  3   0  4  4  3  0  0  0  0  0

$B
   RESP_ID CLIENT Q1 Q10 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
6        6      B  3   2  3  2  3  2  2  1  3  3
7        7      B  2   0  3  2  2  0  0  0  0  0
8        8      B  3   0  2  4  1  3  3  2  3  0
9        9      B  2   0  1  4  2  1  3  1  2  0
10      10      B  3   2  3  3  3  3  4  2  3  3

Note that all the zeroes are filling in for questions where your original data had no values- if a question wasn't answered. See Ben Bolker's answer on that point.

If you now include the code to run the lm on each of those, you get the coefficient values directly, which include the NA values seen above:

> df %>%
+   spread(QST, VALUE, fill = 0) %>%
+   split(.$CLIENT) %>%
+   lapply(., function(x) { lm(Q2 ~ ., x[, -c(1,2)])$coefficients })
$A
(Intercept)          Q1         Q10          Q3          Q4          Q5          Q6          Q7          Q8          Q9 
  6.6666667   2.0000000          NA  -1.6666667  -0.6666667  -1.6666667          NA          NA          NA          NA 

$B
(Intercept)          Q1         Q10          Q3          Q4          Q5          Q6          Q7          Q8          Q9 
       13.0        -3.0        -0.5        -2.0          NA         2.0          NA          NA          NA          NA 

Edit 2:

Just to explore with a more complete dataset, if we use this df :

set.seed(42)
df <-
  expand.grid(RESP_ID = 1:10,
              CLIENT = c("A", "B"),
              QST = paste("Q", 1:10, sep = "")) %>%
  mutate(VALUE = round(runif(200, 1, 4), 0))

and run the same code, we get coefficients without NA values:

> df %>%
+   spread(QST, VALUE, fill = 0) %>%
+   split(.$CLIENT) %>%
+   lapply(., function(x) { lm(Q2 ~ ., x[, -c(1,2)])$coefficients }) %>%
+   do.call(rbind, .) %>%
+   data.frame(.) %>%
+   mutate(CLIENT = rownames(.)) %>%
+   gather(QST, COEFFICIENT, -CLIENT) %>%
+   arrange(CLIENT)
   CLIENT          QST COEFFICIENT
1       A X.Intercept.  6.50000000
2       A           Q1 -4.14285714
3       A           Q3  2.50000000
4       A           Q4  0.85714286
5       A           Q5  1.00000000
6       A           Q6 -0.64285714
7       A           Q7 -1.21428571
8       A           Q8 -1.85714286
9       A           Q9  2.50000000
10      A          Q10 -0.07142857
11      B X.Intercept. -4.69924812
12      B           Q1 -0.86466165
13      B           Q3  1.56390977
14      B           Q4  1.10150376
15      B           Q5 -0.86842105
16      B           Q6  0.87593985
17      B           Q7  0.57142857
18      B           Q8  0.25187970
19      B           Q9  0.79699248
20      B          Q10 -0.12781955

A solution that follows the logic in my brain (we need to have Q2 available as a separate variable ... once we rearrange the data in that way, we can run . (The NA values are definitely due to deficiencies in this tiny data set - cases where there's no variation in the predictor, so the response can't be estimated ...)

(df
    %>% group_by(RESP_ID,CLIENT)
    ## add a new variable for Q2
    %>% mutate(Q2=VALUE[QST=="Q2"])
    ## drop the old one
    %>% filter(QST!="Q2")
    %>% group_by(CLIENT,QST)
    ## run the regression by group; return a data frame
    %>% do(as.data.frame(rbind(coef(lm(Q2~VALUE,data=.)))))
    ## convert wide coefficients to long
    %>% tidyr::gather(coef,value,-c(CLIENT,QST))
    %>% arrange(CLIENT)
)

For tasks like this, I like the "many models" approach from R for Data Science . It fits in with the tidyverse style, using nested data frames and purrr::map to create a list-column of models. Then broom::tidy provides utilities for extracting information you need about the models.

I dropped the ID column just to get it out of the way after the data was spread, and grouped and nested by CLIENT :

library(tidyverse)

df %>%
  spread(key = QST, value = VALUE, fill = 0) %>%
  select(-RESP_ID) %>%
  group_by(CLIENT) %>%
  nest()
#> # A tibble: 2 x 2
#>   CLIENT data             
#>   <fct>  <list>           
#> 1 A      <tibble [5 × 10]>
#> 2 B      <tibble [5 × 10]>

After that, create a column of linear models. Passing quick = T to broom::tidy returns a simplified version of the model diagnostics table; without setting that, you'd also get standard error, test statistic, and p-value for each variable in the model.

df %>%
  spread(key = QST, value = VALUE, fill = 0) %>%
  select(-RESP_ID) %>%
  group_by(CLIENT) %>%
  nest() %>%
  mutate(lm_mod = map(data, function(d) lm(Q2 ~ ., data = d))) %>%
  mutate(mod_tidy = map(lm_mod, broom::tidy, quick = T)) %>%
  unnest(mod_tidy) %>%
  head()
#> # A tibble: 6 x 3
#>   CLIENT term        estimate
#>   <fct>  <chr>          <dbl>
#> 1 A      (Intercept)    2.67 
#> 2 A      Q1             0.333
#> 3 A      Q10           NA    
#> 4 A      Q3            -0.333
#> 5 A      Q4            -1.   
#> 6 A      Q5             1.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM