
R: Capturing regression slopes by group in a dataframe

My dataframe consists of scores for different questions asked in a survey over three fiscal years (FY13, FY14 and FY15). The results are presented by Region.

Here's a sample of what the actual dataframe looks like, with two questions per region, asked in different years.

testdf=data.frame(FY=c("FY13","FY14","FY15","FY14","FY15","FY13","FY14","FY15","FY13","FY15","FY13","FY14","FY15","FY13","FY14","FY15"),
              Region=c(rep("AFRICA",5),rep("ASIA",5),rep("AMERICA",6)),
              QST=c(rep("Q2",3),rep("Q5",2),rep("Q2",3),rep("Q5",2),rep("Q2",3),rep("Q5",3)),
              Very.Satisfied=runif(16,min = 0, max=1),
              Total.Very.Satisfied=floor(runif(16,min=10,max=120)))

My Objective

For each region, my objective is to identify which question showed the strongest upward trend over this three-year period. To measure upward movement, I have decided to use the regression slope as the metric.

The question with the strongest upward trend within a region over the three years will be the one with the steepest positive slope.

Using this logic, I have decided to do the following:

1) For each combination of Region and QST, I run the lm function.

2) I extract the slope for each combination and store it as a separate variable. Then, for each region, I keep the question with the maximum slope value.
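The two steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: the `scores` data frame and its values are made up (only the column names come from the question), and the fiscal-year label is converted to a number so that it can serve as the predictor.

```r
# Toy data; column names match the question, values are invented.
scores <- data.frame(
  FY = rep(c("FY13", "FY14", "FY15"), times = 4),
  Region = rep(c("AFRICA", "ASIA"), each = 6),
  QST = rep(rep(c("Q2", "Q5"), each = 3), times = 2),
  Very.Satisfied = c(0.10, 0.20, 0.30,   # AFRICA Q2
                     0.50, 0.70, 0.90,   # AFRICA Q5
                     0.30, 0.50, 0.70,   # ASIA Q2
                     0.20, 0.25, 0.30),  # ASIA Q5
  stringsAsFactors = FALSE
)
scores$year <- as.numeric(sub("FY", "", scores$FY))  # "FY13" -> 13

# Step 1: one regression per Region/QST combination, keeping only the slope.
groups <- split(scores, list(scores$Region, scores$QST), drop = TRUE)
slopes <- do.call(rbind, lapply(groups, function(g) {
  data.frame(Region = g$Region[1], QST = g$QST[1],
             slope = coef(lm(Very.Satisfied ~ year, data = g))[["year"]])
}))

# Step 2: within each region, keep the question with the largest slope.
top <- do.call(rbind, lapply(split(slopes, slopes$Region), function(r) {
  r[which.max(r$slope), ]
}))
print(top)
```

Note that `which.max` returns the first maximum, so ties within a region are resolved by row order in this sketch.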

My Attempt

Here is my attempt at solving this.

test_final=testdf %>%   
group_by(Region,QST) %>% 
map(~lm(FY ~ Very.Satisfied, data = .)) %>%
map_df(tidy) %>%
filter(term == 'circumference') %>%
select(estimate) %>% 
summarise(Value = max(estimate))

However, when I run this I get an error message saying that object FY was not found.
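A likely cause of that error: a data frame (grouped or not) is a list of columns, so list-style iteration over it visits one column at a time rather than one group at a time. Each piece handed to lm is then a bare vector with no FY column inside it. The base R sketch below illustrates the difference using `lapply` and `split`; the `toy` data frame is made up for the example.

```r
# Toy data with the same column names as the question (values are invented).
toy <- data.frame(
  FY = c("FY13", "FY14", "FY15", "FY13", "FY14", "FY15"),
  Region = "AFRICA",
  QST = rep(c("Q2", "Q5"), each = 3),
  Very.Satisfied = c(0.2, 0.4, 0.6, 0.9, 0.8, 0.7),
  stringsAsFactors = FALSE
)

# Iterating over the data frame itself visits its columns, not its groups.
visited <- names(lapply(toy, class))
print(visited)

# Splitting by the grouping columns yields one data frame per group,
# each of which still contains every column.
groups <- split(toy, list(toy$Region, toy$QST), drop = TRUE)
print(names(groups))
print(names(groups[[1]]))
```

Two further points about the attempt, independent of the iteration issue: FY holds text labels such as "FY13", so it would need converting to a number and placing on the right-hand side of the formula (the score is the response), and the `term == 'circumference'` filter refers to a term that does not exist in this model.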

Additional requirement

I'd also like this to work only for questions that have at least 2 consecutive years of data for comparison, but I'm unable to figure out how to factor this condition into my code.
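One possible reading of this condition is "the question has data for at least one pair of back-to-back fiscal years". Under that assumption, the check could be sketched as a small helper; `has_consecutive_years` is a hypothetical name introduced here, not something from an existing package.

```r
# Hypothetical helper: TRUE when the labels contain at least one pair of
# back-to-back fiscal years (e.g. FY13 and FY14).
has_consecutive_years <- function(fy) {
  years <- sort(unique(as.numeric(sub("FY", "", fy))))  # "FY13" -> 13
  length(years) >= 2 && any(diff(years) == 1)
}

has_consecutive_years(c("FY13", "FY14", "FY15"))  # all three years present
has_consecutive_years(c("FY13", "FY15"))          # gap in FY14
has_consecutive_years("FY14")                     # single year only
```

A stricter reading, where every gap between observed years must be exactly one, would replace `any` with `all`. With only three fiscal years in the data, both readings accept and reject the same groups.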

Any help with this would be greatly appreciated.

This doesn't do the "at least two consecutive years" part, but it does the "get the question with the largest slope" part:

library(dplyr)
test_final = testdf %>%
  mutate(FY.num = as.numeric(gsub("FY", "", FY))) %>%
  group_by(Region, QST) %>%
  mutate(lm_slope = lm(Very.Satisfied ~ FY.num)$coefficients[["FY.num"]]) %>%
  ungroup() %>%
  group_by(Region) %>%
  filter(lm_slope == max(lm_slope))

Here is a similar version that also filters by group size and contiguity (I had written it by the time you posted, so I figured I might as well go ahead).

library(tidyverse)
set.seed(42)
testdf=data.frame(FY=c("FY13","FY14","FY15","FY14","FY15","FY13","FY14","FY15","FY13","FY15","FY13","FY14","FY15","FY13","FY14","FY15"),
                  Region=c(rep("AFRICA",5),rep("ASIA",5),rep("AMERICA",6)),
                  QST=c(rep("Q2",3),rep("Q5",2),rep("Q2",3),rep("Q5",2),rep("Q2",3),rep("Q5",3)),
                  Very.Satisfied=runif(16,min = 0, max=1),
                  Total.Very.Satisfied=floor(runif(16,min=10,max=120)))

test_final <- testdf %>%   
  group_by(Region,QST) %>% # group by region and question
  mutate(numdate = as.numeric(str_remove(FY, "FY"))) %>% # "FY13" -> 13
  filter(n() >= 2 & max(diff(numdate)) < 2) %>% # drop singleton groups and groups with a gap between years
  mutate(slopes = coef(lm(Very.Satisfied~numdate))[2])
test_final %>% select(Region, QST, slopes)
#> # A tibble: 14 x 3
#> # Groups:   Region, QST [5]
#>    Region  QST   slopes
#>    <fct>   <fct>  <dbl>
#>  1 AFRICA  Q2    -0.314
#>  2 AFRICA  Q2    -0.314
#>  3 AFRICA  Q2    -0.314
#>  4 AFRICA  Q5    -0.189
#>  5 AFRICA  Q5    -0.189
#>  6 ASIA    Q2    -0.192
#>  7 ASIA    Q2    -0.192
#>  8 ASIA    Q2    -0.192
#>  9 AMERICA Q2     0.238
#> 10 AMERICA Q2     0.238
#> 11 AMERICA Q2     0.238
#> 12 AMERICA Q5     0.342
#> 13 AMERICA Q5     0.342
#> 14 AMERICA Q5     0.342

test_final %>% group_by(Region) %>% 
  summarise(Value = max(slopes),
            Top_Question = QST[which.max(slopes)])
#> # A tibble: 3 x 3
#>   Region   Value Top_Question
#>   <fct>    <dbl> <fct>       
#> 1 AFRICA  -0.189 Q5          
#> 2 AMERICA  0.342 Q5          
#> 3 ASIA    -0.192 Q2

Created on 2019-01-21 by the reprex package (v0.2.1)
