My dataframe consists of scores for different questions asked in a survey, over 3 fiscal years (FY13, FY14 & FY15). The results are presented by Region
.
Here's what a sample of the actual dataframe looks like, where we have two questions per region, asked in different years.
testdf=data.frame(FY=c("FY13","FY14","FY15","FY14","FY15","FY13","FY14","FY15","FY13","FY15","FY13","FY14","FY15","FY13","FY14","FY15"),
Region=c(rep("AFRICA",5),rep("ASIA",5),rep("AMERICA",6)),
QST=c(rep("Q2",3),rep("Q5",2),rep("Q2",3),rep("Q5",2),rep("Q2",3),rep("Q5",3)),
Very.Satisfied=runif(16,min = 0, max=1),
Total.Very.Satisfied=floor(runif(16,min=10,max=120)))
My Objective
For each region, my objective is to identify which question experienced the most significant upward evolution across this 3 year time frame. In order to measure significant upward movements, I have decided to use the slope of regression as a parameter.
The question with the most significant upward evolution within a region over the 3 years time frame will be the one with the steepest positive slope .
Using this logic, I have decided to do the following -
1) For each combination of Region
and QST
, I run the lm
function.
2) I extract the slope for each combination, and store it as a separate variable. Then for each region I filter out the question with the maximum slope value.
My Attempt
Here is my attempt at solving this.
test_final=testdf %>%
group_by(Region,QST) %>%
map(~lm(FY ~ Very.Satisfied, data = .)) %>%
map_df(tidy) %>%
filter(term == 'circumference') %>%
select(estimate) %>%
summarise(Value = max(estimate))
However when I run this I get an error message saying that object
FY
was not found.
Additional requirement
Also I'd like this to work only for questions that have at least 2 consecutive years of data for comparison. But I'm unable to figure out how to factor this condition into my code.
Any help with this would be greatly appreciated.
This doesn't do the "at least two consecutive years" part, but it does the "get the question with the largest slope" part:
library(dplyr)
test_final = testdf %>%
mutate(FY.num = as.numeric(gsub("FY", "", FY))) %>%
group_by(Region, QST) %>%
mutate(lm_slope = lm(Very.Satisfied ~ FY.num)$coefficients[["FY.num"]]) %>%
ungroup() %>%
group_by(Region) %>%
filter(lm_slope == max(lm_slope))
Here is a similar version with filtering by group size/contiguity (had written it by the time you posted so figured I might as well go ahead).
library(tidyverse)
set.seed(42)
testdf=data.frame(FY=c("FY13","FY14","FY15","FY14","FY15","FY13","FY14","FY15","FY13","FY15","FY13","FY14","FY15","FY13","FY14","FY15"),
Region=c(rep("AFRICA",5),rep("ASIA",5),rep("AMERICA",6)),
QST=c(rep("Q2",3),rep("Q5",2),rep("Q2",3),rep("Q5",2),rep("Q2",3),rep("Q5",3)),
Very.Satisfied=runif(16,min = 0, max=1),
Total.Very.Satisfied=floor(runif(16,min=10,max=120)))
test_final <- testdf %>%
group_by(Region,QST) %>% # group by region
mutate(numdate = as.numeric(str_remove(FY, "FY"))) %>%
filter(n() >= 2 & max(diff(numdate)) < 2) %>% # filter out singleton groups
mutate(slopes = coef(lm(Very.Satisfied~numdate))[2])
test_final %>% select(Region, QST, slopes)
#> # A tibble: 14 x 3
#> # Groups: Region, QST [5]
#> Region QST slopes
#> <fct> <fct> <dbl>
#> 1 AFRICA Q2 -0.314
#> 2 AFRICA Q2 -0.314
#> 3 AFRICA Q2 -0.314
#> 4 AFRICA Q5 -0.189
#> 5 AFRICA Q5 -0.189
#> 6 ASIA Q2 -0.192
#> 7 ASIA Q2 -0.192
#> 8 ASIA Q2 -0.192
#> 9 AMERICA Q2 0.238
#> 10 AMERICA Q2 0.238
#> 11 AMERICA Q2 0.238
#> 12 AMERICA Q5 0.342
#> 13 AMERICA Q5 0.342
#> 14 AMERICA Q5 0.342
test_final %>% group_by(Region) %>%
summarise(Value = max(slopes),
Top_Question = QST[which.max(slopes)])
#> # A tibble: 3 x 3
#> Region Value Top_Question
#> <fct> <dbl> <fct>
#> 1 AFRICA -0.189 Q5
#> 2 AMERICA 0.342 Q5
#> 3 ASIA -0.192 Q2
Created on 2019-01-21 by the reprex package (v0.2.1)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.