Filter or ifelse across multiple columns

I'm researching the lines of communication a patient goes through when they get sick. For example, a person gets sick and goes to the doctor (A), then to the hospital (B), then contacts their insurance (C), and so on. The order differs per patient: one patient goes straight to the hospital, while another first checks with the insurance. We followed patients through the whole process, and after each contact with a different authority we had them fill out another survey. So after each authority ("step") we have a survey score. This gives me the following dataset set-up (in reality it is a very large dataset):

Patient<-c(1,1,1,1,1,1,1,2,2,2,2)
sample6<-c("A","A","A","A","A","A","A","A","A","A","A")
sample5<-c("Stop","B","B","B","B","B","B","Stop","C","C","C")
sample4<-c(NA,"Stop","C","C","C","C","C",NA, "Stop","F","F")
sample3<-c(NA,NA,"Stop","D","D","D","D",NA, NA,"Stop","G")
sample2<-c(NA,NA,NA,"Stop","E","E","E",NA, NA,NA,"Stop")
sample1<-c(NA,NA,NA,NA, "Stop","F","F",NA,NA,NA, NA)
sample0<-c(NA,NA,NA,NA, NA,"Stop","G",NA,NA,NA, NA)
sample00<-c(NA,NA,NA,NA, NA,NA,"Stop",NA,NA,NA, NA)
Score<-c(90,88,65,44,78,98,66,38,93,88,80)
Time<-c("01-01-2018", "02-01-2018", "03-01-2018", "04-01-2018", "05-01-2018", "06-01-2018", "07-01-2018","01-02-2018", "02-02-2018", "05-02-2018", "06-02-2018")

df<-data.frame("Patient"=Patient, "step0"=sample6, "step1"=sample5, "step2"=sample4, "step3"=sample3, "step4"=sample2, 
               "step5"=sample1,"step6"= sample0, "step7"=sample00, "Score"=Score, "Time"=Time)

> df
   Patient step0 step1 step2 step3 step4 step5 step6 step7 Score       Time
1        1     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    90 01-01-2018
2        1     A     B  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    88 02-01-2018
3        1     A     B     C  Stop  <NA>  <NA>  <NA>  <NA>    65 03-01-2018
4        1     A     B     C     D  Stop  <NA>  <NA>  <NA>    44 04-01-2018
5        1     A     B     C     D     E  Stop  <NA>  <NA>    78 05-01-2018
6        1     A     B     C     D     E     F  Stop  <NA>    98 06-01-2018
7        1     A     B     C     D     E     F     G  Stop    66 07-01-2018
8        2     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    38 01-02-2018
9        2     A     C  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    93 02-02-2018
10       2     A     C     F  Stop  <NA>  <NA>  <NA>  <NA>    88 05-02-2018
11       2     A     C     F     G  Stop  <NA>  <NA>  <NA>    80 06-02-2018

So for example: row 1 has the survey score after authority A, row 2 is for the same patient and has the survey score after authority B, etc. Now I want to compare rows that have the same final process. I will take "F" as an example, but it could also be "C" for another analysis. So I want to select all rows that have "F" as the final authority AND the row before each of them, so that I can compare them.

So I want to create this dataset:

   Patient step0 step1 step2 step3 step4 step5 step6 step7 Score       Time Indicator
1        1     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    90 01-01-2018         0
2        1     A     B  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    88 02-01-2018         0
3        1     A     B     C  Stop  <NA>  <NA>  <NA>  <NA>    65 03-01-2018         0
4        1     A     B     C     D  Stop  <NA>  <NA>  <NA>    44 04-01-2018         0
5        1     A     B     C     D     E  Stop  <NA>  <NA>    78 05-01-2018         Before
6        1     A     B     C     D     E     F  Stop  <NA>    98 06-01-2018         After
7        1     A     B     C     D     E     F     G  Stop    66 07-01-2018         0
8        2     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    38 01-02-2018         0
9        2     A     C  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    93 02-02-2018         Before
10       2     A     C     F  Stop  <NA>  <NA>  <NA>  <NA>    88 05-02-2018         After
11       2     A     C     F     G  Stop  <NA>  <NA>  <NA>    80 06-02-2018         0

I did manage to indicate the rows that contain "F" plus the previous:

ProcessColumns <- 2:9
d <- df[,ProcessColumns] == "F"
df$Indicator <- rowSums(d, na.rm = TRUE)
df$Indicator[which(df$Indicator %in% 1) - 1] <- "Before"
df$Indicator[which(df$Indicator %in% 1)] <- "After"

But now it flags ALL the rows containing "F", not just the rows where "F" is the final authority. Can anyone help me?

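We can count the non-NA step cells in each row and compare that count with the patient's maximum. Note that this marks rows by their position relative to the patient's last row; it does not look for "F" itself:

library(dplyr)

df %>% mutate(sum = rowSums(!is.na(.[2:9]))) %>%
  group_by(Patient) %>%
  mutate(max = sum - max(sum),
         Indicator = case_when(max == -2 ~ "Before", max == -1 ~ "After", TRUE ~ as.character(0)))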

# A tibble: 11 x 14
# Groups:   Patient [2]
     Patient step0 step1 step2 step3 step4 step5 step6 step7 Score Time         sum   max Ind   
     <dbl> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <fct>      <dbl> <dbl> <chr> 
 1    1.00 A     Stop  NA    NA    NA    NA    NA    NA     90.0 01-01-2018  2.00 -6.00 0     
 2    1.00 A     B     Stop  NA    NA    NA    NA    NA     88.0 02-01-2018  3.00 -5.00 0     
 3    1.00 A     B     C     Stop  NA    NA    NA    NA     65.0 03-01-2018  4.00 -4.00 0     
 4    1.00 A     B     C     D     Stop  NA    NA    NA     44.0 04-01-2018  5.00 -3.00 0     
 5    1.00 A     B     C     D     E     Stop  NA    NA     78.0 05-01-2018  6.00 -2.00 Before
 6    1.00 A     B     C     D     E     F     Stop  NA     98.0 06-01-2018  7.00 -1.00 After 
 7    1.00 A     B     C     D     E     F     G     Stop   66.0 07-01-2018  8.00  0    0     
 8    2.00 A     Stop  NA    NA    NA    NA    NA    NA     38.0 01-02-2018  2.00 -3.00 0     
 9    2.00 A     C     Stop  NA    NA    NA    NA    NA     93.0 02-02-2018  3.00 -2.00 Before
10    2.00 A     C     F     Stop  NA    NA    NA    NA     88.0 05-02-2018  4.00 -1.00 After 
11    2.00 A     C     F     G     Stop  NA    NA    NA     80.0 06-02-2018  5.00  0    0 

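Update: inspired by @Andre Elrico's answer, paste the step columns together and look for the target authority directly followed by "Stop":

library(dplyr); library(tidyr); library(stringr)

df %>% unite(All, matches("step"), sep = "", remove = FALSE) %>%
       mutate(Ind = str_detect(All, "FStop"), Indicator = case_when(lead(Ind) == TRUE ~ "Before", Ind == TRUE ~ "After", TRUE ~ as.character(0))) %>%
       select(-All, -Ind)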
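
A variant that reads the final authority from each row (the value just before "Stop") may be easier to reuse for other targets such as "C". The following is only a minimal base R sketch, assuming the rows are sorted by patient and time and each row contains at most one "Stop". The helper names `last_authority` and `mark_before_after` and the reduced sample data are made up for illustration.

```r
# Reduced, hypothetical version of the question's data (two patients).
df <- data.frame(
  Patient = c(1, 1, 1, 1, 2, 2, 2),
  step0   = "A",
  step1   = c("Stop", "B", "B", "B", "Stop", "C", "C"),
  step2   = c(NA, "Stop", "F", "F", NA, "Stop", "F"),
  step3   = c(NA, NA, "Stop", "G", NA, NA, "Stop"),
  step4   = c(NA, NA, NA, "Stop", NA, NA, NA),
  Score   = c(90, 88, 65, 44, 38, 93, 88),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the final authority of a row is the value
# in the step column just before the first "Stop".
last_authority <- function(data, step_cols) {
  steps <- as.matrix(data[step_cols])
  unname(apply(steps, 1, function(x) {
    stop_pos <- match("Stop", x)
    if (is.na(stop_pos) || stop_pos < 2) NA_character_ else x[[stop_pos - 1]]
  }))
}

# Hypothetical helper: "After" where the final authority equals `target`,
# "Before" on the preceding row if it belongs to the same patient.
mark_before_after <- function(data, target, step_cols) {
  n <- nrow(data)
  final <- last_authority(data, step_cols)
  after <- !is.na(final) & final == target
  same_patient <- c(data$Patient[-1] == data$Patient[-n], FALSE)
  before <- c(after[-1], FALSE) & same_patient
  indicator <- rep("0", n)
  indicator[after] <- "After"
  indicator[before] <- "Before"
  indicator
}

step_cols <- grep("^step", names(df), value = TRUE)
df$Indicator <- mark_before_after(df, "F", step_cols)
```

Because the check is on the value before "Stop" rather than on the row's position, a patient whose path continues for several steps after "F" is still marked at the right rows. Passing "C" instead of "F" switches the analysis.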

Or you can:

library(dplyr)

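# TRUE for rows whose pasted contents contain "F" immediately followed by "Stop"
After_IND <- df %>% apply(., 1, paste, collapse = "") %>% grepl("FStop", .)
# Shift up by one row so the flag lands on the preceding row; the last row gets FALSE
Before_IND <- lead(After_IND, 1, default = FALSE)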

df$Indicator <- 0
df$Indicator[After_IND]<-"After"
df$Indicator[Before_IND]<-"Before"

#  Patient step0 step1 step2 step3 step4 step5 step6 step7 Score       Time Indicator
#        1     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    90 01-01-2018         0
#        1     A     B  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    88 02-01-2018         0
#        1     A     B     C  Stop  <NA>  <NA>  <NA>  <NA>    65 03-01-2018         0
#        1     A     B     C     D  Stop  <NA>  <NA>  <NA>    44 04-01-2018         0
#        1     A     B     C     D     E  Stop  <NA>  <NA>    78 05-01-2018    Before
#        1     A     B     C     D     E     F  Stop  <NA>    98 06-01-2018     After
#        1     A     B     C     D     E     F     G  Stop    66 07-01-2018         0
#        2     A  Stop  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>    38 01-02-2018         0
#        2     A     C  Stop  <NA>  <NA>  <NA>  <NA>  <NA>    93 02-02-2018    Before
#        2     A     C     F  Stop  <NA>  <NA>  <NA>  <NA>    88 05-02-2018     After
#        2     A     C     F     G  Stop  <NA>  <NA>  <NA>    80 06-02-2018         0

Please note:

If you want to compare a different authority, e.g. B, change the pattern:

... %>% grepl("BStop",.)
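
The pattern can also be built from a parameter instead of being edited by hand. Below is a minimal base R sketch of that idea, under a few assumptions: only the step columns are pasted, a "|" separator is added so that a longer authority name ending in the target is not matched by accident, and matching is literal (`fixed = TRUE`). The function name `flag_final` and the sample data (including the authority "AF") are hypothetical.

```r
# Hypothetical helper: TRUE for rows where `target` is directly followed by "Stop".
flag_final <- function(data, target, step_cols) {
  steps <- as.matrix(data[step_cols])
  # Paste each row with a separator, e.g. "|A|F|Stop|NA"
  path <- paste0("|", apply(steps, 1, paste, collapse = "|"))
  grepl(paste0("|", target, "|Stop"), path, fixed = TRUE)
}

# Made-up example data; "AF" is an invented authority name used to show
# why the separator matters ("AAFStop" would contain "FStop").
steps <- data.frame(
  step0 = c("A", "A", "A"),
  step1 = c("F", "AF", "B"),
  step2 = c("Stop", "Stop", "F"),
  step3 = c(NA, NA, "Stop"),
  stringsAsFactors = FALSE
)

after_f <- flag_final(steps, "F", names(steps))
after_b <- flag_final(steps, "B", names(steps))
```

Here `after_f` flags the first and third rows only, and `after_b` flags none, since "B" is never the final authority in this sample.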

A tidyverse solution with a lot of lines, but it works in general.

library(tidyverse)
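df %>%
  rownames_to_column() %>% 
  # Long format: one row per (original row, step column)
  gather(k, v, -Patient, -rowname, -Score, -Time) %>% 
  group_by(rowname) %>% 
  # Flag every original row that contains "F" anywhere
  mutate(Indicator = ifelse(any(v %in% "F"), "After", NA)) %>% 
  # Back to wide format, restoring the original row order
  spread(k, v) %>% 
  arrange(as.numeric(rowname)) %>% 
  group_by(Patient) %>% 
  # Within each patient keep only the first flagged row
  mutate(Indicator = ifelse(duplicated(Indicator), NA, Indicator)) %>% 
  # The row just before an "After" row becomes "Before"
  mutate(Indicator2 = ifelse(lead(Indicator) == "After", "Before", NA)) %>% 
  mutate(Indicator = ifelse(!is.na(Indicator2), Indicator2, Indicator)) %>% 
  select(Patient, starts_with("step"), Score, Time, Indicator, -Indicator2, -rowname) %>% 
  ungroup()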
# A tibble: 11 x 12
   Patient step0 step1 step2 step3 step4 step5 step6 step7 Score Time       Indicator
     <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <fct>      <chr>    
 1       1 A     Stop  NA    NA    NA    NA    NA    NA       90 01-01-2018 NA       
 2       1 A     B     Stop  NA    NA    NA    NA    NA       88 02-01-2018 NA       
 3       1 A     B     C     Stop  NA    NA    NA    NA       65 03-01-2018 NA       
 4       1 A     B     C     D     Stop  NA    NA    NA       44 04-01-2018 NA       
 5       1 A     B     C     D     E     Stop  NA    NA       78 05-01-2018 Before   
 6       1 A     B     C     D     E     F     Stop  NA       98 06-01-2018 After    
 7       1 A     B     C     D     E     F     G     Stop     66 07-01-2018 NA       
 8       2 A     Stop  NA    NA    NA    NA    NA    NA       38 01-02-2018 NA       
 9       2 A     C     Stop  NA    NA    NA    NA    NA       93 02-02-2018 Before   
10       2 A     C     F     Stop  NA    NA    NA    NA       88 05-02-2018 After    
11       2 A     C     F     G     Stop  NA    NA    NA       80 06-02-2018 NA  
