
Filter / search across different rows in R, grouped by a specific column

I have a dataset similar to the reprex below, where each subject has more than one row for their hobby, favorite food, and study major.

I am trying to identify, for example, the subjects who have hiking as a hobby and meat as a food (the only one that meets both criteria is subject c in the example below).

Is there a way to do this in dplyr or another package?


dd <- structure(list(
  ID = c("a", "a", "a", "a", "b", "b", "b", "b", "b", "b",
         "c", "c", "c", "c", "c", "c"),
  itemType = c("hobby", "hobby", "study", "food",
               "hobby", "hobby", "study", "study", "food", "food",
               "hobby", "hobby", "study", "study", "study", "food"),
  details = c("hiking, bike", "reading", "math, art", "cheese, bread",
              "writing", "computer", "english", "science", "meat, rice", "cheese",
              "reading", "swimming, hiking", "math, philosophy", "computer",
              "social", "pasta, meat")),
  class = "data.frame", row.names = c(NA, -16L))


If I try a simple dplyr filter like the one below, it of course does not work: it returns no rows. Is there another argument or something else I can add to make it work?

I have never used a database package; would one be useful in this context?

dd %>% 
  filter( str_detect( details, "hiking") &
            str_detect(details, "meat"))

To keep every 'ID' that has both 'hiking' and 'meat' somewhere in 'details', group by 'ID' and apply str_detect for each word separately, wrapping each call in any(). The two conditions can be combined with & or passed to filter as comma-separated arguments.

library(dplyr)
library(stringr)
dd %>%
  group_by(ID) %>%
  filter(any(str_detect(details, 'hiking')), any(str_detect(details, 'meat')))

-output

# A tibble: 6 x 3
# Groups:   ID [1]
#  ID    itemType details         
#  <chr> <chr>    <chr>           
#1 c     hobby    reading         
#2 c     hobby    swimming, hiking
#3 c     study    math, philosophy
#4 c     study    computer        
#5 c     study    social          
#6 c     food     pasta, meat     

Update

To restrict each check to a subgroup, one option is to subset 'details' with == on 'itemType' and apply str_detect only to those elements

dd %>% 
     group_by(ID) %>%
     filter(any(str_detect(details[itemType == 'hobby'], 'hiking')),
            any(str_detect(details[itemType == 'food'], 'meat')))
# A tibble: 6 x 3
# Groups:   ID [1]
#  ID    itemType details         
#  <chr> <chr>    <chr>           
#1 c     hobby    reading         
#2 c     hobby    swimming, hiking
#3 c     study    math, philosophy
#4 c     study    computer        
#5 c     study    social          
#6 c     food     pasta, meat     
 

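Or use base R with ave and grepl. Because 'details' is a character column, ave hands back the per-group result as the strings "TRUE"/"FALSE", which is why it is wrapped in as.logical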

subset(dd, as.logical(ave(details, ID, 
  FUN = function(x) any(grepl('hiking', x)) & any(grepl('meat', x)))))

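The original filter returned no rows because no single element of 'details' contains both 'hiking' and 'meat': & compares element by element, so both conditions would have to hold on the same row. Instead, each condition needs to be reduced with any over the elements of 'details' within each 'ID' before the results are combined with &.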
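The difference between the row-wise and the group-wise check can be seen in a small base R sketch. It rebuilds the example data so that it runs on its own; the helper names (has_hiking, has_meat, row_hits, id_hiking, id_meat, matching_ids) are made up for illustration and are not part of the original answer.

```r
# Rebuild the example data from the question so the sketch is self-contained.
dd <- data.frame(
  ID = rep(c("a", "b", "c"), times = c(4, 6, 6)),
  itemType = c("hobby", "hobby", "study", "food",
               "hobby", "hobby", "study", "study", "food", "food",
               "hobby", "hobby", "study", "study", "study", "food"),
  details = c("hiking, bike", "reading", "math, art", "cheese, bread",
              "writing", "computer", "english", "science", "meat, rice", "cheese",
              "reading", "swimming, hiking", "math, philosophy", "computer",
              "social", "pasta, meat"),
  stringsAsFactors = FALSE
)

# One logical value per row for each search word.
has_hiking <- grepl("hiking", dd$details, fixed = TRUE)
has_meat   <- grepl("meat", dd$details, fixed = TRUE)

# Row-wise check: TRUE only if a single row mentions both words.
row_hits <- has_hiking & has_meat
sum(row_hits)
# 0, so a plain filter() on this condition keeps nothing

# Group-wise check: reduce each condition with any() per ID, then combine.
id_hiking <- tapply(has_hiking, dd$ID, any)
id_meat   <- tapply(has_meat, dd$ID, any)
matching_ids <- names(id_hiking)[as.vector(id_hiking & id_meat)]
matching_ids
# "c"
```

Subject a has hiking but no meat, and subject b has meat but no hiking, so only c survives once both per-ID results are combined.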

