R: Extract matching string in dataframe column

I have a dataframe and a set of keywords. I want to add a column to the dataframe listing whichever keywords appear in each row, and a second column listing the keywords that do not appear.

keyword <- c('yellow','blue','red','green','purple')

My dataframe:

colour id
blue A234
blue,black A5
yellow A6
blue,green,purple A7

What I hope to get is a dataframe like this:

colour id match non-match
blue A234 blue yellow,red,green,purple
blue,green A5 blue,green yellow,red,purple
yellow A6 yellow blue,red,green,purple
blue,green,purple A7 blue,green,purple yellow,red

I tried this to get the match column:

df %>% mutate(match = str_extract(paste(keyword,collapse="|"), tolower(colour)))

but it only worked for the first and third rows, not the second and fourth. I would appreciate any help with this, and also with getting a column of unmatched strings.

Put each colour on its own row with separate_rows, splitting on the comma. Then, for each id, use intersect to find the keywords that match and setdiff to find the ones that do not.

library(dplyr)
keyword <- c('yellow','blue','red','green','purple')

df %>%
  tidyr::separate_rows(colour, sep = ',\\s*') %>%
  group_by(id) %>%
  summarise(match = toString(intersect(keyword, colour)), 
            non_match = toString(setdiff(keyword, colour)), 
            colour = toString(colour))

#  id    match               non_match                  colour             
#* <chr> <chr>               <chr>                      <chr>              
#1 A234  blue                yellow, red, green, purple blue               
#2 A5    blue                yellow, red, green, purple blue, black        
#3 A6    yellow              blue, red, green, purple   yellow             
#4 A7    blue, green, purple yellow, red                blue, green, purple

data

df <- structure(list(colour =c("blue","blue,black", "yellow", "blue,green,purple"
), id = c("A234", "A5", "A6", "A7")),class = "data.frame",row.names = c(NA, -4L))
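To see what intersect and setdiff contribute here, a minimal base R sketch on a single value from the colour column may help. The variable names colours, matched and unmatched are only for illustration and are not part of the answer above.

```r
keyword <- c('yellow','blue','red','green','purple')

# Split one comma-separated entry into its individual colours
colours <- strsplit("blue,black", ",", fixed = TRUE)[[1]]

# Keywords that appear among the colours ("black" is not a keyword, so it drops out)
matched <- intersect(keyword, colours)

# Keywords that do not appear among the colours
unmatched <- setdiff(keyword, colours)

toString(matched)    # "blue"
toString(unmatched)  # "yellow, red, green, purple"
```

Both results follow the order of keyword, because keyword is passed as the first argument.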

Here is a base R solution. We can use apply row-wise, splitting each comma-separated colour string into a vector, and then use %in% to find which keywords do not appear in it.

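# Keep only the colours that are keywords (copying df$colour would also keep non-keywords)
df$match <- apply(df, 1, function(x) {
    paste(intersect(strsplit(x[1], ",", fixed=TRUE)[[1]], keyword), collapse=",")
})
# Keywords that are absent from the row's colours
df$non_match <- apply(df, 1, function(x) {
    paste(keyword[!keyword %in% strsplit(x[1], ",", fixed=TRUE)[[1]]], collapse=",")
})
df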

             colour   id             match               non_match
1              blue A234              blue yellow,red,green,purple
2        blue,green   A5        blue,green       yellow,red,purple
3            yellow   A6            yellow   blue,red,green,purple
4 blue,green,purple   A7 blue,green,purple              yellow,red

Data:

keyword <- c('yellow','blue','red','green','purple')
df <- data.frame(colour=c("blue", "blue,green", "yellow", "blue,green,purple"),
                 id=c("A234", "A5", "A6", "A7"), stringsAsFactors=FALSE)
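As for why the attempt in the question matched only some rows: str_extract takes the string to search first and the pattern second, so that call searched the keyword string for each whole colour entry, which can only succeed when the entry is a single keyword. In addition, str_extract returns only the first match. Below is a rough base R sketch of the regex route, searching the colour column for all keywords. It uses the question's data (including "blue,black"), and the names pattern, x and found are only illustrative. With stringr, str_extract_all should play the same role as the regmatches/gregexpr pair.

```r
keyword <- c('yellow','blue','red','green','purple')
df <- data.frame(colour = c("blue", "blue,black", "yellow", "blue,green,purple"),
                 id = c("A234", "A5", "A6", "A7"), stringsAsFactors = FALSE)

# One alternation pattern; \\b keeps "red" from matching inside e.g. "darkred"
pattern <- paste0("\\b(", paste(keyword, collapse = "|"), ")\\b")

# Search the colour column for the keyword pattern, collecting every match per row
x <- tolower(df$colour)
found <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Collapse the matches, and the keywords left over, back into strings
df$match <- sapply(found, paste, collapse = ",")
df$non_match <- sapply(found, function(m) paste(setdiff(keyword, m), collapse = ","))
df
```

Unlike the intersect-based versions, match here follows the order in which the colours appear in each row rather than the order of keyword.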
