
Find matches between two string columns in R

To solve a tag migration problem, I need to compare two character columns and determine, row by row, whether they have at least one tag in common.

To sum up, given a dataframe like this:

old_tags            new_tags
burger              burger, american
italian, pizza      italian
latin, peruvian     peruvian, latin
french              pizza

I'd like to add a third column like this one:

old_tags            new_tags            match
burger              burger, american    TRUE
italian, pizza      italian             TRUE
latin, peruvian     peruvian, latin     TRUE
french              pizza               FALSE

So far I have tried functions such as str_match and str_detect without success. They usually return FALSE for pairs of strings that should be TRUE, such as the example in row 3 ([3,]).

Thanks a lot in advance.

One base R approach is to split each string on the comma separator, use Map with intersect to find the tags shared by each pair of rows, and return TRUE when at least one tag intersects.

df$match <- lengths(Map(intersect, strsplit(df$old_tags, ", "), 
                    strsplit(df$new_tags, ", "))) > 0

df
#         old_tags         new_tags match
#1          burger burger, american  TRUE
#2  italian, pizza          italian  TRUE
#3 latin, peruvian  peruvian, latin  TRUE
#4          french            pizza FALSE
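To make the one-liner above easier to follow, here is a minimal sketch that breaks it into its three steps. The intermediate names (old_list, new_list, common) and the extra shared column are only illustrative and are not part of the original answer.

```r
df <- data.frame(
  old_tags = c("burger", "italian, pizza", "latin, peruvian", "french"),
  new_tags = c("burger, american", "italian", "peruvian, latin", "pizza"),
  stringsAsFactors = FALSE
)

# Step 1: split each cell into a character vector of tags
old_list <- strsplit(df$old_tags, ", ")
new_list <- strsplit(df$new_tags, ", ")

# Step 2: pairwise intersection, one list element per row
common <- Map(intersect, old_list, new_list)

# Step 3: a row matches when its intersection is non-empty
df$match <- lengths(common) > 0

# Illustrative extra column: the shared tags themselves ("" when none)
df$shared <- vapply(common, paste, character(1), collapse = ", ")

print(df)
```

Keeping the shared tags around can help when auditing a migration, since it shows why a row was flagged as a match.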

data

df <- structure(list(old_tags = c("burger", "italian, pizza", "latin, peruvian", 
"french"), new_tags = c("burger, american", "italian", "peruvian, latin", 
"pizza")), row.names = c(NA, -4L), class = "data.frame")

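A mixed tidyverse and base R possibility:

library(dplyr)
library(stringr)
library(purrr)   # provides map_chr()

df %>% 
   # turn "italian, pizza" into the regex alternation "italian|pizza"
   mutate(patterns = map_chr(strsplit(old_tags, ", "), paste, collapse = "|"),
          Match = str_detect(new_tags, patterns)) %>% 
   select(-patterns)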
#         old_tags         new_tags Match
#1          burger burger, american  TRUE
#2  italian, pizza          italian  TRUE
#3 latin, peruvian  peruvian, latin  TRUE
#4          french            pizza FALSE
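One caveat with the alternation approach is that a plain regex matches substrings, so a tag can be found inside a longer, unrelated tag. The sketch below uses made-up sample tags (not from the question) and assumes the tags contain no regex metacharacters. It compares the plain pattern with one wrapped in word boundaries.

```r
library(stringr)

# Hypothetical sample data chosen to expose the substring problem
old_tags <- c("burger", "latin, peruvian", "french")
new_tags <- c("cheeseburger", "peruvian, latin", "pizza")

# One alternation per row, e.g. "latin|peruvian"
alternation <- vapply(strsplit(old_tags, ",\\s*"), paste,
                      character(1), collapse = "|")

# Plain alternation: "burger" is found inside "cheeseburger"
loose <- str_detect(new_tags, alternation)

# Word boundaries around a non-capturing group only accept whole words
strict <- str_detect(new_tags, paste0("\\b(?:", alternation, ")\\b"))

print(data.frame(old_tags, new_tags, loose, strict))
```

If tags may contain characters such as + or ., they would need to be escaped before being pasted into a pattern, or a set-based comparison could be used instead.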

Alternatively, we can extract the words from each column with str_extract_all and test for any overlap with any and %in%:

library(tidyverse)
df %>% 
   mutate(match = map2_lgl(str_extract_all(old_tags, "\\w+"), 
               str_extract_all(new_tags, "\\w+"),  ~ any(.x %in% .y)))
#         old_tags         new_tags match
#1          burger burger, american  TRUE
#2  italian, pizza          italian  TRUE
#3 latin, peruvian  peruvian, latin  TRUE
#4          french            pizza FALSE
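Note that "\\w+" tokenizes by word, so a multi-word tag is broken into separate words before the comparison. The following sketch uses hypothetical tags that are not in the question and a hypothetical helper named split_tags. It contrasts word-level matching with splitting on commas only.

```r
library(stringr)

# Hypothetical multi-word tags
old_tags <- c("fast food", "latin, peruvian")
new_tags <- c("street food", "peruvian, latin")

# Word-level tokens: "fast food" becomes c("fast", "food")
by_word <- mapply(function(x, y) any(x %in% y),
                  str_extract_all(old_tags, "\\w+"),
                  str_extract_all(new_tags, "\\w+"))

# Tag-level tokens: split on commas, then trim surrounding whitespace
split_tags <- function(x) lapply(str_split(x, ","), str_trim)
by_tag <- mapply(function(x, y) any(x %in% y),
                 split_tags(old_tags), split_tags(new_tags))

print(data.frame(old_tags, new_tags, by_word, by_tag))
```

Which behaviour is preferable depends on the data: word-level matching is more forgiving, while tag-level matching avoids pairing "fast food" with "street food" merely because both contain "food".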


