Finding Matches Across Char Vectors in R

Given the two vectors below, is there a way to produce the desired data frame? This represents a real-world situation in which I have two data frames: the first contains a column of database values (keys), and the second contains a column of 1000+ rows, each a file name (potentials), which I need to match against the keys. The problem is that multiple files (potentials) can match any given key. I have worked with grep, merge, inner join, etc., but was unable to combine them into one solution. Any advice is appreciated!

potentials <- c("tigerINTHENIGHT",
            "tigerWALKINGALONE",
            "bearOHMY",
            "bearWITHME",
            "rat",
            "imatchnothing")
keys <- c("tiger",
            "bear",
            "rat")


desired <- data.frame(keys, c("tigerINTHENIGHT, tigerWALKINGALONE", "bearOHMY, bearWITHME", "rat"))
names(desired) <- c("key", "matches")

Pseudocode for what I think of as the solution:

#new column which is comma separated potentials
# x being the substring length i.e. x = 4 means true if first 4 letters match
function createNewColumn(keys, potentials, x){
  str result = na
  foreach(key in keys){
    if(substring(key, 0, x) == any(substring(potentials, 0, x))){ //search entire potential vector
      result += potential that matched + ', '
    }
  }
  return new column with result as the value on the current row
}

We can write a small function that loops over the keys and extracts the matches for each one:

return_matches <- function(keys, potentials, fixed = TRUE) {
  vapply(keys, function(k) {
    paste(grep(k, potentials, value = TRUE, fixed = fixed), collapse = ", ")
  }, FUN.VALUE = character(1))
}

vapply is a type-safe version of sapply: with FUN.VALUE = character(1) it either returns a character vector or throws an error, never anything else. Setting fixed = TRUE makes grep treat each key as a literal string, which is usually a lot faster but means regular expressions are no longer recognised. Then we can easily make the desired data.frame:

df <- data.frame(
  key = keys,
  matches = return_matches(keys, potentials),
  stringsAsFactors = FALSE
)
df
#>         key                            matches
#> tiger tiger tigerINTHENIGHT, tigerWALKINGALONE
#> bear   bear               bearOHMY, bearWITHME
#> rat     rat                                rat

The reason for putting the loop in a function instead of running it directly is just to make the code look cleaner.
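Note that grep matches a key anywhere in the string, whereas the pseudocode in the question only compares the first x characters. A minimal sketch of that prefix-only variant follows, using base R's startsWith and substr. The helper name match_by_prefix, its x argument, and the extra "pirate" entry are assumptions added for illustration and are not part of the original data.

```r
# Hypothetical helper: match only on the first x characters of each key,
# following the idea in the question's pseudocode.
match_by_prefix <- function(keys, potentials, x = nchar(keys)) {
  x <- rep_len(x, length(keys))  # allow a single x or one x per key
  out <- vapply(seq_along(keys), function(i) {
    prefix <- substr(keys[i], 1, x[i])
    paste(potentials[startsWith(potentials, prefix)], collapse = ", ")
  }, FUN.VALUE = character(1))
  names(out) <- keys
  out
}

# "pirate" is added here only to show the difference from grep.
potentials <- c("tigerINTHENIGHT", "tigerWALKINGALONE", "bearOHMY",
                "bearWITHME", "rat", "imatchnothing", "pirate")
keys <- c("tiger", "bear", "rat")

# Prefix matching skips "pirate" for the key "rat" ...
match_by_prefix(keys, potentials)

# ... while a literal grep finds "rat" inside "pirate" as well.
grep("rat", potentials, value = TRUE, fixed = TRUE)
```

Whether substring or prefix matching is the right choice depends on how the file names are built; if a key can legitimately appear in the middle of a file name, the grep version is the better fit.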

You can iterate over the keys using grep:

> Match <- sapply(keys, function(item) {
+   paste0(grep(item, potentials, value = TRUE), collapse = ", ")
+ })

> data.frame(keys, Match, row.names = NULL)
   keys                              Match
1 tiger tigerINTHENIGHT, tigerWALKINGALONE
2  bear               bearOHMY, bearWITHME
3   rat                                rat
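One practical difference between the sapply and vapply approaches shows up at the edges. The sketch below uses a cut-down potentials vector and a hypothetical helper collapse_matches (both assumptions for illustration) to compare what each returns for an empty keys vector, and what a key with no matches produces.

```r
potentials <- c("tigerINTHENIGHT", "bearOHMY", "rat")

# Hypothetical helper: collapse all potentials containing key k.
collapse_matches <- function(k) {
  paste(grep(k, potentials, value = TRUE, fixed = TRUE), collapse = ", ")
}

no_keys <- character(0)

# sapply cannot simplify an empty result and hands back a list,
# while vapply still returns a character vector.
is.list(sapply(no_keys, collapse_matches))
is.character(vapply(no_keys, collapse_matches, FUN.VALUE = character(1)))

# A key with no matches yields an empty string, not NA.
no_match <- vapply("lion", collapse_matches, FUN.VALUE = character(1))
no_match == ""
```

If NA is preferred for keys without matches, the empty strings can be replaced afterwards, for example with `matches[matches == ""] <- NA`.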
