
Match multiple substrings from another list of all possible substrings in R

I received some great feedback on my previous post, but I believe my original question was not entirely clear, so the answers did not produce the desired outcome.

I have a long character vector with about 600K observations and 800 unique string values. I am trying to reduce these 800 unique strings to about 20, based on a second vector of important keywords.

Here is an example:

col1 <- c("CORE_I5-xxxx_6C_VPRO", "A6-xxxx_MB", "CORE_I7-xxxx_4C_VPRO_MB", "INTEL_CORE_I3_MB", NA)
col2 <- c("CORE_I5_VPRO", NA, "CORE_I7_VPRO", "INTEL_CORE_I3", NA)

The new column (col2) is derived from the old column (col1) by retaining only the substrings that appear in the following character vector (V):

V <- c("CORE", "INTEL", "I5", "I7", "I3", NA)

I have tried the following code, but it returns only the first matching substring for each observation rather than all of them.

library(stringr)
col2 <- str_extract(col1, paste(V, collapse="|"))

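I have also tried the suggestions from my previous post, but unfortunately I am not getting the desired output. Thank you all for the help.

Here we create x by stripping the leading digits and the underscores from S, then use grepl to keep the elements that contain any of the keywords in V: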

library(stringr)

x <- str_replace_all(str_remove(S, '(\\d+\\_)'), '\\_', '')

x[grepl(paste0(V, collapse = "|"), x)]
[1] "INTELI5VPRO" "COREdfds"    "VPROLI9" 
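Note that subsetting with grepl drops the non-matching elements, so the result is shorter than the input. If the result has to stay aligned with the original vector (for example as a new column), one option is to put NA in the non-matching positions instead. The following is a minimal base R sketch of that idea, not part of the original answer; it rebuilds x with sub/gsub rather than stringr, and the name aligned is only illustrative.

```r
S <- c("123_INTEL_I5_VPRO", "531_CORE_dfds", "93_RAYZEN_29dad", "452_VPROL_I9", NA)
V <- c("INTEL", "CORE", "VPRO", "I5", "I9")

# Same cleaning as above, in base R: drop the leading digits, then all underscores
x <- gsub("_", "", sub("^[0-9]+_", "", S), fixed = TRUE)

# grepl() returns one TRUE/FALSE per element (FALSE for NA input)
has_keyword <- grepl(paste(V, collapse = "|"), x)

# Keep the vector the same length as S: NA where no keyword was found
aligned <- ifelse(has_keyword, x, NA_character_)
aligned
#> [1] "INTELI5VPRO" "COREdfds"    NA            "VPROLI9"     NA
```

This still keeps the whole cleaned string (e.g. "COREdfds"), so it filters rather than extracts.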

You can follow your original approach, but using str_extract_all() and sapply(), like this:

sapply(str_extract_all(S, paste(V, collapse = "|")),paste0, collapse="")

Output

[1] "INTELI5VPRO" "CORE"        ""            "VPROI9"      "NA"         

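Or you can split each string on "_" and keep only the pieces found in V:

lapply(S, \(s) {
    x = strsplit(s, "_")[[1]]                    # split the string into tokens on "_"
    result = paste0(x[x %in% V], collapse="")    # keep only the tokens that are in V
    ifelse(result=="", NA_character_, result)    # nothing kept: return NA instead of ""
}) %>% unlist()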

Output

[1] "INTELI5VPRO" "CORE"        NA            "I9"          NA  
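The fourth element differs from the regex-based output ("I9" rather than "VPROI9") because splitting on "_" compares whole tokens, while a regex also finds "VPRO" inside "VPROL". A small base R sketch of that difference follows; the helper name keep_tokens is hypothetical, and it uses vapply instead of lapply plus unlist so that the result type is fixed.

```r
S <- c("123_INTEL_I5_VPRO", "531_CORE_dfds", "93_RAYZEN_29dad", "452_VPROL_I9", NA)
V <- c("INTEL", "CORE", "VPRO", "I5", "I9")

# Substring matching: "VPRO" is found inside the token "VPROL"
substring_hit <- grepl("VPRO", S[4], fixed = TRUE)

# Whole-token matching: "VPROL" is not equal to "VPRO"
token_hit <- "VPRO" %in% strsplit(S[4], "_", fixed = TRUE)[[1]]

# Hypothetical helper: keep only the tokens that are exactly in `keywords`
keep_tokens <- function(x, keywords, sep = "") {
  vapply(strsplit(x, "_", fixed = TRUE), function(tok) {
    kept <- tok[tok %in% keywords]
    if (length(kept) == 0) NA_character_ else paste(kept, collapse = sep)
  }, character(1))
}

keep_tokens(S, V)
#> [1] "INTELI5VPRO" "CORE"        NA            "I9"          NA
```

Which behaviour is wanted depends on the data: whole-token matching is stricter, and it would miss keywords that are glued to other characters within a token.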

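Calling str_extract_all once per keyword gives a matrix of hits, with one row per string and one column per keyword. Concatenating the hits in each row almost gives the desired result; only the third item is "" instead of NA, so it is replaced afterwards. Note that the pieces appear in the order of V rather than in their order within the string (e.g. "INTELVPROI5").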

library(stringr)

S = c('123_INTEL_I5_VPRO', '531_CORE_dfds', '93_RAYZEN_29dad', '452_VPROL_I9', NA)
V = c('INTEL','CORE', 'VPRO', 'I5', 'I9')


matches <- sapply(V, function (x) str_extract_all(S, x))
result <- apply(matches, 1, function(x) str_flatten(unlist(x))) # concatenate rows
result[result == ""] <- NA
result
#> [1] "INTELVPROI5" "CORE"        NA            "VPROI9"      NA

Created on 2022-06-30 by the reprex package (v2.0.1)

You'd want to use str_extract_all and take care of the empty extractions like the one in position 3 (based on your code):

sapply(str_extract_all(S, paste(V, collapse = "|")),
       function(x) ifelse(length(x) != 0, str_flatten(x), NA)
       )

#> [1] "INTELI5VPRO" "CORE"        NA            "VPROI9"      NA           
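For reference, the same idea can be applied to the col1 example from the question, joining the hits with "_" so the output has the shape of the desired col2. This is a hedged sketch in base R (regmatches with gregexpr stands in for str_extract_all); the function name extract_keywords is hypothetical, and "VPRO" is added to the keyword list as an assumption, since the desired col2 contains it although the question's V does not list it.

```r
col1 <- c("CORE_I5-xxxx_6C_VPRO", "A6-xxxx_MB", "CORE_I7-xxxx_4C_VPRO_MB",
          "INTEL_CORE_I3_MB", NA)
# Assumption: "VPRO" is added here; the NA entry of the question's V is dropped
keywords <- c("CORE", "INTEL", "I5", "I7", "I3", "VPRO")

# Hypothetical helper: extract every keyword hit and join the hits with `sep`
extract_keywords <- function(x, keywords, sep = "_") {
  keywords <- keywords[!is.na(keywords)]        # an NA keyword would become the text "NA"
  pattern <- paste(keywords, collapse = "|")
  hits <- regmatches(x, gregexpr(pattern, x))   # all matches per element, in string order
  out <- vapply(hits, function(h) {
    if (length(h) == 0) NA_character_ else paste(h, collapse = sep)
  }, character(1))
  out[is.na(x)] <- NA_character_                # keep missing inputs missing
  out
}

col2 <- extract_keywords(col1, keywords)
col2
#> [1] "CORE_I5_VPRO"  NA              "CORE_I7_VPRO"  "INTEL_CORE_I3" NA
```

With 600K observations but only about 800 unique values, it may be cheaper to run the extraction on unique(col1) and map the results back with match(), although that has not been measured here.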
