R - Finding partial matches in strings between two vectors

I'm working with a few sets of data in a company where certain groups write their project codes slightly differently than others.

For example, Group A uses a 5-character project code such as C106A, while Group B uses a longer code such as HD-01 C106A 00. Note that the short code appears inside the long one.

What I'm trying to do is group all this project data together, and a crucial step in the data clean up is fixing these codes so that I can group by them. I have a library of all the long codes because Group B has more regulations placed on them, so I can kind of count on their data.

I would like my code to search Group B's library using each string in Group A's data and, when it finds a match in Group B, replace the value in Group A's data with it. I've been playing with stringr's str_detect function, but I can't get it to work.

Portion of Group A's Project Codes:

Project.Code
C106A
C117A
C254A
C342A
C365A
C371A
C391A
C397A
C397B
C397C
C399A
C400A
C404A
C405A
C414A
C417A

Portion of Group B's Library:

Project.Code
HP-C3651001
HP-C3651003
HP-C3651009
HP-C3651P00
HP-C365A000
HP-C365B000
HP-C3421001
HP-C3421002
HP-C3421003
HP-C3421P00
HP-C342A000
HP-C1061001
HP-C1061011
HP-C1061013
HP-C1061016
HP-C1061P00
HP-C106A000

Something like this makes sense and works:

str_detect(GroupA$Project.Code,"C365A")

but I can't seem to do this:

str_detect(GroupA$Project.Code,GroupB$Project.Code)

One option is to paste Group A's codes into a single pattern string, with | acting as the regex OR, and test each Group B code against it:

library(stringr)
str_detect(GroupB$Project.Code, str_c(GroupA$Project.Code, collapse ="|"))
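The call above only reports which Group B codes contain some Group A code; it does not say which one. A minimal sketch of how the same alternation pattern could be turned into a lookup table, written in base R (grepl and paste standing in for str_detect and str_c) and using a shortened subset of the sample codes. The names pattern, hit, short and lookup are illustrative, not from the original post:

```r
grpA <- c("C106A", "C117A", "C342A", "C365A")
grpB <- c("HP-C1061001", "HP-C106A000", "HP-C342A000",
          "HP-C365A000", "HP-C365B000")

# One alternation pattern: "C106A|C117A|C342A|C365A"
pattern <- paste(grpA, collapse = "|")

# TRUE for each Group B code that contains any Group A code
hit <- grepl(pattern, grpB)

# The short code found inside each matching long code
# (regmatches drops the non-matching elements, so this lines up with grpB[hit])
short <- regmatches(grpB, regexpr(pattern, grpB))

# Named vector: names are short codes, values are the long codes
lookup <- setNames(grpB[hit], short)
```

With this in place, lookup["C342A"] returns the long code for that project. Passing the two vectors straight to str_detect does not do this, because the function pairs strings and patterns element by element instead of comparing every code with every other code.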

I'm not sure whether this is what you want, but the following base R code pairs each Group A code with the Group B code that contains it:

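# For each Group A code, find the positions in grpB that contain it; drop codes with no match
idx <- Filter(length, sapply(grpA, function(x) grep(x, grpB)))
# One row per matched code (assumes each code matches at most one entry in grpB)
df <- data.frame(grpA_idx = match(names(idx), grpA), grpA_code = names(idx), grpB_code = grpB[unlist(idx)])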

which gives:

> df
  grpA_idx grpA_code   grpB_code
1        1     C106A HP-C106A000
2        4     C342A HP-C342A000
3        5     C365A HP-C365A000
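To go from this table to the replacement the question asks for, one possible sketch is shown below. It assumes that a Group A code with no match should keep its original value and that the first match wins when several long codes contain the same short code. The column name Long.Code and the helper vector first_hit are hypothetical, and the data is a shortened subset of the sample codes:

```r
grpA <- c("C106A", "C117A", "C342A", "C365A")
grpB <- c("HP-C1061001", "HP-C106A000", "HP-C342A000",
          "HP-C365A000", "HP-C365B000")

# For each short code, take the first long code that contains it,
# or NA when nothing matches (fixed = TRUE: literal text, not a regex)
first_hit <- unname(vapply(grpA, function(x) {
  i <- grep(x, grpB, fixed = TRUE)
  if (length(i) > 0) grpB[i[1]] else NA_character_
}, character(1)))

GroupA <- data.frame(Project.Code = grpA, stringsAsFactors = FALSE)

# Use the long code where one was found; otherwise keep the short code
GroupA$Long.Code <- ifelse(is.na(first_hit), GroupA$Project.Code, first_hit)
```

Writing the result to a new column keeps the original codes available for checking the unmatched rows before anything is overwritten.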

DATA

grpA <- c("C106A", "C117A", "C254A", "C342A",
          "C365A", "C371A", "C391A", "C397A", "C397B", "C397C", "C399A",
          "C400A", "C404A", "C405A", "C414A", "C417A")
grpB <- c("HP-C1061001", "HP-C1061011", "HP-C1061013", "HP-C1061016",
          "HP-C1061P00", "HP-C106A000", "HP-C3421001", "HP-C3421002", "HP-C3421003",
          "HP-C3421P00", "HP-C342A000", "HP-C3651001", "HP-C3651003", "HP-C3651009",
          "HP-C3651P00", "HP-C365A000", "HP-C365B000")
