简体   繁体   中英

Find column value contained in another column R

I have multiple columns of addresses, where they may contain duplicated information (but generally will not have exactly duplicated information).

The following code will provide an example of my issue,

id= c(1, 2)
add1 = c("21ST AVE", "5TH ST")
add2 = c("21ST AVE BLAH ST", "EAST BLAH BLVD")
df = data.frame(id, add1, add2) 
df$combined = paste(add1, add2)
df

This gives the following result,

  id      add1             add2                  combined
1   1 21ST AVE 21ST AVE BLAH ST 21ST AVE 21ST AVE BLAH ST
2   2   5TH ST   EAST BLAH BLVD     5TH ST EAST BLAH BLVD

The conclusion I need is the following,

  id      add1             add2                  combined
1   1 21ST AVE 21ST AVE BLAH ST          21ST AVE BLAH ST
2   2   5TH ST   EAST BLAH BLVD     5TH ST EAST BLAH BLVD

I wish to identify if what's in add1 is contained in add2 . If I find that add2 contains the same information that add1 provides, then I either want to avoid combining those particular column values or delete the repeated information in the combined column (which I believe would require solving a different issue of repeated phrases in a string). I have not been able to find an example of finding column values that are 'contained in' rather than 'exact' - and I'm working with over 500K cases in a dataset where this issue is a common occurrence. Any help is appreciated.

We split the second and third column by one or more space ( \\\\s+ ), then paste the union of the corresponding rows with mapply to create the 'combined'

lst <- lapply(df[2:3], function(x) strsplit(as.character(x), "\\s+"))
df$combined <- mapply(function(x,y) paste(union(x, y), collapse=" "), lst$add1, lst$add2)
df$combined
#[1] "21ST AVE BLAH ST"      "5TH ST EAST BLAH BLVD"

Or another option is gsub

gsub("((\\w+\\s*){2,})\\1", "\\1", do.call(paste, df[2:3]))
#[1] "21ST AVE BLAH ST"      "5TH ST EAST BLAH BLVD"

Here's one way to accomplish this where the ifelse tests whether add1 is in add2 , and if so, then doesn't include it, otherwise it combines them:

id= c(1, 2)
add1 = c("21ST AVE", "5TH ST")
add2 = c("21ST AVE BLAH ST", "EAST BLAH BLVD")
df = data.frame(id, add1, add2, stringsAsFactors = F) 

require(stringr)
require(dplyr)

df %>% mutate(combined = ifelse(str_detect(add2, add1), 
                                add2, 
                                str_c(add1, add2)))

Output:

  id     add1             add2             combined
1  1 21ST AVE 21ST AVE BLAH ST     21ST AVE BLAH ST
2  2   5TH ST   EAST BLAH BLVD 5TH STEAST BLAH BLVD

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM