
Searching for a string within a column for merging in R

I want to search a column in one dataframe for strings taken from a column in a different dataframe, and then merge the two together. For example:

I have this dataframe:

    location
1   2 high street, ca
2   24 long street, ba,UK
3   1 first avenue, ab
4   15 nant peris , ac
5   1 high street
6   second avenue, ca, UK

Then I want to match on this dataset:

   id      block
1  1        ab
2  2        ac
3  3        ab
4  5        cb
5  4        ba
6  2        ca

So I want to search "location" for any value in the column "block", then merge the columns block and id onto the first dataset, so that the merged dataset looks as follows:

    location              id     block
1 2 high street, ca       2       ca
2 24 long street, ba,UK   4       ba
3 1 first avenue, ab      1       ab
4 15 nant peris , ac      2       ac
5 1 high street           NA      NA
6 second avenue, ca,UK    2       ca

Reproducible code:

df1<-data.frame(id = factor(c(1,2,3,5,4,2)), block = c('ab','ac','ab','ca','ba','ca'))
df2<-data.frame(location = c('2 high street, ca','24 long street, ba, UK','1 first avenue, ab', '15 nant peris , ac','1 high street','second avenue, ca, UK'))

Here is one way to do this using the sqldf package:

library(sqldf)
# location lives in df2; id and block live in df1
sql <- "SELECT t1.location, t2.id, t2.block
        FROM df2 t1
        LEFT JOIN df1 t2
            ON t1.location LIKE '%, ' || t2.block OR
               t1.location LIKE '%, ' || t2.block || ',%'"
results <- sqldf(sql)

The sqldf package uses SQLite as its default backend, and here is a link to a running SQLite demo using your data:

Demo
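
For readers who want to check the matching rule without installing sqldf, here is a rough base R sketch of what the two LIKE patterns express: the block must follow a comma and a space, and must be followed by either another comma or the end of the string. The names blocks, like_as_regex and hits are mine, not from the answer, and ignore.case = TRUE is there because SQLite's LIKE ignores ASCII case by default.

```r
# Data from the question
df1 <- data.frame(id = factor(c(1, 2, 3, 5, 4, 2)),
                  block = c("ab", "ac", "ab", "ca", "ba", "ca"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(location = c("2 high street, ca", "24 long street, ba, UK",
                               "1 first avenue, ab", "15 nant peris , ac",
                               "1 high street", "second avenue, ca, UK"),
                  stringsAsFactors = FALSE)

blocks <- unique(df1$block)

# '%, ' || block           -> ", block" at the end of the string
# '%, ' || block || ',%'   -> ", block," anywhere in the string
like_as_regex <- paste0(", ", blocks, "(,|$)")

# One column per block, one row per location; TRUE where the rule matches
hits <- sapply(like_as_regex, grepl, x = df2$location, ignore.case = TRUE)
dimnames(hits) <- list(df2$location, blocks)
hits
```

Every location except "1 high street" matches exactly one block here, which is why the LEFT JOIN leaves that row with NA.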

Solution using a lookup table ltbl (a named vector)

ltbl = 1:4  # lookup Table
names(ltbl) = c('ab','ac','ca','ba')

#ab ac ca ba 
# 1  2  3  4

new<-
do.call(
    rbind,
    apply(df2, 1, function(x) {
        ans <- names(ltbl)[stringr::str_detect(x, paste0("\\b", names(ltbl), "\\b"))]
        cbind.data.frame( id = I(ltbl[ans]), block = I(ans) )[1,]
    })
)


cbind(df2, new)

#                  location id block
#ca       2 high street, ca  3    ca
#ba  24 long street, ba, UK  4    ba
#ab      1 first avenue, ab  1    ab
#ac      15 nant peris , ac  2    ac
#NA           1 high street NA  <NA>
#ca1  second avenue, ca, UK  3    ca
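
The "\\b" word boundaries in the pattern are what keep a short block code from matching inside a longer word. A small sketch of the difference, using base grepl and a made-up address ("1 carlton road" is hypothetical, not from the question):

```r
# Hypothetical addresses; only the first and third contain "ca" as a separate word
addresses <- c("2 high street, ca", "1 carlton road", "second avenue, ca, UK")

# Plain substring search: also hits the "ca" inside "carlton"
plain <- grepl("ca", addresses, fixed = TRUE)

# Word-boundary search: "ca" must stand alone
bounded <- grepl("\\bca\\b", addresses, perl = TRUE)

data.frame(addresses, plain, bounded)
```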

To transform your long block table into that lookup table:

Example, assuming every id can have only one block (Tim has addressed this already):

myLongCrasyBlock <- data.frame(id = factor(c(1:3,1:3)), block = c('ab','ac','ab','ab','ac','ab'))

myLongCrasyBlock <- unique(myLongCrasyBlock)
ltbl             <- `names<-`(myLongCrasyBlock$id, myLongCrasyBlock$block)
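
One caveat with a named-vector lookup: indexing by name returns only the first element with that name, so a block that is tied to several ids silently keeps just one of them. A minimal sketch, assuming the question's df1, that lists such blocks before building the table (the names pairs and ambiguous, and the use of setNames, are my additions):

```r
# df1 from the question
df1 <- data.frame(id = factor(c(1, 2, 3, 5, 4, 2)),
                  block = c("ab", "ac", "ab", "ca", "ba", "ca"),
                  stringsAsFactors = FALSE)

# Distinct (id, block) pairs
pairs <- unique(df1)

# Blocks that appear with more than one id
ambiguous <- unique(pairs$block[duplicated(pairs$block)])
ambiguous

# Named vector: names are blocks, values are ids (kept as character)
ltbl <- setNames(as.character(pairs$id), pairs$block)

# Lookup by name returns the first match only
ltbl["ab"]
```

If ambiguous is not empty, you need to decide which id should win before relying on the lookup.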

I tried to find a solution that needs no extra packages, using your reproducible code.

# Creating dataframes
  df1<-data.frame(id = factor(c(1,2,3,5,4,2)), block = c('ab','ac','ab','ca','ba','ca'))
  df2<-data.frame(location = c('2 high street, ca','24 long street, ba, UK','1 first avenue, ab', '15 nant peris , ac','1 high street','second avenue, ca, UK'))

# Make sure the columns are character (needed on R < 4.0, where strings default to factors)
  df1$block <- as.character(df1$block)
  df2$location <- as.character(df2$location)

# Create the new block column, filled with missing values
  df2$block <- NA_character_

# Loop over the block values
  for (i in seq_along(df1$block)) {

    x <- grep(df1$block[i], df2$location, value = TRUE) # Find location values that contain this block value

    y <- df2[df2$location %in% x,] # Keep only the rows of df2 with those locations

    rowstokeep <- which(rownames(df2) %in% rownames(y)) # Get the row positions of those rows

    df2$block[rowstokeep] <- df1$block[i] # Write the block value into the matching rows
  }

# Merge by "block" to get the id; all.y = TRUE keeps locations with no matching block
  df3 <- merge(df1, df2, by = "block", all.y = TRUE)
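
A note on the merge step: by default merge() keeps only rows whose key appears in both dataframes, so a location without a block drops out of the result unless you ask for it with all.y (or all.x, depending on argument order). A small sketch with made-up two-row tables (ids and locs are hypothetical names, not the question's data):

```r
# Hypothetical miniature versions of the two tables
ids  <- data.frame(id = c(1, 2), block = c("ab", "ca"),
                   stringsAsFactors = FALSE)
locs <- data.frame(location = c("1 first avenue, ab", "1 high street"),
                   block = c("ab", NA),
                   stringsAsFactors = FALSE)

# Default (inner) merge: the location with no block is dropped
inner <- merge(ids, locs, by = "block")

# all.y = TRUE keeps every row of locs; its id becomes NA when nothing matches
outer <- merge(ids, locs, by = "block", all.y = TRUE)

nrow(inner)
nrow(outer)
```

Also keep in mind that a block listed under several ids in the id table produces one output row per id.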

I hope this is useful
