How to merge data frames where column1 is substring of column2

I have a data frame df and would like to classify each row based on the value of the column df$name. For the classification I have a two-column data frame tl with the columns tl$name and tl$type. I would like to merge the two data frames on a "like" condition, i.e. grepl(tl$name, df$name), instead of the exact match df$name == tl$name.

I have already tried looping over all rows in df and checking where there is a match in tl, but this is very time-consuming.

E.g.:

df

  name        
# African elephant    
# Indian elephant    
# Silverback gorilla     
# Nile crocodile   
# White shark       

tl

  name        type
# elephant    mammal
# gorilla     mammal
# crocodile   reptile
# shark       fish

Another idea:

library(tidyverse)

df %>%
  separate(name, into = c("t", "name")) %>%
  left_join(tl)

Which gives:

#           t      name    type
#1    African  elephant  mammal
#2     Indian  elephant  mammal
#3 Silverback   gorilla  mammal
#4       Nile crocodile reptile
#5      White     shark    fish

We can strip the leading word with sub by matching one or more non-whitespace characters ( \\S+ ) followed by one or more whitespace characters ( \\s+ ) from the start ( ^ ) of the string, replacing the match with an empty string ( "" ), and then merging the result with the second dataset ( tl ).

merge(transform(df, name = sub("^\\S+\\s+", "", name)), tl)
#      name    type
#1 crocodile reptile
#2  elephant  mammal
#3  elephant  mammal
#4   gorilla  mammal
#5     shark    fish
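
To make the behaviour of that pattern concrete, here is a small sketch on a few sample strings. The vector x and the two-word-prefix entry "Great white shark" are made up for illustration and are not part of the original data. It also shows a variant pattern (an assumption, not taken from the answer) that keeps only the last word.

```r
# Hypothetical sample strings, including one with a two-word prefix
x <- c("African elephant", "Nile crocodile", "shark", "Great white shark")

# Pattern from the answer: drops only the first word and the whitespace after it.
# Strings without whitespace are returned unchanged.
first_dropped <- sub("^\\S+\\s+", "", x)
print(first_dropped)
# "elephant" "crocodile" "shark" "white shark"

# Variant: the greedy ".*" consumes everything up to the last whitespace run,
# so only the final word is kept.
last_word <- sub("^.*\\s+", "", x)
print(last_word)
# "elephant" "crocodile" "shark" "shark"
```

If names can carry more than one leading word, the second pattern is the one that still produces a key that matches tl$name.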

If we need to update the first dataset,

df$type <- with(df, tl$type[match(sub("^\\S+\\s+", "", name), tl$name)])
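
The approaches above depend on the key being a separate word at a fixed position. For the literal "like" condition from the question, a minimal base R sketch could look like the following. It rebuilds the example data, and the helper name first_match is hypothetical. It assumes each name should take the type of the first tl$name it contains, and it treats the keys as fixed strings rather than regular expressions.

```r
# Example data from the question
df <- data.frame(
  name = c("African elephant", "Indian elephant", "Silverback gorilla",
           "Nile crocodile", "White shark"),
  stringsAsFactors = FALSE
)
tl <- data.frame(
  name = c("elephant", "gorilla", "crocodile", "shark"),
  type = c("mammal", "mammal", "reptile", "fish"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: index of the first key contained in `full`, or NA if none
first_match <- function(full, keys) {
  hits <- vapply(keys, function(key) grepl(key, full, fixed = TRUE),
                 logical(1), USE.NAMES = FALSE)
  idx <- which(hits)
  if (length(idx) > 0) idx[1] else NA_integer_
}

# Look up the matching row of tl for every row of df
hit <- vapply(df$name, first_match, integer(1), keys = tl$name, USE.NAMES = FALSE)
df$type <- tl$type[hit]
print(df)
```

This still checks every pair of rows, so it is not faster in principle than an explicit loop, but it does not rely on the position of the key inside the name, and rows without any match get NA.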

I think this is what you want to do:

library(splitstackshape)
df <- cSplit(df, splitCols = "name", sep = " ")

The command above splits the column into two columns named name_1 and name_2.

colnames(df) <- c("prefix", "name")

This renames the columns so that the second one (the animal) is called name, matching the key column in tl.

df_tl <- merge(x = df, y = tl, by = "name", all = TRUE)

This should give you the desired output.
