
Python: Merging two columns of two different pandas dataframe using string matching

I am trying to perform string matching between two pandas dataframes.

df_1:
ID   Text           Text_ID
1    Apple            53
2    Banana           84
3    Torent File      77

df_2: 
ID   File_name      
22   apple_mask.txt
23   melon_banana.txt
24   Torrent.pdf
25   Abc.ppt

Objective: I want to populate Text_ID against File_name in df_2 when the string in df_1['Text'] matches df_2['File_name']. If no match is found, df_2['Text_ID'] should be set to -1. The resulting df looks like this:

ID   File_name           Text_ID
22   apple_mask.txt        53
23   melon_banana.txt      84
24   Torrent.pdf           77          
25   Abc.ppt               -1

I have tried this SO thread, but it gives a column that lists the fuzz score for each File_name.

I am trying out a non-fuzzy way. Please see the code snippet below:

text_ls = df_1['Text'].tolist()
file_ls = df_2['File_name'].tolist()
text_id = []
for i,j in zip(text_ls,file_ls):
  if str(j) in str(i):
    t_i = df_1.loc[df_1['Text']==i,'Text_ID']
    text_id.append(t_i)
  else:
    t_i = -1
    text_id.append(t_i)
df_2['Text_ID'] = text_id

But I am getting a blank text_id list.

Can anybody provide some clue on this? I am OK to use fuzzywuzzy as well.

You can get it with the following code:

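# df1 and df2 are the question's df_1 and df_2
df2['Text_ID'] = -1    # set -1 by default for all the file names
for _, file_name in df2.iterrows():
    for _, text in df1.iterrows():
        # case-insensitive substring test, using column labels instead of positions
        if text['Text'].lower() in file_name['File_name'].lower():
            # assign the Text_ID from df1 to the matching row of df2
            df2.loc[df2.File_name == file_name['File_name'], 'Text_ID'] = text['Text_ID']
            break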

Keep in mind:

  • String comparison: as written, apple and banana are contained in apple_mask.txt and melon_banana.txt, but torent file is not contained in Torrent.pdf, so that row stays at -1. Consider redefining the string comparison.
  • df.iterrows() yields two values per row: the index of the row and the values of the row. Here the index is assigned to _ because it is not needed to solve this problem.
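One way the string comparison could be redefined is sketched below. It compares individual words with difflib.SequenceMatcher from the standard library rather than fuzzywuzzy. The helper names tokens and loosely_matches and the 0.8 cutoff are assumptions made for this illustration, not part of the original answer, and the cutoff would need tuning on real data.

```python
import re
from difflib import SequenceMatcher


def tokens(s):
    # Lowercase, then split on anything that is not a letter or digit.
    return [t for t in re.split(r'[^0-9a-z]+', s.lower()) if t]


def loosely_matches(text, file_name, cutoff=0.8):
    """Return True if any word of `text` is similar to any word of the
    file name (extension dropped). `cutoff` is an assumed threshold."""
    stem = file_name.rsplit('.', 1)[0]
    for a in tokens(text):
        for b in tokens(stem):
            if SequenceMatcher(None, a, b).ratio() >= cutoff:
                return True
    return False


# Sample values from the question
print(loosely_matches('Torent File', 'Torrent.pdf'))
print(loosely_matches('Apple', 'Abc.ppt'))
```

With a helper like this, the substring test inside the loop could be swapped for a call to loosely_matches, so that a misspelled text such as Torent File can still be paired with Torrent.pdf.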

Result:

df2
          File_name  Text_ID
0    apple_mask.txt       53
1  melon_banana.txt       84
2       Torrent.pdf       -1
3           Abc.ppt       -1

You can try the following code:

text_ls = df_1['Text'].tolist()
file_ls = df_2['File_name'].tolist()
df_2['Text_ID'] = -1  # default when no text matches
for j in file_ls:
    for i in text_ls:
        # case-insensitive substring test: is the text inside the file name?
        if j.lower().find(i.lower()) != -1:
            # .iloc[0] extracts the scalar Text_ID rather than a one-element Series
            t_i = df_1.loc[df_1['Text'] == i, 'Text_ID'].iloc[0]
            df_2.loc[df_2['File_name'] == j, 'Text_ID'] = t_i
            break
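For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch that rebuilds the sample frames from the question and checks every file name against every text. The zip(text_ls, file_ls) loop in the question only pairs rows by position and stops at the shorter list, so Abc.ppt is never visited. The helper name first_text_id and the lookup list are assumptions introduced for this example.

```python
import pandas as pd

# Sample data copied from the question
df_1 = pd.DataFrame({'ID': [1, 2, 3],
                     'Text': ['Apple', 'Banana', 'Torent File'],
                     'Text_ID': [53, 84, 77]})
df_2 = pd.DataFrame({'ID': [22, 23, 24, 25],
                     'File_name': ['apple_mask.txt', 'melon_banana.txt',
                                   'Torrent.pdf', 'Abc.ppt']})


def first_text_id(file_name, lookup, default=-1):
    """Return the Text_ID of the first text contained in file_name
    (case-insensitive), or `default` if none is contained."""
    name = file_name.lower()
    for text, text_id in lookup:
        if text.lower() in name:
            return text_id
    return default


# (Text, Text_ID) pairs to test against each file name
lookup = list(zip(df_1['Text'], df_1['Text_ID']))
df_2['Text_ID'] = df_2['File_name'].map(lambda f: first_text_id(f, lookup))
print(df_2)
```

With plain substring matching, Torrent.pdf still ends up with -1 because "torent file" is not a substring of "torrent.pdf". Getting 77 for that row requires a looser comparison.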

