I am trying to perform string matching between two pandas dataframe.
df_1:
ID Text Text_ID
1 Apple 53
2 Banana 84
3 Torent File 77
df_2:
ID File_name
22 apple_mask.txt
23 melon_banana.txt
24 Torrent.pdf
25 Abc.ppt
Objective: I want to populate the Text_ID
against File_name
in df_2 if the string in df_1['Text'] matches with df_2['File_name']. If no matches found then populate the df_2[
df_1['Text'] matches with df_2['File_name']. If no matches found then populate the df_2[
Text_ID ] as -1. So the resultant
] as -1. So the resultant
df` looks like
ID Flie_name Text_ID
22 apple_mask.txt 53
23 melon_banana.txt 84
24 Torrent.pdf 77
25 Abc.ppt -1
I have tried this SO thread , but it is giving a column where File_name wise fuzz score is listed.
I am trying out a non fuzzy way. Please see below the code snippets:
text_ls = df_1['Text'].tolist()
file_ls = df_2['File_name'].tolist()
text_id = []
for i,j in zip(text_ls,file_ls):
if str(j) in str(i):
t_i = df_1.loc[df_1['Text']==i,'Text_ID']
text_id.append(t_i)
else:
t_i = -1
text_id.append(t_i)
df_2['Text_ID'] = text_id
But I am getting a blank text_id
list.
Can anybody provide some clue on this? I am OK to use fuzzywuzzy as well.
You can get it with the following code:
df2['Text_ID'] = -1 # set -1 by default for all the file names
for _,file_name in df2.iterrows():
for _, text in df1.iterrows():
if text[0].lower() in file_name[0]: # compare strings
df2.loc[df2.File_name == file_name[0],'Text_ID'] = text[1] # assaign the Text_ID from df1 in df2
break
Keep in mind:
apple
and banana
are contained in apple_mask.txt
and melon_banana.txt
, but torrent file
is not in torrent.pdf
. Consider redefining the string comparison.df.iterrows()
returns two values, the index
of the row and the values
of the row, in this case I have replaced the index by _
since it is not necessary to solve this problem result:
df2
File_name Text_ID
0 apple_mask.text 53
1 melon_banana.txt 84
2 Torrent.pdf -1
3 Abc.ppt -1
You can try following code:
text_ls = df_1['Text'].tolist()
file_ls = df_2['File_name'].tolist()
text_id = []
for i,j in zip(text_ls,file_ls):
if j.lower().find(i.lower()) == -1:
t_i = -1
df_2.loc[df_2['File_name']==j,'Text_ID']=t_i
else:
t_i = df_1.loc[df_1['Text']==i,'Text_ID']
df_2.loc[df_2['File_name']==j,'Text_ID']=t_i
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.