
Replacing the strings in a list/dataframe column in Python by comparison

I have the following dataframe column:

Hotels
Hotel Tulsi Viha
Hotel Tulsi Vih
Hotel Tulsi Vihar
SWIGGYBang
Swiggy
Borivali Biryani center
Borivali Biryani centr

I want to check for string similarity and replace near-duplicates. For example, Hotel Tulsi Vih, Hotel Tulsi Viha, and Hotel Tulsi Vihar are the same, so I want the string "Hotel Tulsi Vihar" to replace the other two. This needs to be done across the entire column.

You can use the fuzzywuzzy library to find similar strings.

For example, suppose you have a main list of the correct names:

from fuzzywuzzy import process, fuzz

main_list = ['Hotel Tulsi Vihar', 'Swiggy', 'Borivali Biryani center']

## Compare a string against the list, as below (limit=3 returns up to three matches)
process.extract("Hotel Tulsi Viha", main_list, limit=3, scorer=fuzz.token_sort_ratio)

## Result -> [('Hotel Tulsi Vihar', 97), ('Swiggy', 18), ('Borivali Biryani center', 15)]
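To see roughly where a score like 97 comes from, here is a minimal sketch that approximates a token-sort comparison using only the standard library's difflib. It is a stand-in for illustration, not fuzzywuzzy's actual implementation; the name token_sort_like is hypothetical.

```python
from difflib import SequenceMatcher


def token_sort_like(a: str, b: str) -> int:
    """Approximate a token-sort similarity score on a 0-100 scale."""
    def normalise(text: str) -> str:
        # Lowercase, split on whitespace, sort the tokens, and rejoin them,
        # so that word order no longer affects the comparison.
        return " ".join(sorted(text.lower().split()))

    ratio = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
    return round(100 * ratio)


# One missing character out of 33 compared characters in total
print(token_sort_like("Hotel Tulsi Viha", "Hotel Tulsi Vihar"))  # 97

# Same words in a different order compare as identical after sorting
print(token_sort_like("Biryani Borivali center", "Borivali Biryani center"))  # 100
```

Because the tokens are sorted before comparing, this kind of scorer tolerates reordered words but still penalises truncated or misspelled ones.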

So, you can apply this function to a dataframe:

## Create/import the dataframe
import pandas as pd

df = pd.DataFrame(['Hotels', 'Hotel Tulsi Viha', 'Hotel Tulsi Vih', 'Hotel Tulsi Vihar',
                   'SWIGGYBang', 'Swiggy', 'Borivali Biryani center', 'Borivali Biryani centr'],
                  columns=['Brand'])

## Example function: return the best match from list_unique if its score
## exceeds the threshold; otherwise return the word unchanged.
def check_fuzzywuzzy(word, list_unique, threshold):
    # extract() returns a list of (match, score) tuples; limit=1 keeps only the best one
    highest_pos = process.extract(word, list_unique, limit=1, scorer=fuzz.token_sort_ratio)
    if highest_pos[0][1] > threshold:
        return highest_pos[0][0]
    else:
        return word

## Apply the function to every value in 'Brand'
df['new_column'] = df['Brand'].apply(lambda x: check_fuzzywuzzy(x, main_list, 60))

# Result
#                    Brand               new_column
#0                   Hotels                   Hotels
#1         Hotel Tulsi Viha        Hotel Tulsi Vihar
#2          Hotel Tulsi Vih        Hotel Tulsi Vihar
#3        Hotel Tulsi Vihar        Hotel Tulsi Vihar
#4               SWIGGYBang                   Swiggy
#5                   Swiggy                   Swiggy
#6  Borivali Biryani center  Borivali Biryani center
#7   Borivali Biryani centr  Borivali Biryani center

Then you can use the new column, or assign the result back to the original 'Brand' column.
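The approach above assumes you already have main_list. If you do not, one option is to derive the representatives from the column itself. The following is a minimal sketch under stated assumptions: it uses the standard library's difflib as a stand-in scorer rather than fuzzywuzzy, it treats the longest spelling in each group as the correct one, and the names similarity and build_mapping are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (difflib stand-in)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_mapping(names, threshold=60):
    """Map each name to the longest sufficiently similar name in the list."""
    canonical = []  # representatives chosen so far
    mapping = {}
    # Visit longer names first so the fullest spelling becomes the representative.
    for name in sorted(set(names), key=lambda s: (-len(s), s)):
        best = max(canonical, key=lambda c: similarity(name, c), default=None)
        if best is not None and similarity(name, best) > threshold:
            mapping[name] = best
        else:
            # Nothing similar enough yet: this name starts a new group.
            canonical.append(name)
            mapping[name] = name
    return mapping


names = ['Hotels', 'Hotel Tulsi Viha', 'Hotel Tulsi Vih', 'Hotel Tulsi Vihar',
         'SWIGGYBang', 'Swiggy', 'Borivali Biryani center', 'Borivali Biryani centr']
mapping = build_mapping(names)
print(mapping['Hotel Tulsi Vih'])         # Hotel Tulsi Vihar
print(mapping['Borivali Biryani centr'])  # Borivali Biryani center
```

With pandas, the resulting dictionary could then be applied with df['Brand'].map(mapping). Note that the "longest wins" rule maps 'Swiggy' to 'SWIGGYBang' rather than the other way round, so a hand-curated main_list may still be preferable when the longest spelling is not the correct one.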

The fuzzywuzzy library offers several scorers for comparing strings, e.g. token_set_ratio and partial_ratio. More information is in this link.
