
Return key on fuzzy match of element in dictionary list

I have a dataframe like this:

Date        Cost Category  Vendor
2021-03-22  -              FamilyMart
2021-03-04  -              FAMILY MART
2021-03-14  -              Subway MAIN
2021-03-14  -              OTHER
2021-03-14  -              Transit Authority
2021-03-09  -              Subway local
2021-03-24  -              Seven Eleven
2021-03-14  -              Seven-Eleven

I want to add category tags like this:

Date        Cost Category  Vendor
2021-03-22  Store          FamilyMart
2021-03-04  Store          FAMILY MART
2021-03-14  Dining         Subway MAIN
2021-03-14  -              OTHER
2021-03-14  -              Transit Authority
2021-03-09  Dining         Subway local
2021-03-24  Store          Seven Eleven
2021-03-14  Store          Seven-Eleven

I tried the following, which only returns the matching element from the list rather than a category name:

from fuzzywuzzy import fuzz, process

Store = ['Family Mart', 'Seven Eleven', 'York Mart', 'Tokyu', 'Ministop']
Dining = ['Subway', 'Salad Works']

def fuzz_m(col, cat_list, score_t):
    # Best-scoring element of cat_list for this vendor name, plus its score
    tag, score = process.extractOne(col, cat_list, scorer=score_t)
    # Scores below 51 count as no match
    return '' if score < 51 else tag

df['Cost Category'] = df['Vendor'].apply(fuzz_m, cat_list=Store, score_t=fuzz.ratio)

Date        Cost Category  Vendor
2021-03-22  Family Mart    FamilyMart
2021-03-04  Family Mart    FAMILY MART
2021-03-14  -              Subway MAIN
2021-03-14  -              OTHER
2021-03-14  -              Transit Authority
2021-03-09  -              Subway local
2021-03-24  Seven Eleven   Seven Eleven
2021-03-14  Seven Eleven   Seven-Eleven

What I want to do is use a dictionary in place of cat_list and return the key in Cost Category.

dictionary = {
    'Store': ['Family Mart', 'Seven Eleven', 'York Mart', 'Tokyu', 'Ministop'],
    'Dining': ['Subway', 'Salad Works'],
}

If a value in the Vendor column scores 51 or higher against any element of one of the lists, I want to put that list's key under Cost Category. If the best score is below 51, I want to leave the row as it is.

Is there a feasible approach to achieve this?

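With Series.apply(), fuzz_m() receives one Vendor value at a time, so you can pass the dictionary directly as extractOne(value, dictionary). When the choices are a dictionary, extractOne returns a (match, score, key) tuple, so the category is the third item: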

def fuzz_m(value):
    _, score, tag = process.extractOne(value, dictionary)
    return tag if score > 50 else '-'

df['Cost Category'] = df['Vendor'].apply(fuzz_m)

#          Date  Cost Category             Vendor
# 0  2021-03-22          Store         FamilyMart
# 1  2021-03-04          Store        FAMILY MART
# 2  2021-03-14         Dining        Subway MAIN
# 3  2021-03-14              -              OTHER
# 4  2021-03-14              -  Transit Authority
# 5  2021-03-09         Dining       Subway local
# 6  2021-03-24          Store       Seven Eleven
# 7  2021-03-14          Store       Seven-Eleven
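
If you would rather not rely on how extractOne treats list values inside a dictionary (this may vary between library versions), a more explicit option is to loop over every (key, element) pair yourself and keep the key of the best score. The sketch below is only an illustration: best_category and simple_ratio are hypothetical names, and simple_ratio uses the standard library's difflib as a stand-in scorer so the snippet runs without fuzzywuzzy. Passing scorer=fuzz.ratio should slot in the same way, though the scores may differ slightly.

```python
from difflib import SequenceMatcher

categories = {
    'Store': ['Family Mart', 'Seven Eleven', 'York Mart', 'Tokyu', 'Ministop'],
    'Dining': ['Subway', 'Salad Works'],
}

def simple_ratio(a, b):
    """Stand-in scorer (assumption, not fuzzywuzzy): 0-100, case-insensitive."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())

def best_category(value, categories, scorer=simple_ratio, threshold=51, default='-'):
    """Return the key whose list contains the best-scoring element.

    Falls back to `default` when the best score is below `threshold`.
    """
    best_key, best_score = default, -1
    for key, names in categories.items():
        for name in names:
            score = scorer(value, name)
            if score > best_score:
                best_key, best_score = key, score
    return best_key if best_score >= threshold else default

vendors = ['FamilyMart', 'FAMILY MART', 'Subway MAIN', 'OTHER',
           'Transit Authority', 'Subway local', 'Seven Eleven', 'Seven-Eleven']
tags = [best_category(v, categories) for v in vendors]
```

With the dataframe from the question, the same function could be applied as df['Cost Category'] = df['Vendor'].apply(best_category, categories=categories). Keeping the scorer and threshold as parameters makes it easy to try a different cutoff without touching the loop.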
