
Replace similar strings in a column with the same string

I have a Pandas dataframe that collects the names of vendors at which a transaction was made. As this data is automatically collected from bank statements, many of the vendor names are similar but not quite identical. In short, I want to replace the different variations of a vendor's name with a single name.

I think I can work out a way to do it (see below), but I'm a beginner and this seems to me like it's a complex problem. I'd be really interested to see how more experienced coders would approach it.

I have a dataframe like this (in real life, it's about 20 columns and a maximum of around 50 rows):

     Groceries            Car                Luxuries
0    Sainsburys           Texaco wst453      Amazon
1    Sainsburys bur       Texaco east        Firebox Ltd
2    Sainsbury's east     Shell wstl         Sony
3    Tesco                Shell p/stn        Sony ent nrk
4    Tescos ref 657       Texac              Amazon EU
5    Tesco 45783          Moto               Amazon marketplace

I'd like to find the similar entries and replace them with the first instance of those entries, so I'd end up with this:

     Groceries            Car                Luxuries
0    Sainsburys           Texaco wst453      Amazon
1    Sainsburys           Texaco wst453      Firebox Ltd
2    Sainsburys           Shell wstl         Sony
3    Tesco                Shell wstl         Sony
4    Tesco                Texaco wst453      Amazon
5    Tesco                Moto               Amazon

My solution might be far from optimal. I was thinking of sorting alphabetically, then going through pairwise and using something like SequenceMatcher from difflib to compare each pair of vendors. If the similarity is above a certain percentage (I expect to play with this value until I'm happy), the two vendors will be assumed to be the same. I'm concerned that I might be using a sledgehammer to crack a nut, or that it might take a long time (I'm not obsessed with performance, but equally I don't want to wait hours for the result).
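The sort-then-compare idea above can be sketched with the standard library alone. This is only a minimal sketch under assumptions: the 0.9 threshold, the sample vendor list, and the helper name similarity are made up for illustration and would need tuning against real data.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.9  # assumed cut-off, not a recommended value

# Hypothetical sample data: vendor names with the trailing reference text removed.
vendors = sorted(["Tesco", "Sainsburys", "Sainsbury's", "Tescos", "Shell"])

# After sorting, near-duplicates tend to sit next to each other,
# so only neighbouring pairs are compared here.
similar_pairs = [
    (a, b)
    for a, b in zip(vendors, vendors[1:])
    if similarity(a, b) >= THRESHOLD
]
print(similar_pairs)
```

Trailing reference text lowers the ratio, so full strings such as "Texaco wst453" and "Texaco east" may fall below the threshold. Comparing only the leading word is one way around that. With a few dozen rows per column, the pairwise comparison should finish almost instantly.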

Really interested to hear people's thoughts on this problem!

At first glance the problem doesn't seem complicated, but it is.

I used the fuzzywuzzy string-similarity package to decide which strings should be replaced. It scores similarity based on Levenshtein distance, and I used 90% as the threshold. Only the first word of each string is used for the comparison. Here is my code:

import pandas
from fuzzywuzzy import fuzz

# Replace strings that are at least 90% similar
def func(input_list):
    for count, item in enumerate(input_list):
        rest_of_input_list = input_list[:count] + input_list[count + 1:]
        new_list = []
        for other_item in rest_of_input_list:
            similarity = fuzz.ratio(item, other_item)
            if similarity >= 90:
                new_list.append(item)
            else:
                new_list.append(other_item)
        input_list = new_list[:count] + [item] + new_list[count :]
                
    return input_list

df = pandas.read_csv('input.txt') # Read data from csv
result = []
for column in list(df):
    column_values = list(df[column])
    first_words = [x[:x.index(" ")] if " " in x else x for x in column_values]
    result.append(func(first_words))
    
new_df = pandas.DataFrame(result).transpose() 
new_df.columns = list(df)

print(new_df)

Output:

     Groceries    Car Luxuries
0  Sainsbury's  Texac   Amazon
1  Sainsbury's  Texac  Firebox
2  Sainsbury's  Shell     Sony
3        Tesco  Shell     Sony
4        Tesco  Texac   Amazon
5        Tesco   Moto   Amazon

UPDATE:

A more readable version of func, which produces the same result:

def func(input_list):
    for i in range(len(input_list)):
        for j in range(i + 1, len(input_list)):  # only look at later items
            if fuzz.ratio(input_list[i], input_list[j]) >= 90:
                input_list[i] = input_list[j] # Keep the last encountered item
                # Use following line to keep the first encountered item
                # input_list[j] = input_list[i]
    return input_list  # the caller appends the returned list to result
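The question asked for the first full name to be kept (for example "Texaco wst453" rather than "Texac"). The following is a hedged sketch of one way to do that. It is not the code above: it swaps fuzzywuzzy for difflib.SequenceMatcher from the standard library, and the helper names first_word and keep_first, as well as the 0.9 threshold, are assumptions for illustration.

```python
from difflib import SequenceMatcher

def first_word(name: str) -> str:
    """Lower-cased first word of a name, used as the comparison key."""
    parts = name.split()
    return parts[0].lower() if parts else ""

def keep_first(values, threshold=0.9):
    """Replace each value with the first earlier value whose key is similar."""
    canonical = []  # full names kept so far, in order of first appearance
    result = []
    for value in values:
        key = first_word(value)
        # Look for an already-kept name whose first word is close enough.
        match = next(
            (c for c in canonical
             if SequenceMatcher(None, key, first_word(c)).ratio() >= threshold),
            None,
        )
        if match is None:
            canonical.append(value)  # first time this vendor is seen
            match = value
        result.append(match)
    return result

car = ["Texaco wst453", "Texaco east", "Shell wstl",
       "Shell p/stn", "Texac", "Moto"]
print(keep_first(car))
```

To apply it to a dataframe, something like a loop over df.columns that assigns keep_first(df[col].tolist()) back to df[col] should work. The result depends on row order, since whichever variant appears first becomes the name that is kept.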
