
Find similar phrases in a column in Pandas

I am using Pandas to work on a DataFrame that has a column containing company names.

Several variants of each company name appear in the column. df is an example:

df = pd.DataFrame({'id':['a','b','c','d','e','f'],'company name':['name1', ' Name1 LTD', 'name1, ltd.','name 1 LT.D.',' name2 p.p.c', 'name2 ppc.']})

I was wondering if there is a simple way to find similar names and assign a unique id to them. For the example above, I would like to get something like:

dg = pd.DataFrame({'id':['a','a','a','a','e','e'],'company name':['name1', ' Name1 LTD', 'name1, ltd.','name 1 LT.D.',' name2 p.p.c', 'name2 ppc.']})

Thanks,

One of the things I've done is use a regex or a function that processes the raw strings and strips out extras like "ltd" and arbitrary special characters. Then create a processed string that serves as the "true name" and build an index of ids based on that "true name".

Or you can use fuzzywuzzy to score the similarity between two strings, build a candidate set of matches, and build an index of unique names based on the match scores (a sketch of that route follows the cleaning helper below).

i.e.

def clean_str(x):
    # normalize a raw company name into a comparable "true name"
    x2 = x.lower()                             # lowercase to standardize
    x2 = x2.replace('.', '').replace(',', '')  # drop punctuation
    x2 = x2.replace(' ltd', '').strip()        # drop a common suffix and surrounding whitespace
    return x2
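
The fuzzy-matching route could look something like the sketch below. This is only an illustration, assuming the fuzzywuzzy package is installed; the assign_group_ids helper and the 80-point threshold are arbitrary choices you would tune on real data.

from fuzzywuzzy import fuzz

def assign_group_ids(names, threshold=80):
    # give the same integer id to names whose similarity score clears the threshold
    canonical = []   # one cleaned representative per group found so far
    ids = []
    for name in names:
        cleaned = clean_str(name)
        for i, rep in enumerate(canonical):
            if fuzz.token_sort_ratio(cleaned, rep) >= threshold:
                ids.append(i)                  # close enough to an existing group
                break
        else:
            canonical.append(cleaned)          # start a new group
            ids.append(len(canonical) - 1)
    return ids

df['group_id'] = assign_group_ids(df['company name'])

Names that clear the threshold join an existing group; everything else starts a new one, so a single pass over the column gives every row a group id.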

I feel your problem, like any programming problem, needs to be broken down into smaller pieces. I will break it down step by step, as I understood it and as I would approach it.

Step 1. Clean (make uniform) your company name values. It is worth reading up on data cleansing and why it is important.

Step 2. Map your id based on the unique company names (this step is easy once Step 1 is done).


import pandas as pd

df = pd.DataFrame({'id':['a','b','c','d','e','f'],'company_name':['name1', ' Name1 LTD', 'name1, ltd.','name 1 LT.D.',' name2 p.p.c', 'name2 ppc.']})

Step 1.
Clean it up using str.extract with a regex.
Keep in mind that the regex below only covers the small sample you provided; you may need to adjust the pattern for your full dataset.

df['new_company_name'] = (df['company_name']
                         .str.lower()                              # lowercase to standardize
                         .str.replace(r' |,', '', regex=True)      # remove spaces and commas; this may vary for your full dataset
                         .str.extract(r'^(\w+\d)', expand=False))  # extract the vital part of the name; this will also vary based on your data

print(df)

  id  company_name new_company_name
0  a         name1            name1
1  b     Name1 LTD            name1
2  c   name1, ltd.            name1
3  d  name 1 LT.D.            name1
4  e   name2 p.p.c            name2
5  f    name2 ppc.            name2

Step 2.
It is advisable to use numerical ids rather than strings, for performance.
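
As a rough illustration of one facet of that (memory), here is a quick comparison using memory_usage(deep=True); the numbers in the comments are approximate and depend on your data:

s_str = pd.Series(['name1'] * 1_000_000)   # object dtype, one Python str per row
s_int = pd.Series([0] * 1_000_000)         # int64 dtype, a fixed 8 bytes per row
print(s_str.memory_usage(deep=True))       # tens of MB
print(s_int.memory_usage(deep=True))       # roughly 8 MB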

Option 1: using groupby() with ngroup()

df['new_id'] = df.groupby('new_company_name').ngroup()

Option 2: using zip(), dict(), and then map()

unique_names = df.new_company_name.unique()
mapper = dict(zip(unique_names, range(len(unique_names))))  # name -> integer id
df['new_id'] = df['new_company_name'].map(mapper)

Both options give the same result:

print(df)

  id  company_name new_company_name  new_id
0  a         name1            name1       0
1  b     Name1 LTD            name1       0
2  c   name1, ltd.            name1       0
3  d  name 1 LT.D.            name1       0
4  e   name2 p.p.c            name2       1
5  f    name2 ppc.            name2       1

Hope this helps.
