
How to add a pandas column based on partial string match?

I have a pandas dataframe of credit card expenses of various yet-to-be-defined categories (gas, groceries, fast food, etc.).

df1: 

Category   Date         Description                 Cost 
nan        7.1.20       Chipotle Downtown West      $8.23
nan        7.1.20       Break Time - Springfield    $23.57
nan        7.3.20       State Farm - Agent          $94.23
nan        7.3.20       T-Mobile                    $132.42
nan        7.4.20       Venmo -xj8382dzavvd         $8.00
nan        7.6.20       Broadway McDonald's         $11.73
nan        7.8.20       Break Time - Townsville     $44.23

I would like to maintain a second dataframe of keywords that are searched for in the description and used to populate the "Category" column. Something like the following:

df2:

item           category
mcdonald       fast food
state farm     insurance
break time     gas
chipotle       fast food
mobile         cell phone 

The idea here is that I would write code to search for partial strings in df1['Description'] and populate df1['Category'] with the corresponding value in df2['category'].

I'm sure there is a clean and pythonic way to handle this, but the code below is the closest I can get. The erroneous result is that every row of df1['Category'] containing a match is set to the last category in df2 (e.g. in this case, all matching rows would be set to "cell phone").

    for x in df2['item']:
        for y in df2['category']:
            df1['Category'] = np.where(
                        df1['Description'].str.lower().str.contains(x),
                        y,
                        df1['Category'])

Thanks for your help!

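You can do this with map and difflib.get_close_matches from Python's standard library. get_close_matches returns a list of the closest string matches, most similar first, and you can adjust the cutoff parameter (a float between 0 and 1) for more or less sensitivity as needed. Note that, as the output shows, this fills Category with the matched keyword from df2['item'] rather than with its category, so the keyword still has to be looked up in df2 to get the category.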
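To make the return value and the effect of cutoff concrete, here is a minimal standalone sketch. The word list is purely illustrative and is not taken from the data in the question.

```python
import difflib

# Illustrative candidates only; not part of df1 or df2.
candidates = ["ape", "apple", "peach", "puppy"]

# Default cutoff is 0.6: candidates scoring below it are dropped,
# and the survivors come back most similar first.
default_matches = difflib.get_close_matches("appel", candidates)

# A stricter cutoff can leave nothing, so the result may be an empty list.
strict_matches = difflib.get_close_matches("appel", candidates, cutoff=0.9)

print(default_matches)  # ['apple', 'ape']
print(strict_matches)   # []
```

Because the result can be empty, any code that takes element [0] needs to check for that case first. The comparison is also case-sensitive, so lower-casing both sides beforehand may help.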

    import difflib

    # cutoff (0 to 1) sets how similar a keyword must be; lower is more lenient
    def closest_item(description, cutoff=0.3):
        matches = difflib.get_close_matches(description, df2['item'], cutoff=cutoff)
        # get_close_matches may return an empty list, so guard before indexing
        return matches[0] if matches else 'no match'

    df1['Category'] = df1['Description'].map(closest_item)

    print(df1)


    Category    Date    Description                 Cost
0   chipotle    7.1.20  Chipotle Downtown West      $8.23
1   break time  7.1.20  Break Time - Springfield    $23.57
2   state farm  7.3.20  State Farm - Agent          $94.23
3   mobile      7.3.20  T-Mobile                    $132.42
4   no match    7.4.20  Venmo -xj8382dzavvd         $8.00
5   mcdonald    7.6.20  Broadway McDonald's         $11.73
6   break time  7.8.20  Break Time - Townsville     $44.23
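If a plain substring match is enough, the loop from the question can probably be made to work by pairing each keyword with its own category via zip, instead of nesting two loops (the nested version applies every category to every keyword, so the last one wins). A minimal sketch, assuming the sample data from the question; the object dtype for the new column and the "first matching keyword wins" rule are my assumptions, not something the question specifies.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (Date and Cost omitted for brevity).
df1 = pd.DataFrame({
    "Description": [
        "Chipotle Downtown West", "Break Time - Springfield",
        "State Farm - Agent", "T-Mobile", "Venmo -xj8382dzavvd",
        "Broadway McDonald's", "Break Time - Townsville",
    ],
})
df2 = pd.DataFrame({
    "item": ["mcdonald", "state farm", "break time", "chipotle", "mobile"],
    "category": ["fast food", "insurance", "gas", "fast food", "cell phone"],
})

# Lower-case once so the lower-case keywords in df2 can match.
desc = df1["Description"].str.lower()

# Start with an all-missing object column so strings can be assigned into it.
category = pd.Series(np.nan, index=df1.index, dtype=object)

# zip keeps each keyword paired with its own category.
for item, cat in zip(df2["item"], df2["category"]):
    hit = desc.str.contains(item, regex=False)  # literal substring test
    # Only fill rows that have no category yet (first match wins).
    category.loc[hit & category.isna()] = cat

df1["Category"] = category
print(df1)
```

Rows with no matching keyword, such as the Venmo entry, stay NaN and can be filled afterwards with fillna if a placeholder is preferred.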
