
Scoring the similarity between two columns in a data frame

I have a data frame with about 47 columns, of which I only need to compare two. For each row, I want to score the similarity between those two columns and return the score in a new column: are the two values the same, or close enough to count as the same? I am not trying to search the data set for a better match, only to score each row as it stands. I am using the FuzzyWuzzy package, but my code keeps raising an error. The code I am using is:

import pandas as pd
from fuzzywuzzy import fuzz

df['score'] = df.apply(fuzz.token_sort_ratio(df['FullAddress_x'].astype(str), df['FullAddress_y'].astype(str)))

The error that I am getting is:

TypeError: ("'int' object is not callable", 'occurred at index LineID_x')

I do not want the Line ID to be considered and it cannot be removed as it is required to link to the original dataset. I only want the columns that are specified to be considered. I am not certain what I am doing wrong. I am also not stuck on having to use this package. I am open to others. I just do not know any others that would do this.

As an example, if I matched 123 Main St. to 123 Main Street, I would want my results to be:

Col 1, Col 2, Score

123 Main St., 123 Main Street, 95

The other similar questions on Stack Overflow have not helped me resolve this. Any assistance would be wonderful. Do let me know if further clarification is needed. Thank you in advance for your time.

Edit 1:

Example Data Set:

LineID.1_x,FullAddress_x,LineID.1_y,FullAddress_y
0,123 main st,540,123 main street
1,258 green st,541,258 green st
2,324 blue st,542,324 purple rd
3,345 red st,543,345 red st
4,349 orange st,544,3456 airport rd

Please note that the example data set is significantly smaller. The data set will also contain Dates, zip codes, and various other forms that I do not want considered. I hope this helps.

Edit 2: Also tried the following code as someone had suggested, but it also resulted in an error. That suggestion was deleted by the user.

df['score'] = df[['FullAddress_x', 'FullAddress_y']].apply(fuzz.token_sort_ratio(df['FullAddress_x'].astype(str), df['FullAddress_y'].astype(str)))

Resulted in the error:

TypeError: ("'int' object is not callable", 'occurred at index FullAddress_x')

Can you try the following?

df['score'] = df.apply(lambda row: fuzz.token_sort_ratio(str(row['FullAddress_x']), str(row['FullAddress_y'])), axis=1)
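
The errors in the question appear to come from calling fuzz.token_sort_ratio once on the two whole columns, which yields a single int, and then handing that int to apply as if it were a function. Scoring row by row means passing apply a function and setting axis=1 so that it receives one row at a time. Below is a minimal sketch of that pattern on the example data from Edit 1. Since you said you are open to other packages, it uses difflib from the standard library as a stand-in scorer; the helper name token_sort_score is hypothetical and its scores will not exactly match FuzzyWuzzy's.

```python
import difflib

import pandas as pd


def token_sort_score(a, b):
    """Hypothetical helper: 0-100 similarity after lowercasing and sorting tokens.

    Uses difflib.SequenceMatcher from the standard library, not FuzzyWuzzy,
    so the numbers are only roughly comparable to fuzz.token_sort_ratio.
    """
    norm_a = " ".join(sorted(str(a).lower().split()))
    norm_b = " ".join(sorted(str(b).lower().split()))
    return round(100 * difflib.SequenceMatcher(None, norm_a, norm_b).ratio())


# Example data copied from Edit 1 of the question.
df = pd.DataFrame({
    "LineID.1_x": [0, 1, 2, 3, 4],
    "FullAddress_x": ["123 main st", "258 green st", "324 blue st",
                      "345 red st", "349 orange st"],
    "LineID.1_y": [540, 541, 542, 543, 544],
    "FullAddress_y": ["123 main street", "258 green st", "324 purple rd",
                      "345 red st", "3456 airport rd"],
})

# axis=1 makes apply pass each row to the lambda; only the two address
# columns are read, so the ID columns stay in the frame but are ignored.
df["score"] = df.apply(
    lambda row: token_sort_score(row["FullAddress_x"], row["FullAddress_y"]),
    axis=1,
)

print(df[["FullAddress_x", "FullAddress_y", "score"]])
```

If FuzzyWuzzy is installed, fuzz.token_sort_ratio can be used in place of token_sort_score inside the lambda; the axis=1 part is what matters.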
