
Find similarity between two dataframes, row by row

I have two dataframes, df1 and df2, with the same columns. I would like to find the similarity between these two datasets. I have been following one of two approaches. The first was to append one of the two dataframes to the other and select the duplicates:

df=pd.concat([df1,df2],join='inner')
mask = df.Check.duplicated(keep=False)

df[mask] # it gives me duplicated rows
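(As a side note, an outer merge with `indicator=True` can serve the same purpose as the concat/`duplicated` trick, and it also tells you which frame each row came from. A minimal sketch with made-up two-row frames:)

```python
import pandas as pd

# Hypothetical sample frames sharing the "Check" column
df1 = pd.DataFrame({"Check": ["large data work flows", "I have two dataframes"]})
df2 = pd.DataFrame({"Check": ["large data work flows", "small data work flows"]})

# indicator=True adds a "_merge" column: "both" marks exact matches,
# "left_only"/"right_only" mark rows unique to one frame
merged = pd.merge(df1, df2, on="Check", how="outer", indicator=True)
exact = merged[merged["_merge"] == "both"]
print(exact["Check"].tolist())  # → ['large data work flows']
```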

The second is to pick a threshold value and, for each row of df1, look for potential matches among the rows of df2.

Sample data (please note that the datasets have different lengths):

For df1

Check
how to join to first row
large data work flows
I have two dataframes
fix grammatical or spelling errors
indent code by 4 spaces
why are you posting here?
add language identifier
my dad loves watching football 

and for df2

Check
small data work flows
I have tried to puzze out an answer
mix grammatical or spelling errors
indent code by 2 spaces
indent code by 8 spaces
put returns between paragraphs
add curry on the chicken curry
mom!! mom!! mom!!
create code fences with backticks
are you crazy? 
Trump did not win the last presidential election

In order to do this, I am using the following function:

def check(df1, thres, col):
    # col is a single string compared against every row of df1.Check
    matches = df1.apply(lambda row: (fuzz.ratio(row['Check'], col) / 100.0) >= thres, axis=1)
    return [df1.Check[i] for i, x in enumerate(matches) if x]

This should allow me to find rows which match.
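(For what it's worth, the missing piece is applying that per-row comparison against every row of the other dataframe. A stdlib-only sketch of the idea, using difflib.SequenceMatcher in place of fuzz.ratio since both return a similar 0–1 ratio; the frames and threshold here are illustrative:)

```python
import pandas as pd
from difflib import SequenceMatcher

def similar(a, b):
    # SequenceMatcher.ratio() plays the role of fuzz.ratio(a, b) / 100.0
    return SequenceMatcher(None, a, b).ratio()

def check_against(df_other, text, thresh):
    # collect all rows of the other frame whose similarity clears the threshold
    return [c for c in df_other.Check if similar(text, c) >= thresh]

df1 = pd.DataFrame({"Check": ["indent code by 4 spaces"]})
df2 = pd.DataFrame({"Check": ["indent code by 2 spaces", "mom!! mom!! mom!!"]})

# one list of matches from df2 per row of df1
df1["sim"] = df1.Check.apply(lambda t: check_against(df2, t, 0.5))
print(df1.sim.iloc[0])  # → ['indent code by 2 spaces']
```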

The problem with the second approach (the one I am most interested in) is that it does not actually compare the two dataframes against each other.

My expected output from the first approach would be two dataframes, one for df1 and one for df2, each with an extra column listing the similar rows found in the other dataframe. From the second code, I should only assign a similarity value to each pair (I should have as many columns as there are rows in the other dataframe).

Please let me know if you need any further information or more code. Maybe there is an easier way to determine this similarity, but unfortunately I have not found it yet.

Any suggestion is welcome.

Expected output:

(This is just an example; since I am setting a threshold, the actual output may change.)

df1

Check                             sim
how to join to first row         []
large data work flows            [small data work flows]
I have two dataframes            []
fix grammatical or spelling errors [mix grammatical or spelling errors]
indent code by 4 spaces          [indent code by 2 spaces, indent code by 8 spaces]
why are you posting here?        []
add language identifier          []
my dad loves watching football   []

df2

Check                             sim
small data work flows                [large data work flows]
I have tried to puzze out an answer   []
mix grammatical or spelling errors    [fix grammatical or spelling errors]
indent code by 2 spaces               [indent code by 4 spaces]
indent code by 8 spaces               [indent code by 4 spaces]
put returns between paragraphs        []
add curry on the chicken curry        []
mom!! mom!! mom!!                     []
create code fences with backticks     []
are you crazy?                        []
Trump did not win the last presidential election    []

I think your fuzzywuzzy solution is pretty good. I've implemented what you're looking for below. Note that this grows as len(df1)*len(df2), so it is pretty memory intensive, but at least it should be reasonably clear. You may find the output of gen_scores useful as well.

import pandas as pd
from fuzzywuzzy import fuzz
from itertools import product

def gen_scores(df1, df2):
    # generates a score for all row combinations between dfs
    df_score = pd.DataFrame(product(df1.Check, df2.Check), columns=["c1", "c2"])
    df_score["score"] = df_score.apply(lambda row: (fuzz.ratio(row["c1"], row["c2"]) / 100.0), axis=1)
    return df_score

def get_matches(df1, df2, thresh=0.5):
    # get all matches above a threshold, appended as list to each df
    df = gen_scores(df1, df2)
    df = df[df.score > thresh]

    matches = df.groupby("c1").c2.apply(list)
    df1 = pd.merge(df1, matches, how="left", left_on="Check", right_on="c1")
    df1 = df1.rename(columns={"c2":"matches"})
    df1.loc[df1.matches.isnull(), "matches"] = df1.loc[df1.matches.isnull(), "matches"].apply(lambda x: [])

    matches = df.groupby("c2").c1.apply(list)
    df2 = pd.merge(df2, matches, how="left", left_on="Check", right_on="c2")
    df2 = df2.rename(columns={"c1":"matches"})
    df2.loc[df2.matches.isnull(), "matches"] = df2.loc[df2.matches.isnull(), "matches"].apply(lambda x: [])
    return (df1, df2)

# call code:
df1_match, df2_match = get_matches(df1, df2, thresh=0.5)

output:

                                               Check                                            matches
0                           how to join to first row                                                 []
1                              large data work flows                            [small data work flows]
2                              I have two dataframes                                                 []
3                 fix grammatical or spelling errors               [mix grammatical or spelling errors]
4                            indent code by 4 spaces  [indent code by 2 spaces, indent code by 8 spa...
5                          why are you posting here?                                   [are you crazy?]
6                            add language identifier                                                 []
7                     my dad loves watching football                                                 []
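If you only want the single best match per row rather than every match above the threshold, the gen_scores table can be reduced with a groupby/idxmax. A minimal sketch of that idea (difflib stands in for fuzz here so the snippet runs without fuzzywuzzy; the two-row frames are illustrative):

```python
import pandas as pd
from difflib import SequenceMatcher
from itertools import product

def gen_scores(df1, df2):
    # score every (df1 row, df2 row) pair; SequenceMatcher.ratio()
    # stands in for fuzz.ratio / 100.0
    df = pd.DataFrame(product(df1.Check, df2.Check), columns=["c1", "c2"])
    df["score"] = df.apply(lambda r: SequenceMatcher(None, r.c1, r.c2).ratio(), axis=1)
    return df

def best_match(df1, df2):
    scores = gen_scores(df1, df2)
    # idxmax per c1 keeps only the highest-scoring df2 candidate
    return scores.loc[scores.groupby("c1").score.idxmax()]

df1 = pd.DataFrame({"Check": ["indent code by 4 spaces"]})
df2 = pd.DataFrame({"Check": ["indent code by 2 spaces", "mom!! mom!! mom!!"]})
print(best_match(df1, df2).c2.tolist())  # → ['indent code by 2 spaces']
```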
