I have two dataframes, df1 and df2 with the same columns. I would like to find similarity between these two datasets. I have been following one of these two approaches. The first one was to append one of the two dataframes to the other one and selecting duplicates:
df=pd.concat([df1,df2],join='inner')
mask = df.Check.duplicated(keep=False)
df[mask] # it gives me duplicated rows
The second one is to consider a threshold value which, for each row from df1, finds a potential match in rows in df2.
Sample of data: Please note that the datasets have different length
For df1
Check
how to join to first row
large data work flows
I have two dataframes
fix grammatical or spelling errors
indent code by 4 spaces
why are you posting here?
add language identifier
my dad loves watching football
and for df2
Check
small data work flows
I have tried to puzze out an answer
mix grammatical or spelling errors
indent code by 2 spaces
indent code by 8 spaces
put returns between paragraphs
add curry on the chicken curry
mom!! mom!! mom!!
create code fences with backticks
are you crazy?
Trump did not win the last presidential election
In order to do this, I am using the following function:
def check(df1, thres, col):
matches = df1.apply(lambda row: ((fuzz.ratio(row['Check'], col) / 100.0) >= thres), axis=1)
return [df1. Check[i] for i, x in enumerate(matches) if x]
This should allow me to find rows which match.
The problem of the second approach (the one I most interested in) is that it actually does not take into account the two dataframes.
My expected value from the first function would be two dataframes, one for df1 and one for df2, having an extra column which includes the similarity found per each row compared to those in the other dataframe; from the second code, I should only assign a similarity value to them (I should have as many columns as the number of rows).
Please let me know if you need any further information or if you need more code. Maybe there is a easier way to determine this similarity, but unfortunately I have not found it yet.
Any suggestion is welcome.
Expected output:
(it is an example; since I am setting a threshold the output may change)
df1
Check sim
how to join to first row []
large data work flows [small data work flows]
I have two dataframes []
fix grammatical or spelling errors [mix grammatical or spelling errors]
indent code by 4 spaces [indent code by 2 spaces, indent code by 8 spaces]
why are you posting here? []
add language identifier []
my dad loves watching football []
df2
Check sim
small data work flows [large data work flows]
I have tried to puzze out an answer []
mix grammatical or spelling errors [fix grammatical or spelling errors]
indent code by 2 spaces [indent code by 4 spaces]
indent code by 8 spaces [indent code by 4 spaces]
put returns between paragraphs []
add curry on the chicken curry []
mom!! mom!! mom!! []
create code fences with backticks []
are you crazy? []
Trump did not win the last presidential election []
I think your fuzzywuzzy solution is pretty good. I've implemented what you're looking for below. That this will grow as len(df1)*len(df2)
so is pretty memory intensive, but at least should be reasonably clear. You may find the output of gen_scores
useful as well.
from fuzzywuzzy import fuzz
from itertools import product
def gen_scores(df1, df2):
# generates a score for all row combinations between dfs
df_score = pd.DataFrame(product(df1.Check, df2.Check), columns=["c1", "c2"])
df_score["score"] = df_score.apply(lambda row: (fuzz.ratio(row["c1"], row["c2"]) / 100.0), axis=1)
return df_score
def get_matches(df1, df2, thresh=0.5):
# get all matches above a threshold, appended as list to each df
df = gen_scores(df1, df2)
df = df[df.score > thresh]
matches = df.groupby("c1").c2.apply(list)
df1 = pd.merge(df1, matches, how="left", left_on="Check", right_on="c1")
df1 = df1.rename(columns={"c2":"matches"})
df1.loc[df1.matches.isnull(), "matches"] = df1.loc[df1.matches.isnull(), "matches"].apply(lambda x: [])
matches = df.groupby("c2").c1.apply(list)
df2 = pd.merge(df2, matches, how="left", left_on="Check", right_on="c2")
df2 = df2.rename(columns={"c1":"matches"})
df2.loc[df2.matches.isnull(), "matches"] = df2.loc[df2.matches.isnull(), "matches"].apply(lambda x: [])
return (df1, df2)
# call code:
df1_match, df2_match = get_matches(df1, df2, thresh=0.5)
output:
Check matches
0 how to join to first row []
1 large data work flows [small data work flows]
2 I have two dataframes []
3 fix grammatical or spelling errors [mix gramma... [mix grammatical or spelling errors]
4 indent code by 4 spaces [indent code by 2 spaces, indent code by 8 spa...
5 why are you posting here? [are you crazy?]
6 add language identifier []
7 my dad loves watching football []
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.