I have a text column in df1 and a text column in df2. The two DataFrames have different lengths. I want to calculate the cosine similarity of every entry in df1[text] against every entry in df2[text] and get a score for every pair.
sample input
df1
mahesh
suresh
df2
surendra
mahesh
shrivatsa
suresh
maheshwari
sample output
mahesh surendra 30
mahesh mahesh 100
mahesh shrivatsa 20
mahesh suresh 60
mahesh maheshwari 80
suresh surendra 70
suresh mahesh 60
suresh shrivatsa 40
suresh suresh 100
suresh maheshwari 30
I was getting key errors when I tried to match these two columns for similarity using a TF-IDF approach, because the columns have different lengths. Is there any other way to solve this problem? Any help would be greatly appreciated. I have searched a lot and found that in almost all cases people compare the first document against the rest of the documents in the same corpus. Here the task is to compare every document of corpus 1 with every document of corpus 2.
There are many different string distance measures. I'm not sure how best to use cosine similarity for this case, but I suggest looking into the strsim library.
I'll give you an example of how I would approach the issue using the Jaro-Winkler metric, which is best suited for short strings.
Also, I'm including my attempt to use cosine similarity, based on the example from the documentation of that library.
It could be completely wrong, but it should give you a general idea of how to build a DataFrame from the Cartesian product of two columns of different lengths, as well as how to apply strsim's algorithms to data stored in a pd.DataFrame.
Data preparation :
import pandas as pd
from similarity.jarowinkler import JaroWinkler
from similarity.cosine import Cosine
df1 = pd.DataFrame({
    "name": ["mahesh", "suresh"]
})
df2 = pd.DataFrame({
    "name": ["mahesh", "surendra", "shrivatsa", "suresh", "maheshwari"]
})
# Cartesian product of the two columns: one row per (df1 name, df2 name) pair
df = pd.MultiIndex.from_product(
    [df1["name"], df2["name"]], names=["col1", "col2"]
).to_frame(index=False)
returns:
col1 col2
0 mahesh mahesh
1 mahesh surendra
2 mahesh shrivatsa
3 mahesh suresh
4 mahesh maheshwari
5 suresh mahesh
6 suresh surendra
7 suresh shrivatsa
8 suresh suresh
9 suresh maheshwari
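The same all-pairs table can also be built without pandas. As a minimal sketch using only the standard library (the names `names1`, `names2`, and `pairs` are illustrative and not part of the original answer), `itertools.product` yields every combination in the same order as the table above:

```python
from itertools import product

names1 = ["mahesh", "suresh"]
names2 = ["mahesh", "surendra", "shrivatsa", "suresh", "maheshwari"]

# Every (col1, col2) combination; the two lists may have different lengths.
pairs = list(product(names1, names2))

print(len(pairs))  # 2 * 5 pairs
print(pairs[:2])
```

If a DataFrame is needed afterwards, pd.DataFrame(pairs, columns=["col1", "col2"]) should give an equivalent frame.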
Jaro-Winkler :
jarowinkler = JaroWinkler()
df["jarowinkler_sim"] = [jarowinkler.similarity(i,j) for i,j in zip(df["col1"],df["col2"])]
returns:
col1 col2 jarowinkler_sim
0 mahesh mahesh 1.0
1 mahesh surendra 0.4305555555555555
2 mahesh shrivatsa 0.5185185185185185
3 mahesh suresh 0.6666666666666666
4 mahesh maheshwari 0.9466666666666667
5 suresh mahesh 0.6666666666666666
6 suresh surendra 0.8333333333333334
7 suresh shrivatsa 0.611111111111111
8 suresh suresh 1.0
9 suresh maheshwari 0.48888888888888893
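The question asks for scores on a 0 to 100 scale rather than 0 to 1. A minimal sketch of the rescaling, using three of the Jaro-Winkler values printed above (the `rows`, `scored`, and `best` names are illustrative, not from the original answer):

```python
# (col1, col2, jarowinkler_sim) copied from the output above
rows = [
    ("suresh", "mahesh", 0.6666666666666666),
    ("suresh", "surendra", 0.8333333333333334),
    ("suresh", "maheshwari", 0.48888888888888893),
]

# Rescale each similarity from [0, 1] to an integer percentage.
scored = [(a, b, round(100 * s)) for a, b, s in rows]

# Highest-scoring candidate among these rows.
best = max(scored, key=lambda r: r[2])
print(scored)
print(best)
```

On the DataFrame itself, the equivalent would be something like (df["jarowinkler_sim"] * 100).round().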
Cosine similarity :
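# Cosine(2) compares strings by their character 2-grams (shingles of length 2)
cosine = Cosine(2)
# Build an n-gram profile for each string, then compare the profiles pairwise
df["p0"] = df["col1"].apply(cosine.get_profile)
df["p1"] = df["col2"].apply(cosine.get_profile)
df["cosine_sim"] = [cosine.similarity_profiles(p0,p1) for p0,p1 in zip(df["p0"],df["p1"])]
# drop() returns a new DataFrame without the helper columns; df itself is unchanged
df.drop(["p0", "p1"], axis=1)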
returns:
col1 col2 cosine_sim
0 mahesh mahesh 0.9999999999999998
1 mahesh surendra 0.0
2 mahesh shrivatsa 0.15811388300841897
3 mahesh suresh 0.3999999999999999
4 mahesh maheshwari 0.7453559924999299
5 suresh mahesh 0.3999999999999999
6 suresh surendra 0.5070925528371099
7 suresh shrivatsa 0.15811388300841897
8 suresh suresh 0.9999999999999998
9 suresh maheshwari 0.29814239699997197
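To make the cosine column less of a black box: the values above are consistent with treating each string as a bag of overlapping character 2-grams and taking the cosine between the two count vectors. For example, "mahesh" and "suresh" each have 5 bigrams and share 2 of them ("es" and "sh"), giving 2/5 = 0.4. The following is a standard-library sketch under that assumption, not strsim's actual implementation; `bigram_profile` and `cosine_similarity` are hypothetical helper names.

```python
from collections import Counter
from itertools import product
from math import sqrt

def bigram_profile(s, k=2):
    """Count the overlapping k-character substrings of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def cosine_similarity(p, q):
    """Cosine of the angle between two count vectors stored as Counters."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0

names1 = ["mahesh", "suresh"]
names2 = ["mahesh", "surendra", "shrivatsa", "suresh", "maheshwari"]

# One score per (df1 entry, df2 entry) pair; the list lengths need not match.
scores = {
    (a, b): cosine_similarity(bigram_profile(a), bigram_profile(b))
    for a, b in product(names1, names2)
}

for (a, b), s in scores.items():
    print(a, b, round(100 * s))
```

Because every string is profiled independently and the pairs come from a Cartesian product, nothing here depends on the two columns having the same length.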