I have been following the code at this link to compute a similarity measure between the inputs X and Y:
import numpy as np

def similarity(X, Y, method):
    # np.asmatrix replaces np.mat, which was removed in NumPy 2.0
    X = np.asmatrix(X)
    Y = np.asmatrix(Y)
    N1, M = np.shape(X)
    N2, M = np.shape(Y)
    method = method[:3].lower()
    if method == 'smc':  # SMC (simple matching coefficient)
        X, Y = binarize(X, Y)
        sim = ((X * Y.T) + ((1 - X) * (1 - Y).T)) / M
    else:
        raise ValueError('unsupported method: ' + method)
    return sim

def binarize(X, Y=None):
    ''' Force binary representation of the matrix, according to X>median(X) '''
    if Y is None:  # "Y == None" compares elementwise when Y is a matrix
        X = np.matrix(X)
        Xmedians = np.ones((np.shape(X)[0], 1)) * np.median(X, 0)
        Xflags = X > Xmedians
        X[Xflags] = 1; X[~Xflags] = 0
        return X
    else:
        X = np.matrix(X); Y = np.matrix(Y)
        # column-wise median over the rows of X and Y stacked together
        XYmedian = np.median(np.bmat('X; Y'), 0)
        Xmedians = np.ones((np.shape(X)[0], 1)) * XYmedian
        Xflags = X > Xmedians
        X[Xflags] = 1; X[~Xflags] = 0
        Ymedians = np.ones((np.shape(Y)[0], 1)) * XYmedian
        Yflags = Y > Ymedians
        Y[Yflags] = 1; Y[~Yflags] = 0
        return [X, Y]
However, it assumes that the inputs X and Y are N1 × M and N2 × M matrices, respectively. I am confused about how to convert my input, which consists of variable-length sentences, into the required format.
Also, I would be grateful if someone could suggest another method for computing this similarity.
How about this:
import pandas as pd
df1=pd... #I'd like to see how you generate your data
df2=pd...
cols_common=list(set(df1.columns).intersection(df2.columns))
df1=df1[cols_common]
df2=df2[cols_common]
result=similarity(df1,df2,'smc')
Of course, this approach presumes that the two tables have one or more columns in common. You could also delete columns from the larger dataframe arbitrarily, but I wouldn't recommend this without knowing your use case.
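If the inputs really are raw sentences, one way to get the "common columns" is to build a single shared vocabulary and encode every sentence as a 0/1 row over it, so both matrices end up with the same width M no matter how long each sentence is. Below is a minimal sketch of that idea. The helper names `sentences_to_binary` and `smc` are made up for illustration, whitespace tokenization is an assumption, and the sketch computes SMC directly rather than going through `binarize`, since the rows are already binary.

```python
import numpy as np

def sentences_to_binary(sentences_a, sentences_b):
    """Encode two lists of sentences as binary bag-of-words matrices
    over one shared vocabulary (hypothetical helper, not from the post)."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer.
    tokens_a = [s.lower().split() for s in sentences_a]
    tokens_b = [s.lower().split() for s in sentences_b]
    vocab = sorted({w for toks in tokens_a + tokens_b for w in toks})
    index = {w: j for j, w in enumerate(vocab)}

    def encode(token_lists):
        # One row per sentence, one column per vocabulary word.
        out = np.zeros((len(token_lists), len(vocab)), dtype=int)
        for i, toks in enumerate(token_lists):
            for w in toks:
                out[i, index[w]] = 1
        return out

    return encode(tokens_a), encode(tokens_b), vocab

def smc(X, Y):
    """Simple matching coefficient between every row of X and every row of Y:
    (number of 1-1 matches + number of 0-0 matches) / M."""
    M = X.shape[1]
    return (X @ Y.T + (1 - X) @ (1 - Y).T) / M

X, Y, vocab = sentences_to_binary(
    ["the cat sat", "a dog barked loudly"],
    ["the cat sat on the mat"],
)
sim = smc(X, Y)  # shape (N1, N2) = (2, 1)
print(sim)
```

Here X is N1 × M and Y is N2 × M with M equal to the vocabulary size, which is the shape the original function expects. Keep in mind that SMC counts shared absences as matches, so with a large vocabulary most sentence pairs may look similar; whether that is acceptable depends on your use case.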