
Python cosine_similarity doesn't work for matrix with NaNs

I need a Python function that behaves like this R call:

proxy::simil(method = "cosine", by_rows = FALSE) 

i.e. it builds a similarity matrix by computing the cosine similarity between every pair of dataframe rows. If NaNs are present, it should drop, for each pair, exactly the columns in which either of the two rows has a NaN.

Simil function description (R)

Python error because of NaNs

Update: I have also tried removing the NaNs from every pair of rows in a loop, using the cosine function from scipy.spatial.distance. It gives the same result as R, but it takes ages :(
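For reference, the loop described above might look roughly like the sketch below. The function name `pairwise_cosine_loop` and the sample matrix are made up for illustration, and the choice to leave NaN for pairs with no shared columns is an assumption rather than something stated in the question.

```python
import numpy as np
from scipy.spatial.distance import cosine


def pairwise_cosine_loop(X):
    """Cosine similarity between rows, dropping NaN columns pair by pair."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairs that share no observed column stay NaN (an assumption).
    sim = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(i, n):
            # Keep only the columns observed in both rows.
            keep = ~np.isnan(X[i]) & ~np.isnan(X[j])
            if keep.any():
                # scipy returns a distance, so convert it to a similarity.
                sim[i, j] = sim[j, i] = 1.0 - cosine(X[i, keep], X[j, keep])
    return sim


# Hypothetical sample data
X_loop = np.array([[1.0, 2.0, np.nan],
                   [2.0, np.nan, 4.0],
                   [1.0, 2.0, 3.0]])
sim_loop = pairwise_cosine_loop(X_loop)
```

The double Python loop makes one scipy call per pair of rows, which is presumably why it gets slow on large matrices.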

You can try this approach: https://github.com/Midnighter/nadist. Alternatively, you can use _chk_weights with nan_screen=True, as described by metaperture here: https://github.com/scipy/scipy/issues/3870. Hope that helps.

I found that Midnighter had previously posted the same problem on Stack Overflow: "Compute the pairwise distance in scipy with missing values". There are some other solutions there but, since he went on to Cythonize it, I bet they were not the best.

I solved the problem by creating a mask (a boolean array indicating which values are missing) and calculating the pairwise cosine distances between the row vectors of the matrix. The result was a long vector of similarities, which I then pivoted to get the similarity matrix.
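The answer does not include its code, so the following is only a sketch of one way a mask can be used. It is a matrix-product variant rather than the long-vector-then-pivot approach described above. The function name `nan_cosine_similarity` and the sample data are hypothetical.

```python
import numpy as np


def nan_cosine_similarity(X):
    """Row-wise cosine similarity using only columns observed in both rows."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)              # True where a value is present
    Z = np.where(observed, X, 0.0)       # missing values contribute 0
    M = observed.astype(float)

    # Dot product over shared columns: a term is 0 if either value is missing.
    dot = Z @ Z.T
    # Squared norm of row i restricted to the columns observed in row j.
    sq_norm_i = (Z ** 2) @ M.T
    # Squared norm of row j restricted to the columns observed in row i.
    sq_norm_j = sq_norm_i.T

    denom = np.sqrt(sq_norm_i * sq_norm_j)
    # Pairs with no shared columns (or a zero vector) come out as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return dot / denom


# Hypothetical sample data
X_mask = np.array([[1.0, 2.0, np.nan],
                   [2.0, np.nan, 4.0],
                   [1.0, 2.0, 3.0]])
sim_mask = nan_cosine_similarity(X_mask)
```

Everything happens in three matrix products, so no Python-level loop over pairs is needed. To compare columns instead of rows, pass `X.T`.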

You can replace the NaNs with 0 and then try computing the cosine similarity.
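A quick sketch of the zero-filling idea, with a made-up function name and sample data. Note that it is not the same as dropping columns pair by pair: each row's norm still includes values from columns that the other row is missing, so the numbers can differ from the R result.

```python
import numpy as np


def zero_filled_cosine_similarity(X):
    """Cosine similarity between rows after replacing NaN with 0."""
    X = np.asarray(X, dtype=float)
    Z = np.where(np.isnan(X), 0.0, X)
    norms = np.linalg.norm(Z, axis=1)    # norms use every observed value
    with np.errstate(invalid="ignore", divide="ignore"):
        return (Z @ Z.T) / np.outer(norms, norms)


# Hypothetical sample data: the two rows share only the first column.
X_zero = np.array([[1.0, 2.0, np.nan],
                   [2.0, np.nan, 4.0]])
sim_zero = zero_filled_cosine_similarity(X_zero)
# Zero-filling gives 2 / (sqrt(5) * sqrt(20)) = 0.2 for this pair,
# whereas dropping the NaN columns for the pair would give 1.0.
```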
