compare columns of a sparse dataframe

The problem is as follows:

I have a sparse dataframe whose first column is a list of unique strings. Every other column is a list of numbers in the range [0, 1], and each of those lists sums to 1.

What is the best way to get a matching score between the lists?

Initially, I was going to use the corr() function, but lists that are orthogonal (if that is the correct term), i.e. lists with no non-zero positions in common, return negative numbers.
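As a small illustration of that corr() behaviour, here is a sketch on made-up data. The frame `toy` and its values are hypothetical and only mimic the shape of the real table: two weight columns that each sum to 1 and never have a non-zero value on the same row.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataframe
toy = pd.DataFrame({
    "unique names": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "weight_1": [0.25, 0.25, 0.25, 0.25, 0.0, 0.0, 0.0, 0.0],
    "weight_2": [0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
})

# Pearson correlation subtracts each column's mean before comparing,
# so columns with disjoint non-zero positions come out negative, not zero.
pearson = toy["weight_1"].corr(toy["weight_2"])
print(pearson)
```

The negative value comes from the mean-centering step, which is why a measure that works on the raw vectors is a better fit for a 0 to 1 score.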

Alternatively, maybe I could take the product of the two lists and divide it by the square of one of them (this would never return a negative number).

Here is an example:

      unique names  weight_1  weight_2
0     XY1052671234  0.000000  0.000000
1     XY1686846061  0.250000  0.000000
2     LM1962513674  0.250000  0.000000
3     LM1135334800  0.250000  0.000000
4     LM1292384960  0.250000  0.000000
           ...       ...       ...  
6958  AB0558521263  0.000000  0.000000
6959  CDH42097CS44  0.000000  0.500000
6960  CDH42097CB19  0.000000  0.500000
6961  EF1046224884  0.000000  0.000000
6962  GH96122UAA25  0.000000  0.000000

So what is the best way of getting a comparison score that ranges from 0 to 1, where 0 means there is no relationship between the unique names and 1 means that they are identical?

What does it mean for there to be "no relationship"?

My guess is you want something like cosine similarity between the rows, which returns 1 if the rows are identical (or, more generally, proportional to each other) and 0 if the rows have no shared non-zero positions. I.e., you treat the rows like vectors, scale each one to unit length, and compute the dot product between them.

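import pandas as pd
import numpy as np

# get a matrix of the weight columns (everything except the name column,
# which is called 'unique names' in the example above)
weight_cols = df.loc[:, df.columns != 'unique names'].to_numpy(dtype=float)

# scale each row to unit Euclidean length so the dot products fall between 0 and 1;
# all-zero rows have norm 0, so divide those by 1 to keep them at zero instead of NaN
row_norms = np.linalg.norm(weight_cols, axis=1, keepdims=True)
normalized_rows = weight_cols / np.where(row_norms == 0, 1.0, row_norms)

# take the dot product of every row with every other row
cosine_similarity_matrix = np.matmul(normalized_rows, normalized_rows.T)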

This will give you a matrix where each entry i,j is the cosine similarity between row i and row j, e.g. you get 1 on the diagonal where i==j (for any row that has at least one non-zero weight).
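Since the question asks about comparing the weight columns rather than the rows, the same idea can be applied column-wise. The following is a minimal sketch under that assumption: the helper name `column_cosine_similarity` and the toy frame are hypothetical, and the name column is assumed to be called 'unique names' as in the example table.

```python
import numpy as np
import pandas as pd

def column_cosine_similarity(df, name_col="unique names"):
    """Cosine similarity between every pair of weight columns (hypothetical helper)."""
    weights = df.drop(columns=[name_col])
    values = weights.to_numpy(dtype=float)
    # Euclidean length of each column; an all-zero column is left as zeros
    norms = np.linalg.norm(values, axis=0, keepdims=True)
    unit = values / np.where(norms == 0, 1.0, norms)
    # entry (i, j) is the dot product of unit column i and unit column j
    sim = unit.T @ unit
    return pd.DataFrame(sim, index=weights.columns, columns=weights.columns)

# Hypothetical toy data: weight_1 and weight_3 match, weight_2 shares nothing with them
toy = pd.DataFrame({
    "unique names": ["A", "B", "C", "D"],
    "weight_1": [0.5, 0.5, 0.0, 0.0],
    "weight_2": [0.0, 0.0, 0.5, 0.5],
    "weight_3": [0.5, 0.5, 0.0, 0.0],
})
scores = column_cosine_similarity(toy)
print(scores.round(3))
```

Because all weights are non-negative, every score lands between 0 and 1: columns with no non-zero positions in common score 0, and identical columns score 1.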

Note that this is slightly inefficient: since the matrix is symmetric, you only need to compute the upper-right (or lower-left) portion. It shouldn't be an issue unless this is going into production code that will be run constantly, as it's only a factor of ~2 slowdown.
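If only the distinct pairs are of interest, the strict upper triangle can be read out of the full matrix afterwards. This is a sketch on a made-up 3x3 matrix (the array `sim` is hypothetical); note that it only selects the non-redundant entries and does not by itself avoid the redundant computation.

```python
import numpy as np

# Hypothetical symmetric similarity matrix standing in for the real result
sim = np.array([
    [1.0, 0.2, 0.0],
    [0.2, 1.0, 0.5],
    [0.0, 0.5, 1.0],
])

# indices (i, j) of the strict upper triangle, i.e. every pair with i < j
rows, cols = np.triu_indices(sim.shape[0], k=1)
unique_pairs = [(int(i), int(j), float(sim[i, j])) for i, j in zip(rows, cols)]
print(unique_pairs)
```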
