
How to calculate the mean and standard deviation of similarity matrix?

I am working with CSV files and have code that calculates the similarity between documents. Post 1 provides the code and details; the data and output are as follows.

The data.csv looks like this:

idx         messages
112  I have a car and it is blue
114  I have a bike and it is red
115  I don't have any car
117  I don't have any bike

The output is:

    id     112    114    115    117
    id                             
    112  100.0   78.0   51.0   50.0
    114   78.0  100.0   47.0   54.0
    115   51.0   47.0  100.0   83.0
    117   50.0   54.0   83.0  100.0

Now I would like to calculate the mean and standard deviation of the lower triangle of the similarity matrix (the matrix is symmetric, so the upper and lower triangles hold the same values), excluding the diagonal entries (100.0).

I tried the pandas built-in mean and std:

df_std = df.std()
df_Mean = df.mean()

But this uses all the values in the output, including the diagonal and the upper triangle.

Is there a way to calculate the mean and standard deviation over only the values below the diagonal?

Use numpy.tril with k=-1 to zero out the diagonal and the upper triangle:

import numpy as np

ltri = np.tril(df.values, -1)

Output:

array([[ 0.,  0.,  0.,  0.],
       [78.,  0.,  0.,  0.],
       [51., 47.,  0.,  0.],
       [50., 54., 83.,  0.]])

Then keep only the non-zero entries, which flattens the result to a 1-D array:

ltri = ltri[np.nonzero(ltri)]

And now you can call ltri.std() and ltri.mean():

ltri.std(), ltri.mean()
# (14.361406616345072, 60.5)
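
One caveat: filtering with np.nonzero would also drop any off-diagonal similarity that happens to be exactly 0. A minimal sketch that selects the strict lower triangle by position instead, assuming the matrix from the question is rebuilt by hand (the names sim and lower and the hard-coded values are for illustration only). It also shows that NumPy's std defaults to the population formula (ddof=0), while pandas' std defaults to the sample formula (ddof=1):

```python
import numpy as np
import pandas as pd

# Assumption: the similarity matrix from the question, typed in by hand.
ids = [112, 114, 115, 117]
sim = pd.DataFrame(
    [[100.0, 78.0, 51.0, 50.0],
     [78.0, 100.0, 47.0, 54.0],
     [51.0, 47.0, 100.0, 83.0],
     [50.0, 54.0, 83.0, 100.0]],
    index=ids, columns=ids,
)

values = sim.to_numpy()

# Row/column indices of the cells strictly below the diagonal (k=-1).
rows, cols = np.tril_indices_from(values, k=-1)
lower = values[rows, cols]  # 1-D array of the six pairwise similarities

mean = lower.mean()
std_population = lower.std()        # ddof=0, NumPy's default
std_sample = lower.std(ddof=1)      # ddof=1, matches pandas' default

print(mean, std_population, std_sample)
```

Which of the two standard deviations is appropriate depends on whether the six pairs are treated as the whole population or as a sample.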

You can also mask all of the unwanted values as np.nan; df.mean() and df.std() then give per-column statistics:

df.values[np.triu_indices_from(df.values,0)]=np.nan
df.mean()
112    59.666667
114    50.500000
115    83.000000
117          NaN
dtype: float64
df.std()
112    15.885003
114     4.949747
115          NaN
117          NaN
dtype: float64

After masking, df looks like this:

df
      112   114   115  117
112   NaN   NaN   NaN  NaN
114  78.0   NaN   NaN  NaN
115  51.0  47.0   NaN  NaN
117  50.0  54.0  83.0  NaN
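
The masked frame above yields one value per column. If a single overall mean and standard deviation are wanted, one option is to build the mask with DataFrame.where, which returns a new frame rather than writing into df.values, and then stack the remaining cells into one Series. A sketch under the assumption that the matrix from the question is rebuilt by hand (sim, keep, masked and flat are illustrative names):

```python
import numpy as np
import pandas as pd

# Assumption: the similarity matrix from the question, typed in by hand.
ids = [112, 114, 115, 117]
sim = pd.DataFrame(
    [[100.0, 78.0, 51.0, 50.0],
     [78.0, 100.0, 47.0, 54.0],
     [51.0, 47.0, 100.0, 83.0],
     [50.0, 54.0, 83.0, 100.0]],
    index=ids, columns=ids,
)

# Boolean mask that is True only strictly below the diagonal.
keep = np.tril(np.ones(sim.shape, dtype=bool), k=-1)

# Cells where the mask is False become NaN; sim itself is left unchanged.
masked = sim.where(keep)

# Flatten to one long Series and discard the masked (NaN) cells.
flat = masked.stack().dropna()

overall_mean = flat.mean()
overall_std = flat.std()  # pandas default is the sample std (ddof=1)

print(overall_mean, overall_std)
```

Passing ddof=0 to flat.std() would give the population standard deviation instead.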
