I am working with CSV files and have code that calculates the similarity between documents. Post 1 provides the code and details of the data; the output is as follows:
The data.csv looks like this:
idx messages
112 I have a car and it is blue
114 I have a bike and it is red
115 I don't have any car
117 I don't have any bike
The output is:
id 112 114 115 117
id
112 100.0 78.0 51.0 50.0
114 78.0 100.0 47.0 54.0
115 51.0 47.0 100.0 83.0
117 50.0 54.0 83.0 100.0
Now I would like to calculate the mean and standard deviation of the lower triangle of the similarity matrix (the matrix is symmetric, so the upper and lower triangles hold the same values), excluding the diagonal entries (100.0).
I tried to use the pandas built-in mean and std:
df_std = df.std()
df_Mean = df.mean()
But this takes all of the data in the output into account, including the diagonal and the upper triangle.
Is there a way to calculate the mean and standard deviation as described above?
Use numpy.tril with k=-1 to zero out the diagonal and the upper triangle, then keep only the nonzero entries:
import numpy as np
ltri = np.tril(df.values, -1)
ltri = ltri[np.nonzero(ltri)]
Output of np.tril(df.values, -1), before the zeros are removed:
array([[ 0., 0., 0., 0.],
[78., 0., 0., 0.],
[51., 47., 0., 0.],
[50., 54., 83., 0.]])
And now you can do ltri.std(), ltri.mean():
ltri.std(), ltri.mean()
# (14.361406616345072, 60.5)
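One caveat with filtering on nonzero values is that a genuine similarity of 0.0 in the lower triangle would be dropped as well. The sketch below is one possible alternative that selects the entries by position with np.tril_indices_from instead. The DataFrame is rebuilt by hand from the output shown in the question, and the variable names (arr, lower, std_pop, std_sample) are illustrative only. It also shows that NumPy's std defaults to the population standard deviation (ddof=0), while pandas' std defaults to the sample standard deviation (ddof=1), so the two give different numbers unless ddof is set explicitly.

```python
import numpy as np
import pandas as pd

# Similarity matrix copied from the output in the question.
ids = [112, 114, 115, 117]
df = pd.DataFrame(
    [[100.0, 78.0, 51.0, 50.0],
     [78.0, 100.0, 47.0, 54.0],
     [51.0, 47.0, 100.0, 83.0],
     [50.0, 54.0, 83.0, 100.0]],
    index=ids, columns=ids,
)

arr = df.to_numpy()

# Row/column indices of the strictly lower triangle (k=-1 skips the diagonal).
rows, cols = np.tril_indices_from(arr, k=-1)
lower = arr[rows, cols]  # 1-D array of the six pairwise similarities

mean = lower.mean()
std_pop = lower.std()           # population std (ddof=0), NumPy's default
std_sample = lower.std(ddof=1)  # sample std, matches the pandas default

print(mean, std_pop, std_sample)
```

With this data, mean and std_pop reproduce the values in the answer above (60.5 and roughly 14.36); std_sample is somewhat larger because it divides by n - 1 rather than n.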
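You can also do it by masking all of the unwanted values as np.nan. Note that df.mean() and df.std() then return one value per column rather than a single overall value: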
df.values[np.triu_indices_from(df.values,0)]=np.nan
df.mean()
112 59.666667
114 50.500000
115 83.000000
117 NaN
dtype: float64
df.std()
112 15.885003
114 4.949747
115 NaN
117 NaN
dtype: float64
After masking the values, df looks like this:
df
112 114 115 117
112 NaN NaN NaN NaN
114 78.0 NaN NaN NaN
115 51.0 47.0 NaN NaN
117 50.0 54.0 83.0 NaN
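Since the question asks for a single mean and standard deviation rather than one per column, the masked frame can be reduced further. The following is a minimal sketch under a few assumptions: the DataFrame is rebuilt by hand from the question's output, the names keep, masked, overall_mean and overall_std are illustrative, and df.where is used to build a masked copy instead of assigning into df.values, because whether that in-place assignment is allowed can depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Similarity matrix copied from the output in the question.
ids = [112, 114, 115, 117]
df = pd.DataFrame(
    [[100.0, 78.0, 51.0, 50.0],
     [78.0, 100.0, 47.0, 54.0],
     [51.0, 47.0, 100.0, 83.0],
     [50.0, 54.0, 83.0, 100.0]],
    index=ids, columns=ids,
)

# Boolean mask that is True only strictly below the diagonal.
keep = np.tril(np.ones(df.shape, dtype=bool), k=-1)

# Masked copy: everything outside the lower triangle becomes NaN;
# the original df is left unchanged.
masked = df.where(keep)

# Per-column statistics, as in the answer above.
per_column_mean = masked.mean()

# Overall statistics across all remaining values, ignoring NaN.
overall_mean = np.nanmean(masked.to_numpy())
overall_std = np.nanstd(masked.to_numpy(), ddof=1)  # sample std, like pandas

print(overall_mean, overall_std)
```

Here ddof=1 is chosen to match the pandas default; passing ddof=0 instead gives the population value that NumPy's plain std returns.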