
Does the SciPy csr matrix "mean" function take the mean over all of the matrix? How can I remove a certain value?

I have a SciPy csr matrix and I want to get its mean, but it contains a lot of zeros, because I eliminated all values that are on the main diagonal and below it, keeping only the upper triangle values. Now my csr matrix, when converted to an array, looks like this:

   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.63646664  0.34827262
   0.24316454  0.1362165   0.63646664  0.15762204  0.31692202  0.12114576
   0.35917146

As far as I understand, these zeros need to be there in order for the csr matrix to work and display things like this:

(0,5) 0.5790418
(3,10) 0.578210
(5,20) 0.912370
(67,5) 0.1093109

I saw that csr matrix has its own mean function, but does this mean function take all the zeros into account, and therefore divide by the number of elements in the array including the zeros? I need the mean of only the non-zero values. My matrix contains the similarities between multiple vectors and is more like a list of matrices, something like this:

[[ 0.          0.63646664  0.48492084  0.42134077  0.14366401  0.10909745
   0.06172853  0.08116201  0.19100626  0.14517247  0.23814955  0.1899649
   0.20181049  0.25663533  0.21003358  0.10436352  0.2038447   1.
   0.63646664  0.34827262  0.24316454  0.1362165   0.63646664  0.15762204
   0.31692202  0.12114576  0.35917146]
 [ 0.          0.          0.58644824  0.4977052   0.15953415  0.46110612
   0.42580993  0.3236768   0.48874263  0.44671607  0.59153001  0.57868948
   0.27357541  0.51645488  0.43317846  0.50985032  0.37317457  0.63646664
   1.          0.51529235  0.56963948  0.51218525  1.          0.38345582
   0.55396192  0.32287605  0.46700191]
 [ 0.          0.          0.          0.6089113   0.53873289  0.3367261
   0.29264493  0.13232082  0.43288206  0.80079927  0.37842518  0.33658945
   0.61990095  0.54372307  0.49982101  0.23555037  0.39283379  0.48492084
   0.58644824  0.64524906  0.31279271  0.39476181  0.58644824  0.39028705
   0.43856802  0.32296735  0.5541861 ]]

So how can I take the mean on only the non-zero values?

My other question is how I can remove all values that are equal to something. As I pointed out above, I probably have to set those values to zero, but how do I do that? For example, I want to get rid of all values that are equal to 1.0 or bigger. Here is the code I have so far to build my matrix:

from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

vectorized_words = sparse.csr_matrix(vectorize_words(nostopwords, glove_dict))

# calculate the cosine similarity between each pair of vectors in the matrix
cos_similiarity = cosine_similarity(vectorized_words, dense_output=False)
# (5,0) and (0,5) are duplicates, so keep only the entries above the main diagonal with scipy's triu
coo_cossim = cos_similiarity.tocoo()
vector_similarities = sparse.triu(coo_cossim, k=1).tocsr()

Yes, csr_matrix.mean() does include all of the zeros when calculating the mean. As a simple example:

from scipy.sparse import csr_matrix

m = csr_matrix(([1,1], ([2,3],[3,3])), shape=(5,5))
m.toarray()

# returns:
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0]], dtype=int32)

# test the mean method
m.mean(), m.mean(axis=0), m.mean(axis=1)

# returns:
0.080000000000000002,
matrix([[ 0. ,  0. ,  0. ,  0.4,  0. ]]),
matrix([[ 0. ],
        [ 0. ],
        [ 0.2],
        [ 0.2],
        [ 0. ]])

If you need to perform a calculation that does not include zeros, you will have to build the result with other methods. It is not terribly hard to do though:

nonzero_mean = m.sum() / m.count_nonzero()
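
For the second part of the question (dropping values that are 1.0 or bigger), one possible approach is to zero those entries in the matrix's `data` array and then call `eliminate_zeros()` so they are no longer stored. The sketch below is only an illustration: the small `sim` matrix and its values are made up to stand in for the upper-triangle similarity matrix from the question, and the per-row mean at the end is an optional extra that the question did not ask for.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for the upper-triangle similarity matrix (made-up values).
sim = csr_matrix(np.array([
    [0.0, 0.5, 1.0],
    [0.0, 0.0, 0.25],
    [0.0, 0.0, 0.0],
]))

# Set every stored entry that is >= 1.0 to zero, then remove those
# explicit zeros from the sparse structure so they are not counted.
sim.data[sim.data >= 1.0] = 0.0
sim.eliminate_zeros()

# Mean over the remaining stored (non-zero) entries only.
nonzero_mean = sim.data.mean() if sim.nnz else 0.0

# Per-row mean over non-zero entries; rows without entries are left at 0.
row_sums = np.asarray(sim.sum(axis=1)).ravel()
row_counts = sim.getnnz(axis=1)
row_means = np.divide(
    row_sums,
    row_counts,
    out=np.zeros_like(row_sums, dtype=float),
    where=row_counts > 0,
)
```

Since `sim.data` holds only the stored entries, taking its mean after `eliminate_zeros()` should match `sim.sum() / sim.count_nonzero()`. The guard on `sim.nnz` avoids taking the mean of an empty array when every entry has been removed.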

