An efficient way to calculate the mean of each column or row of non-zero elements

Question

I have a numpy array of ratings given by users to movies. Ratings range from 1 to 5, and 0 means that the user has not rated the movie. I want to calculate the average rating of each movie and the average rating of each user. In other words, I want the mean of the non-zero elements in each column and in each row.

Is there an efficient numpy array function to handle this case? I know manually iterating ratings by columns or rows can solve the problem.

Thanks in advance!

Answer 1

Since the values to discard are 0, you can compute the mean manually by summing along an axis and then dividing by the number of non-zero elements along the same axis:

import numpy as np
a = np.array([[8., 9, 7, 0], [0, 0, 5, 6]])
# Sum of each row divided by the count of non-zero entries in that row
a.sum(1) / (a != 0).sum(1)

results in:

array([ 8. ,  5.5])

As you can see, the zeros are not counted in the mean.
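
One case the one-liner above does not cover is a row or column with no non-zero entries, where the count is 0 and the division yields nan along with a RuntimeWarning. A minimal sketch of one way to handle that, assuming nan is an acceptable result for "no ratings" (the helper name nonzero_mean is hypothetical, not part of numpy):

```python
import numpy as np

def nonzero_mean(a, axis):
    """Mean of the non-zero entries along axis; nan where there are none.

    Hypothetical helper for illustration, not a numpy function.
    """
    a = np.asarray(a, dtype=float)
    totals = a.sum(axis=axis)
    counts = (a != 0).sum(axis=axis)
    # Pre-fill with nan and divide only where at least one rating exists
    out = np.full(totals.shape, np.nan)
    np.divide(totals, counts, out=out, where=counts > 0)
    return out

ratings = np.array([[1, 4, 5, 0],
                    [2, 0, 3, 0],
                    [4, 0, 0, 0]])
user_means = nonzero_mean(ratings, axis=1)   # one value per row
movie_means = nonzero_mean(ratings, axis=0)  # one value per column
```

Here the last movie has no ratings, so its entry in movie_means is nan and no warning is raised.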

Answer 2

You could make use of np.nanmean, after converting all 0 values to np.nan. Note that np.nanmean is only available in numpy 1.8 and later.

import numpy as np

ratings = np.array([[1,4,5,0],
                    [2,0,3,0],
                    [4,0,0,0]], dtype=float)  # np.float was removed in newer numpy releases


def get_means(ratings):
    # Replace zeros with NaN so that nanmean skips them (this modifies ratings in place)
    ratings[ratings == 0] = np.nan

    user_means = np.nanmean(ratings, axis=1)
    movie_means = np.nanmean(ratings, axis=0)

    return {'user_means' : user_means, 'movie_means' : movie_means}

Result:

>>> get_means(ratings)
{'movie_means': array([ 2.33333333,  4.        ,  4.        ,         nan]),
 'user_means': array([ 3.33333333,  2.5       ,  4.        ])}
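
The function above overwrites the zeros in the array passed to it, and np.nanmean emits a RuntimeWarning for a column that is entirely NaN (the last movie here). A rough sketch of a variant that leaves the input untouched and silences that warning, assuming nan is the desired result for an unrated movie (the name get_means_copy is hypothetical):

```python
import warnings
import numpy as np

def get_means_copy(ratings):
    """Like get_means, but works on a copy instead of modifying ratings."""
    # np.where builds a new float array with NaN wherever a rating is 0
    masked = np.where(ratings == 0, np.nan, ratings.astype(float))
    with warnings.catch_warnings():
        # An all-NaN row or column triggers "Mean of empty slice"
        warnings.simplefilter("ignore", category=RuntimeWarning)
        user_means = np.nanmean(masked, axis=1)
        movie_means = np.nanmean(masked, axis=0)
    return {"user_means": user_means, "movie_means": movie_means}

ratings = np.array([[1, 4, 5, 0],
                    [2, 0, 3, 0],
                    [4, 0, 0, 0]])
means = get_means_copy(ratings)
```

After the call, ratings still contains its original zeros, so it can be reused for other computations.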

Answer 3

Another alternative is to use a masked array, with the 0 values masked. For example (using @Akavali's sample data):

In [30]: ratings = np.array([[1,4,5,0],
   ....:                     [2,0,3,0],
   ....:                     [4,0,0,0]], dtype=float)

Create the masked version of ratings, using ratings==0 as the mask:

In [31]: mratings = np.ma.masked_array(ratings, mask=ratings==0)

In [32]: mratings
Out[32]: 
masked_array(data =
 [[1.0 4.0 5.0 --]
 [2.0 -- 3.0 --]
 [4.0 -- -- --]],
             mask =
 [[False False False  True]
 [False  True False  True]
 [False  True  True  True]],
       fill_value = 1e+20)

Now compute the mean along each axis:

In [33]: mratings.mean(axis=0)
Out[33]: 
masked_array(data = [2.3333333333333335 4.0 4.0 --],
             mask = [False False False  True],
       fill_value = 1e+20)

In [34]: mratings.mean(axis=1)
Out[34]: 
masked_array(data = [3.3333333333333335 2.5 4.0],
             mask = [False False False],
       fill_value = 1e+20)
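
If the result needs to be a plain ndarray rather than a masked array, the masked means can be filled afterwards. A short sketch, assuming nan is a suitable placeholder for fully masked columns; np.ma.masked_equal is used here as a shorthand for building the same mask:

```python
import numpy as np

ratings = np.array([[1, 4, 5, 0],
                    [2, 0, 3, 0],
                    [4, 0, 0, 0]], dtype=float)

# Mask every entry equal to 0, equivalent to mask=ratings==0
mratings = np.ma.masked_equal(ratings, 0)

# Take the masked means, then replace any masked result with nan
movie_means = np.ma.filled(mratings.mean(axis=0), np.nan)
user_means = np.ma.filled(mratings.mean(axis=1), np.nan)
```

Both results are ordinary float arrays, with nan in the position of the movie that nobody rated.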
