I have a numpy array of ratings given by users to movies. Ratings are between 1 and 5, and 0 means that the user has not rated the movie. I want to calculate the average rating of each movie and the average rating of each user. In other words, I want the mean of the non-zero elements in each column or row.
Is there an efficient numpy function to handle this case? I know that manually iterating over the ratings by column or row would solve the problem.
Thanks in advance!
Since the values to discard are 0, you can compute the mean manually by taking the sum along an axis and then dividing by the number of non-zero elements along the same axis:
import numpy as np
a = np.array([[8.,9,7,0], [0,0,5,6]])
a.sum(1)/(a != 0).sum(1)
results in:
array([ 8. , 5.5])
As you can see, the zeros are not counted in the mean.
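A minimal sketch of the same sum/count idea applied to both axes of a ratings matrix. The helper name nonzero_mean is hypothetical, and the sample data is borrowed from the other answers in this thread. It assumes you want NaN, rather than a division warning, for a movie or user that has no ratings at all:

```python
import numpy as np

def nonzero_mean(a, axis):
    """Mean of the non-zero entries along `axis`; NaN where there are none."""
    a = np.asarray(a, dtype=float)
    totals = a.sum(axis=axis)
    counts = (a != 0).sum(axis=axis)
    # Start from NaN so rows/columns without any rating stay NaN
    out = np.full(totals.shape, np.nan)
    # Divide only where there is at least one non-zero entry
    np.divide(totals, counts, out=out, where=counts > 0)
    return out

ratings = np.array([[1, 4, 5, 0],
                    [2, 0, 3, 0],
                    [4, 0, 0, 0]])
user_means = nonzero_mean(ratings, axis=1)   # one value per row (user)
movie_means = nonzero_mean(ratings, axis=0)  # one value per column (movie)
```

The where argument skips the division for empty rows or columns, so the last movie (no ratings) comes out as NaN without a runtime warning.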
You could make use of np.nanmean, after converting all 0 values to np.nan. Note that np.nanmean is only available in numpy 1.8 and later.
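import numpy as np
# Use the builtin float: the np.float alias was removed in NumPy 1.24
ratings = np.array([[1,4,5,0],
                    [2,0,3,0],
                    [4,0,0,0]], dtype=float)
def get_means(ratings):
    # Note: this replaces the zeros with NaN in place, modifying the caller's array
    ratings[np.where(ratings == 0)] = np.nan
    user_means = np.nanmean(ratings, axis=1)   # mean of each row
    movie_means = np.nanmean(ratings, axis=0)  # mean of each column
    return {'user_means' : user_means, 'movie_means' : movie_means}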
Result:
>>> get_means(ratings)
{'movie_means': array([ 2.33333333, 4. , 4. , nan]),
'user_means': array([ 3.33333333, 2.5 , 4. ])}
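Note that get_means above writes NaN into the array it is given, and that np.nanmean emits a RuntimeWarning ("Mean of empty slice") for a movie nobody has rated. If either of those matters, a variant along these lines might help. This is only a sketch: get_means_copy is a hypothetical name, and silencing the warning is a choice you may not want to make.

```python
import warnings
import numpy as np

def get_means_copy(ratings):
    # np.where builds a new float array, so the caller's data is left untouched
    with_nan = np.where(ratings == 0, np.nan, ratings)
    with warnings.catch_warnings():
        # An all-NaN row or column triggers "Mean of empty slice"
        warnings.simplefilter("ignore", category=RuntimeWarning)
        user_means = np.nanmean(with_nan, axis=1)
        movie_means = np.nanmean(with_nan, axis=0)
    return {'user_means': user_means, 'movie_means': movie_means}

ratings = np.array([[1, 4, 5, 0],
                    [2, 0, 3, 0],
                    [4, 0, 0, 0]])
means = get_means_copy(ratings)
```

Because the NaN substitution happens on a copy, this also works when ratings has an integer dtype, which cannot hold NaN.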
Another alternative is to use a masked array, with the 0 values masked. For example (using @Akavali's sample data):
In [30]: ratings = np.array([[1,4,5,0],
   ....:                     [2,0,3,0],
   ....:                     [4,0,0,0]], dtype=float)
Create the masked version of ratings, using ratings==0 as the mask:
In [31]: mratings = np.ma.masked_array(ratings, mask=ratings==0)
In [32]: mratings
Out[32]:
masked_array(data =
[[1.0 4.0 5.0 --]
[2.0 -- 3.0 --]
[4.0 -- -- --]],
mask =
[[False False False True]
[False True False True]
[False True True True]],
fill_value = 1e+20)
Now compute the mean along each axis:
In [33]: mratings.mean(axis=0)
Out[33]:
masked_array(data = [2.3333333333333335 4.0 4.0 --],
mask = [False False False True],
fill_value = 1e+20)
In [34]: mratings.mean(axis=1)
Out[34]:
masked_array(data = [3.3333333333333335 2.5 4.0],
mask = [False False False],
fill_value = 1e+20)
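If you need plain ndarrays afterwards, one option is to fill the masked entries of the result with NaN. The sketch below assumes NaN is an acceptable placeholder for "no ratings", and uses np.ma.masked_equal as a shorthand for building the same mask:

```python
import numpy as np

ratings = np.array([[1, 4, 5, 0],
                    [2, 0, 3, 0],
                    [4, 0, 0, 0]], dtype=float)

# Mask every entry equal to 0 (same mask as ratings == 0)
mratings = np.ma.masked_equal(ratings, 0)

# .filled(np.nan) converts the masked result into a regular ndarray,
# with NaN wherever the whole row or column was masked
movie_means = mratings.mean(axis=0).filled(np.nan)
user_means = mratings.mean(axis=1).filled(np.nan)
```

The original ratings array is not modified; the mask lives only on mratings.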