An efficient way to calculate the mean of each column or row of non-zero elements
I have a numpy array of ratings given by users to movies. Ratings are between 1 and 5, and 0 means that the user has not rated the movie. I want to calculate the average rating of each movie and the average rating of each user. In other words, I want the mean of the non-zero elements of each column or row.
Is there an efficient numpy function to handle this case? I know that manually iterating over the ratings by columns or rows would solve the problem.
Thanks in advance!
Since the values to discard are 0, you can compute the mean manually by summing along an axis and then dividing by the number of non-zero elements along the same axis:
a = np.array([[8.,9,7,0], [0,0,5,6]])
a.sum(1)/(a != 0).sum(1)
results in:
array([ 8. , 5.5])
As you can see, the zeros are not counted in the mean.
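One caveat with this approach: a row or column with no non-zero entries has a count of 0, so the plain division yields nan together with a runtime warning. Below is a minimal sketch that makes this case explicit, assuming you want nan for such rows or columns. The helper name nonzero_mean is hypothetical, not a numpy function.

```python
import numpy as np

def nonzero_mean(a, axis):
    """Mean of the non-zero entries along `axis`; nan where there are none."""
    a = np.asarray(a, dtype=float)
    totals = a.sum(axis=axis)
    counts = (a != 0).sum(axis=axis)
    # Start from nan and divide only where at least one non-zero entry exists.
    out = np.full(totals.shape, np.nan)
    np.divide(totals, counts, out=out, where=counts > 0)
    return out

ratings = np.array([[1, 4, 5, 0],
                    [2, 0, 3, 0],
                    [4, 0, 0, 0]])
user_means = nonzero_mean(ratings, axis=1)   # one value per row
movie_means = nonzero_mean(ratings, axis=0)  # one value per column
```

The `where=` argument of np.divide skips the positions with a zero count, so no warning is raised and those positions keep the nan they were initialised with.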
You could make use of np.nanmean after converting all 0 values to np.nan. Note that np.nanmean is only available in numpy 1.8 and later.
import numpy as np
ratings = np.array([[1,4,5,0],
                    [2,0,3,0],
                    [4,0,0,0]], dtype=float)
def get_means(ratings):
    # Note: this overwrites the zeros in the caller's array with nan, in place.
    ratings[ratings == 0] = np.nan
    user_means = np.nanmean(ratings, axis=1)
    movie_means = np.nanmean(ratings, axis=0)
    return {'user_means' : user_means, 'movie_means' : movie_means}
Result:
>>> get_means(ratings)
{'movie_means': array([ 2.33333333, 4. , 4. , nan]),
'user_means': array([ 3.33333333, 2.5 , 4. ])}
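Note that the function above assigns np.nan into the array it is given, so the caller's ratings are modified, and the array must have a float dtype for that assignment to work. If that side effect is unwanted, one possible variant is sketched below. The name get_means_copy is hypothetical, and silencing the warning is an assumption about what you want for movies nobody rated.

```python
import warnings
import numpy as np

def get_means_copy(ratings):
    # np.where builds a new float array, so the caller's array is left untouched
    # and integer input works as well.
    masked = np.where(ratings == 0, np.nan, ratings)
    with warnings.catch_warnings():
        # np.nanmean emits a RuntimeWarning for rows/columns that are all nan.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return {'user_means': np.nanmean(masked, axis=1),
                'movie_means': np.nanmean(masked, axis=0)}

ratings = np.array([[1, 4, 5, 0],
                    [2, 0, 3, 0],
                    [4, 0, 0, 0]])
means = get_means_copy(ratings)
```

After the call, ratings still contains its original zeros, and the last entry of means['movie_means'] is nan because no user rated that movie.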
Another alternative is to use a masked array, with the 0 values masked. For example (using @Akavali's sample data):
In [30]: ratings = np.array([[1,4,5,0],
   ....:                     [2,0,3,0],
   ....:                     [4,0,0,0]], dtype=float)
Create the masked version of ratings, using ratings==0 as the mask:
In [31]: mratings = np.ma.masked_array(ratings, mask=ratings==0)
In [32]: mratings
Out[32]:
masked_array(data =
[[1.0 4.0 5.0 --]
[2.0 -- 3.0 --]
[4.0 -- -- --]],
mask =
[[False False False True]
[False True False True]
[False True True True]],
fill_value = 1e+20)
Now compute the mean along each axis:
In [33]: mratings.mean(axis=0)
Out[33]:
masked_array(data = [2.3333333333333335 4.0 4.0 --],
mask = [False False False True],
fill_value = 1e+20)
In [34]: mratings.mean(axis=1)
Out[34]:
masked_array(data = [3.3333333333333335 2.5 4.0],
mask = [False False False],
fill_value = 1e+20)
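The means above come back as masked arrays. If later code expects plain ndarrays, one option is to fill the masked entries with nan. The following is a small sketch of that step, using np.ma.masked_equal as a shorthand for building the mask; the variable names are illustrative only.

```python
import numpy as np

ratings = np.array([[1, 4, 5, 0],
                    [2, 0, 3, 0],
                    [4, 0, 0, 0]], dtype=float)

# masked_equal masks every entry equal to the given value (here 0).
mratings = np.ma.masked_equal(ratings, 0)

# filled(np.nan) turns the masked result into a plain ndarray,
# writing nan where the mean is undefined (a column with no ratings).
movie_means = mratings.mean(axis=0).filled(np.nan)
user_means = mratings.mean(axis=1).filled(np.nan)
```

Unlike the np.nan conversion, the masked-array route leaves the original ratings array unchanged.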