An efficient way to compute the mean of the non-zero elements of each column or row

Question

I have a numpy array of ratings given by users to movies. Ratings are between 1 and 5, and 0 means that the user has not rated the movie. I want to calculate the average rating of each movie and the average rating of each user. In other words, I want the mean of the non-zero elements of each column or row.

Is there an efficient numpy array function to handle this case? I know that manually iterating over the ratings by column or row would solve the problem.

Thanks in advance!

Answer 1

Since the values to discard are 0, you can compute the mean manually by summing along an axis and then dividing by the number of non-zero elements along the same axis:

import numpy as np

a = np.array([[8.,9,7,0], [0,0,5,6]])
a.sum(1)/(a != 0).sum(1)  # row sums divided by the count of non-zero entries per row

This results in:

array([ 8. ,  5.5])

As you can see, the zeros are not counted in the mean.
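One caveat with the sum/count approach is that a row or column containing only zeros leads to a division by zero. The following is a minimal sketch of one way to guard against that, assuming NaN is an acceptable result for such rows or columns. The helper name `nonzero_mean` is hypothetical and the sample data is borrowed from the other answers in this thread.

```python
import numpy as np


def nonzero_mean(a, axis):
    """Mean of the non-zero entries along `axis`; NaN where there are none.

    Hypothetical helper, not part of NumPy.
    """
    a = np.asarray(a, dtype=float)
    totals = a.sum(axis=axis)
    counts = (a != 0).sum(axis=axis)
    # Start from NaN and divide only where at least one non-zero entry exists.
    out = np.full(totals.shape, np.nan)
    np.divide(totals, counts, out=out, where=counts > 0)
    return out


ratings = np.array([[1, 4, 5, 0],
                    [2, 0, 3, 0],
                    [4, 0, 0, 0]])

user_means = nonzero_mean(ratings, axis=1)   # one value per row (user)
movie_means = nonzero_mean(ratings, axis=0)  # one value per column (movie)
```

Here the last column has no ratings at all, so its entry in `movie_means` is NaN rather than the result of a 0/0 division.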

Answer 2

You could make use of np.nanmean, after converting all 0 values to np.nan. Note that np.nanmean is only available in numpy 1.8 and later.

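import numpy as np

ratings = np.array([[1,4,5,0],
                    [2,0,3,0],
                    [4,0,0,0]], dtype=float)  # np.float was removed in NumPy 1.24; use the builtin float


def get_means(ratings):
    # Replace zeros with NaN in a new array, so the caller's array is not modified
    ratings = np.where(ratings == 0, np.nan, ratings)

    user_means = np.nanmean(ratings, axis=1)    # one mean per row (user)
    movie_means = np.nanmean(ratings, axis=0)   # one mean per column (movie)

    return {'user_means' : user_means, 'movie_means' : movie_means}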

Result:

>>> get_means(ratings)
{'movie_means': array([ 2.33333333,  4.        ,  4.        ,         nan]),
 'user_means': array([ 3.33333333,  2.5       ,  4.        ])}

Answer 3

Another alternative is to use a masked array, with the 0 values masked. For example (using @Akavali's sample data):

In [30]: ratings = np.array([[1,4,5,0],
   ....:                     [2,0,3,0],
   ....:                     [4,0,0,0]], dtype=float)

Create the masked version of ratings, using ratings==0 as the mask:

In [31]: mratings = np.ma.masked_array(ratings, mask=ratings==0)

In [32]: mratings
Out[32]: 
masked_array(data =
 [[1.0 4.0 5.0 --]
 [2.0 -- 3.0 --]
 [4.0 -- -- --]],
             mask =
 [[False False False  True]
 [False  True False  True]
 [False  True  True  True]],
       fill_value = 1e+20)

Now compute the mean along each axis:

In [33]: mratings.mean(axis=0)
Out[33]: 
masked_array(data = [2.3333333333333335 4.0 4.0 --],
             mask = [False False False  True],
       fill_value = 1e+20)

In [34]: mratings.mean(axis=1)
Out[34]: 
masked_array(data = [3.3333333333333335 2.5 4.0],
             mask = [False False False],
       fill_value = 1e+20)
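If plain ndarrays are needed downstream rather than masked arrays, the masked results can be converted. This is a small sketch, assuming NaN is the desired placeholder for movies without any ratings. It uses `np.ma.masked_equal` as a shorthand for building the same mask as above.

```python
import numpy as np

ratings = np.array([[1, 4, 5, 0],
                    [2, 0, 3, 0],
                    [4, 0, 0, 0]], dtype=float)

# Mask every entry equal to 0 (same mask as ratings == 0).
mratings = np.ma.masked_equal(ratings, 0)

# .filled(np.nan) returns a regular ndarray, with masked results replaced by NaN.
movie_means = mratings.mean(axis=0).filled(np.nan)
user_means = mratings.mean(axis=1).filled(np.nan)
```

The fully masked last column becomes NaN in `movie_means`, while `user_means` has no masked entries and is returned unchanged as a plain array.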
