Compute the mean and variance of specific columns of a 2D array only for rows where an element satisfies a condition

Question

I have a 2D NumPy array with shape (690L, 15L). I need to compute the column-wise mean of this dataset for some particular columns only, with one condition: a row is included if and only if the element in a specific column of that row satisfies a condition. Let me make this clearer with some code.

f = open("data.data")
dataset =  np.loadtxt(fname = f, delimiter = ',')

I have a list of the column indexes for which I need to compute the mean (and variance):

index_catego = [0, 3, 4, 5, 7, 8, 10, 11]

The condition is that dataset[i, 14] == 1. As output I want a 1D array of length len(index_catego), where each element is the mean of the corresponding column listed above:

output = [mean_of_index_0, mean_of_index_3, ..., mean_of_index_11]

I started using Python recently, but I am sure there is a cool way of doing this with np.where, a mask, np.mean, or something else.

I have already implemented a solution, but it is not elegant and I am not sure it is correct.

import numpy as np

index_catego = [0, 3, 4, 5, 7, 8, 10, 11]

matrix_mean_positive = []
matrix_variance_positive = []
matrix_mean_negative = []
matrix_variance_negative = []

n_positive = 0
n_negative = 0

sum_positive = np.empty(len(index_catego))
sum_negative = np.empty(len(index_catego))


for i in range(dataset.shape[0]):
    if dataset[i, 14] == 0:
        n_positive = n_positive + 1
        j = 0
        for k in index_catego:
            sum_positive[j] = sum_positive[j] + dataset[i, k]
            j = j + 1
    else:
        n_negative = n_negative + 1
        j = 0
        for k in index_catego:
            sum_negative[j] = sum_negative[j] + dataset[i, k]
            j = j + 1

for item in np.nditer(sum_positive):
    matrix_mean_positive.append(item / n_positive)

for item in np.nditer(sum_negative):
    matrix_mean_negative.append(item / n_negative)

print(matrix_mean_positive)
print(matrix_mean_negative)

If you want to try your solution, here is some example data:

1,22.08,11.46,2,4,4,1.585,0,0,0,1,2,100,1213,0
0,22.67,7,2,8,4,0.165,0,0,0,0,2,160,1,0
0,29.58,1.75,1,4,4,1.25,0,0,0,1,2,280,1,0
0,21.67,11.5,1,5,3,0,1,1,11,1,2,0,1,1
1,20.17,8.17,2,6,4,1.96,1,1,14,0,2,60,159,1
0,15.83,0.585,2,8,8,1.5,1,1,2,0,2,100,1,1
1,17.42,6.5,2,3,4,0.125,0,0,0,0,2,60,101,0

Thanks for your help.

UPDATE 1: I tried this:

output_positive = dataset[:, index_catego][dataset[:, 14] == 0]
mean_p = output_positive.mean(axis = 0)
print(mean_p)

output_negative = dataset[:, index_catego][dataset[:, 14] == 1]
mean_n = output_negative.mean(axis = 0)
print(mean_n)

but the means computed by the first solution (the not-so-cool one) and the second solution (the cool one-liner) all differ. I checked what dataset[:, index_catego][dataset[:, 14] == 0] and dataset[:, index_catego][dataset[:, 14] == 1] select, and it seems correct (right dimensions and right elements).

UPDATE 2: OK, the first solution is wrong because, for example, the first column contains only 0 and 1 as elements, yet the mean it returns is greater than 1. I do not know where I went wrong. The positive class seems correct (or at least plausible), while the negative class is not even plausible.

So, is the second solution correct? Is there a better way of doing it?

UPDATE 3: I think I found the problem with the first solution: I am using Jupyter Notebook, and sometimes (not every time) when I rerun the cell containing the first solution, the elements of matrix_mean_positive and matrix_mean_negative are doubled. If someone knows why, could you point it out to me?

Now both solutions return the same means.
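
As a cross-check of the mask-based approach from UPDATE 1, here is a minimal sketch that runs on the seven sample rows posted above. The names label0, label1, rows0 and rows1 are illustrative and do not come from the question; the variance uses NumPy's default (population variance, ddof=0), which is an assumption about what is wanted.

```python
import io
import numpy as np

# The seven sample rows from the question, loaded from an in-memory buffer.
sample = """\
1,22.08,11.46,2,4,4,1.585,0,0,0,1,2,100,1213,0
0,22.67,7,2,8,4,0.165,0,0,0,0,2,160,1,0
0,29.58,1.75,1,4,4,1.25,0,0,0,1,2,280,1,0
0,21.67,11.5,1,5,3,0,1,1,11,1,2,0,1,1
1,20.17,8.17,2,6,4,1.96,1,1,14,0,2,60,159,1
0,15.83,0.585,2,8,8,1.5,1,1,2,0,2,100,1,1
1,17.42,6.5,2,3,4,0.125,0,0,0,0,2,60,101,0
"""
dataset = np.loadtxt(io.StringIO(sample), delimiter=",")
index_catego = [0, 3, 4, 5, 7, 8, 10, 11]

# Boolean row masks built from the label column (index 14).
label0 = dataset[:, 14] == 0
label1 = dataset[:, 14] == 1

# Keep only the matching rows, then only the columns of interest.
rows0 = dataset[label0][:, index_catego]
rows1 = dataset[label1][:, index_catego]

# Column-wise statistics, one value per entry of index_catego.
mean0, var0 = rows0.mean(axis=0), rows0.var(axis=0)
mean1, var1 = rows1.mean(axis=0), rows1.var(axis=0)

print(mean0, var0)
print(mean1, var1)
```

Selecting rows first and columns second gives the same result as the order used in UPDATE 1 (columns first, then rows). Pass ddof=1 to var if the sample variance is needed instead.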

Answer 1

Before re-running, choose Kernel -> Restart in Jupyter Notebook to clear the memory.
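
A possible contributing cause, offered as an assumption and not confirmed by the asker: the first solution creates its accumulators with np.empty, which allocates memory without initializing it, so the running sums may start from arbitrary leftover values that differ from run to run. A minimal sketch of the same accumulation pattern with np.zeros, on a small made-up array (values is hypothetical, not the question's data):

```python
import numpy as np

# Small illustrative array: 2 rows, 2 columns.
values = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

# np.empty only allocates memory; its initial contents are arbitrary.
# np.zeros guarantees that every accumulator starts at 0.
col_sum = np.zeros(values.shape[1])

# Accumulate the rows one by one, as in the loop-based solution.
for row in values:
    col_sum = col_sum + row

col_mean = col_sum / values.shape[0]
print(col_mean)  # [2. 3.]
```

With zero-initialized accumulators, the result no longer depends on whatever was previously in memory, so re-running the cell gives the same means.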

Solution 1
0 2018-09-15 12:39:47