Counting non-zero elements within each row and within each column of a 2D NumPy array

I have a NumPy matrix that contains mostly non-zero values but occasionally contains a zero. I need to be able to:

  1. Count the non-zero values in each row and put that count into a variable that I can use in subsequent operations, perhaps by iterating through row indices and performing the calculations during the iteration.

  2. Count the non-zero values in each column and put that count into a variable that I can use in subsequent operations, perhaps by iterating through column indices and performing the calculations during the iteration.

For example, one thing I need to do is sum each row and then divide each row sum by the number of non-zero values in that row, reporting a separate result for each row index. Then I need to sum each column and divide the column sum by the number of non-zero values in that column, also reporting a separate result for each column index. I need to do other things as well, but they should be easy once I figure out how to do the things listed here.

The code I am working with is below. You can see that I am creating an array of zeros and then populating it from a csv file. Some of the rows will contain values for all the columns, but other rows will still have zeros remaining in some of the last columns, thus creating the problem described above.

The last five lines of the code below are from another posting on this forum. They print a list of row/column indices for the zeros. However, I do not know how to use that information to create the non-zero row counts and non-zero column counts described above.

ANOVAInputMatrixValuesArray=zeros([len(TestIDs),9],float)
j=0
for j in range(0,len(TestIDs)):
    TestID=str(TestIDs[j])
    ReadOrWrite='Read'
    fileName=inputFileName
    directory=GetCurrentDirectory(arguments that return correct directory)
    inputfile=open(directory,'r')
    reader=csv.reader(inputfile)
    m=0
    for row in reader:
        if m<9:
            if row[0]!='TestID':
                ANOVAInputMatrixValuesArray[(j-1),m]=row[2]
                m+=1
    inputfile.close()

IndicesOfZeros = indices(ANOVAInputMatrixValuesArray.shape) 
locs = IndicesOfZeros[:,ANOVAInputMatrixValuesArray == 0]
pts = hsplit(locs, len(locs[0]))
for pt in pts:
    print(', '.join(str(p[0]) for p in pt))

Can anyone help me with this?

import numpy as np

a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 7]])

columns = (a != 0).sum(0)
rows    = (a != 0).sum(1)

The expression (a != 0) is a boolean array of the same shape as the original a, and it contains True for every non-zero element.

The .sum(x) method sums the elements over the axis x. Because True counts as 1 and False as 0, the sum of a boolean array is the number of True elements.

The variables columns and rows contain the number of non-zero (element != 0) values in each column/row of your original array:

columns = np.array([2, 1, 3])
rows    = np.array([2, 3, 1])
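As an alternative to the boolean sum, here is a minimal sketch using np.count_nonzero, which accepts an axis argument in NumPy 1.12 and later. It is not part of the original answer and assumes a reasonably recent NumPy:

```python
import numpy as np

a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 7]])

# Count non-zero entries down each column (axis 0) and across each row (axis 1).
columns = np.count_nonzero(a, axis=0)
rows = np.count_nonzero(a, axis=1)

print(columns)  # [2 1 3]
print(rows)     # [2 3 1]
```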

EDIT: The whole code could look like this (with a few simplifications to your original code):

ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
for j, TestID in enumerate(TestIDs):
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    # use directory or filename to get the CSV file?
    with open(directory, 'r') as csvfile:
        ANOVAInputMatrixValuesArray[j,:] = loadtxt(csvfile, comments='TestID', delimiter=';', usecols=(2,))[:9]

nonZeroCols = (ANOVAInputMatrixValuesArray != 0).sum(0)
nonZeroRows = (ANOVAInputMatrixValuesArray != 0).sum(1)

EDIT 2:

To get the mean of the non-zero values in each column/row (the sum divided by the non-zero count), use the following:

colMean = a.sum(0) / (a != 0).sum(0)
rowMean = a.sum(1) / (a != 0).sum(1)

What do you want to do if a column/row contains no non-zero elements? The code can then be adapted to handle that case.
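One possible way to handle a row or column with no non-zero elements is to divide only where the count is positive and leave NaN elsewhere. This is a sketch under that assumption (NaN as the placeholder is a choice made for illustration, not something the question specifies); it uses the out and where arguments of np.divide:

```python
import numpy as np

a = np.array([[1.0, 0.0, 3.0],
              [0.0, 0.0, 0.0],
              [2.0, 0.0, 4.0]])

row_counts = (a != 0).sum(1)
col_counts = (a != 0).sum(0)

# Divide only where the count is non-zero; the other slots keep the NaN fill value.
row_mean = np.divide(a.sum(1), row_counts,
                     out=np.full(a.shape[0], np.nan), where=row_counts != 0)
col_mean = np.divide(a.sum(0), col_counts,
                     out=np.full(a.shape[1], np.nan), where=col_counts != 0)

print(row_mean)  # [ 2. nan  3.]
print(col_mean)  # [1.5 nan 3.5]
```

Any other placeholder (for example 0) can be used by changing the fill value passed to np.full.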

A fast way to count nonzero elements per row in a scipy sparse matrix m is:

np.diff(m.tocsr().indptr)

The indptr attribute of a CSR matrix holds the indices within the data array that mark the boundaries between rows, so the difference between consecutive entries gives the number of stored (non-zero) elements in each row.

Similarly, for the number of nonzero elements in each column, use:

np.diff(m.tocsc().indptr)

If the data is already in the appropriate format, these run in O(m.shape[0]) and O(m.shape[1]) respectively, rather than O(m.getnnz()) as in Marat and Finn's solutions.
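To make the indptr approach concrete, here is a small sketch on a 3x3 matrix (the example values are made up for illustration). Note that indptr counts stored entries, so explicitly stored zeros would be included unless they are removed first, for instance with eliminate_zeros():

```python
import numpy as np
from scipy import sparse

m = sparse.csr_matrix(np.array([[1, 0, 1],
                                [2, 3, 4],
                                [0, 0, 7]]))

# indptr[i]:indptr[i + 1] delimits the stored entries of row i (column i for CSC).
row_nonzeros = np.diff(m.tocsr().indptr)
col_nonzeros = np.diff(m.tocsc().indptr)

print(row_nonzeros)  # [2 3 1]
print(col_nonzeros)  # [2 1 3]
```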

If you need both row and column nonzero counts and, say, m is already a CSR matrix, you might use:

row_nonzeros = np.diff(m.indptr)
col_nonzeros = np.bincount(m.indices)

This is not asymptotically faster than first converting to CSC (which is O(m.getnnz())) to get col_nonzeros, but it is faster in practice because of implementation details.
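One caveat with np.bincount is that its result is only as long as the largest column index present, so trailing all-zero columns are dropped unless minlength is given. A minimal sketch of that situation, with example values made up for illustration:

```python
import numpy as np
from scipy import sparse

# The last column is entirely zero.
m = sparse.csr_matrix(np.array([[1, 0, 0],
                                [2, 3, 0]]))

row_nonzeros = np.diff(m.indptr)
short = np.bincount(m.indices)  # trailing empty column is missing here
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])

print(row_nonzeros)  # [1 2]
print(short)         # [2 1]
print(col_nonzeros)  # [2 1 0]
```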

A faster way is to clone your matrix with ones in place of the real values, then sum by rows or columns:

X_clone = X.tocsc()
X_clone.data = np.ones( X_clone.data.shape )
NumNonZeroElementsByColumn = X_clone.sum(0)
NumNonZeroElementsByRow = X_clone.sum(1)

That ran about 50 times faster for me than Finn Årup Nielsen's solution (1 second against 53).

Edit: You may need to convert NumNonZeroElementsByColumn into a 1-dimensional array with:

np.array(NumNonZeroElementsByColumn)[0]
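A self-contained sketch of the same idea follows. It assumes a SciPy version where tocsc accepts copy=True; the explicit copy matters because, if X were already in CSC format, a plain tocsc() could return X itself and overwriting .data would then modify the original. The example matrix is made up for illustration:

```python
import numpy as np
from scipy import sparse

X = sparse.csr_matrix(np.array([[0.5, 0.0, 2.0],
                                [0.0, 0.0, 3.0]]))

# Work on a copy so that X keeps its real values.
X_clone = X.tocsc(copy=True)
X_clone.data = np.ones(X_clone.data.shape)

# Flatten the 2-D sum results into plain 1-D arrays.
by_column = np.asarray(X_clone.sum(0)).ravel()
by_row = np.asarray(X_clone.sum(1)).ravel()

print(by_column)  # [1. 0. 2.]
print(by_row)     # [2. 1.]
```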

(a != 0) does not work for sparse matrices (scipy.sparse.lil_matrix) in my present version of scipy.

For sparse matrices I did:

    (i,j) = X.nonzero()
    column_sums = np.zeros(X.shape[1])
    for n in np.asarray(j).ravel():
        column_sums[n] += 1.

I wonder if there is a more elegant way.
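One way to avoid the explicit loop is to pass the index arrays returned by nonzero() to np.bincount. This is a sketch, not the poster's code; the small lil_matrix and the names column_counts and row_counts are made up for illustration:

```python
import numpy as np
from scipy import sparse

X = sparse.lil_matrix((3, 4))
X[0, 1] = 5.0
X[2, 1] = 1.0
X[2, 3] = 2.0

# nonzero() returns the row and column indices of the non-zero entries.
i, j = X.nonzero()

# Count how often each index occurs; minlength keeps empty rows/columns.
column_counts = np.bincount(j, minlength=X.shape[1])
row_counts = np.bincount(i, minlength=X.shape[0])

print(column_counts)  # [0 2 0 1]
print(row_counts)     # [1 0 2]
```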

For sparse matrices, use the getnnz() method supported by CSR/CSC matrices.

E.g.:

a = scipy.sparse.csr_matrix([[0, 1, 1], [0, 1, 0]])
a.getnnz(axis=0)

array([0, 2, 1])
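For completeness, a runnable sketch of the same call with the import included, also showing the per-row and whole-matrix counts. getnnz counts stored entries, so explicitly stored zeros, if any, would be included:

```python
import numpy as np
from scipy import sparse

a = sparse.csr_matrix([[0, 1, 1], [0, 1, 0]])

per_column = a.getnnz(axis=0)  # stored entries in each column
per_row = a.getnnz(axis=1)     # stored entries in each row
total = a.getnnz()             # stored entries in the whole matrix

print(per_column)  # [0 2 1]
print(per_row)     # [2 1]
print(total)       # 3
```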
