NumPy matrix row comparison

Question

This question is mainly about computational performance.

I have two matrices with the same number of columns but different numbers of rows. One matrix is the 'pattern': each of its rows has to be compared separately with every row of the other matrix, so that I can then extract statistics such as the mean number of elements equal to the pattern, the standard deviation, and so on. The matrices and the computation look like this:

import numpy as np

numCols = 10
pattern = np.random.randint(0,2,size=(7,numCols))
matrix = np.random.randint(0,2,size=(5,numCols))

comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    # count the columns where pattern row i equals each matrix row, then average
    comp_mean[i] = np.mean(np.sum(pattern[i,:] == matrix, axis=1))

print(comp_mean) # Output example: [ 1.6  1.   1.6  2.2  2.   2.   1.6]

This is clear. The problem is that both matrices actually have far more rows (~1,000,000), so this code runs very slowly. I tried to rewrite it in vectorized NumPy syntax, since that sometimes improves the computation time surprisingly well. So I wrote the following code (it may look strange, but it works!):

comp_mean = np.mean( np.sum( (pattern[np.repeat(np.arange(pattern.shape[0]), matrix.shape[0])].ravel() == np.tile(matrix.ravel(),pattern.shape[0])).reshape(pattern.shape[0],matrix.shape[0],matrix.shape[1]), axis=2 ),axis=1)
print(comp_mean)

However, this code is slower than the previous one, which uses the 'for' loop. So I would like to know whether there is any way to speed up the calculation.

EDIT

I have checked the runtime of the different approaches on the real matrices, and the results are the following:

Me - Approach 1: 18.04 seconds
Me - Approach 2: 303.10 seconds
Divakar - Approach 1: 18.79 seconds
Divakar - Approach 2: 65.11 seconds
Divakar - Approach 3.1: 137.78 seconds
Divakar - Approach 3.2: 59.59 seconds
Divakar - Approach 4: 6.06 seconds

EDIT (2)

The previous runs were performed on a laptop. I have now run the code on a desktop, leaving out the approaches with the worst results, and the new runtimes are different:

Me - Approach 1: 6.25 seconds
Divakar - Approach 1: 4.01 seconds
Divakar - Approach 2: 3.66 seconds
Divakar - Approach 4: 3.12 seconds
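
For anyone who wants to reproduce this kind of comparison, a minimal timing harness might look like the sketch below. The array sizes and the seed are assumptions chosen so that it runs quickly; they are not the real data, and the function names `loop_version` and `broadcast_version` are made up for illustration. Only the loop from the question and the one-line broadcasting version are timed.

```python
import timeit
import numpy as np

rng = np.random.RandomState(0)                 # fixed seed, assumed for repeatability
pattern = rng.randint(0, 2, size=(200, 10))    # assumed sizes, far smaller than the real data
matrix = rng.randint(0, 2, size=(2000, 10))

def loop_version():
    # The 'for' loop from the question
    out = np.zeros(pattern.shape[0])
    for i in range(pattern.shape[0]):
        out[i] = np.mean(np.sum(pattern[i, :] == matrix, axis=1))
    return out

def broadcast_version():
    # One-line broadcasting version
    return np.mean(np.sum(pattern[:, None, :] == matrix[None, :, :], 2), 1)

for name, fn in [("loop", loop_version), ("broadcast", broadcast_version)]:
    # Best of 3 repeats, 3 calls each, reported per call
    best = min(timeit.repeat(fn, number=3, repeat=3)) / 3
    print("%-10s %.6f s per call" % (name, best))
```

The ranking can change with the array sizes and the machine, as the laptop and desktop figures above already suggest.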

Answer 1

A few approaches using broadcasting could be suggested here.

Approach #1

out = np.mean(np.sum(pattern[:,None,:] == matrix[None,:,:],2),1)
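
As a quick sanity check of what the broadcasting does, the sketch below uses the small shapes from the question (the fixed seed is an assumption, added only for reproducibility) and compares the result against the original loop. Inserting `None` adds a length-1 axis, so a `(7, 1, 10)` array is compared with a `(1, 5, 10)` array and the result is a `(7, 5, 10)` boolean array.

```python
import numpy as np

rng = np.random.RandomState(0)              # fixed seed, assumed for reproducibility
pattern = rng.randint(0, 2, size=(7, 10))
matrix = rng.randint(0, 2, size=(5, 10))

# (7, 1, 10) == (1, 5, 10) broadcasts to a (7, 5, 10) boolean array:
# eq[i, j, k] is True where pattern[i, k] equals matrix[j, k]
eq = pattern[:, None, :] == matrix[None, :, :]

# Sum over columns, then average over the rows of matrix
broadcast_mean = eq.sum(axis=2).mean(axis=1)

# Reference: the loop from the question
loop_mean = np.array([np.mean(np.sum(p == matrix, axis=1)) for p in pattern])

print(eq.shape)
print(np.allclose(broadcast_mean, loop_mean))
```

Note that the intermediate array holds one boolean per (pattern row, matrix row, column) triple, which is where the memory cost of this approach comes from.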

Approach #2

mrows = matrix.shape[0]
prows = pattern.shape[0]
# float() keeps true division on Python 2 as well
out = (pattern[:,None,:] == matrix[None,:,:]).reshape(prows,-1).sum(1)/float(mrows)

Approach #3

mrows = matrix.shape[0]
prows = pattern.shape[0]
# float() keeps true division on Python 2 as well
out = np.einsum('ijk->i',(pattern[:,None,:] == matrix[None,:,:]).astype(int))/float(mrows)
# OR out = np.einsum('ijk->i',(pattern[:,None,:] == matrix[None,:,:])+0)/float(mrows)

Approach #4

If the number of rows in matrix is huge, it could be better to stick to a for-loop to avoid the huge memory requirements of the broadcasting approaches, which might also lead to slow runtimes. Instead, we could do some optimizations within each loop iteration. Here's one such possible optimization:

mrows = matrix.shape[0]
comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    comp_mean[i] = (pattern[i,:] == matrix).sum()
comp_mean = comp_mean/mrows
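
To put the memory concern in numbers: if both arrays really have about 1,000,000 rows and, as in the example, 10 columns, the full broadcast comparison would need 10^6 * 10^6 * 10 = 10^13 one-byte booleans, roughly 10 TB. A possible middle ground between the pure loop and full broadcasting is to broadcast a block of pattern rows at a time. The sketch below is only an illustration of that idea; the function name `chunked_match_mean` and the `chunk_rows` parameter are hypothetical, and a good block size depends on the available memory.

```python
import numpy as np

def chunked_match_mean(pattern, matrix, chunk_rows=32):
    """Mean number of matching columns per pattern row, computed block by block.

    chunk_rows is a hypothetical tuning knob: each block materialises a boolean
    array of chunk_rows * matrix.shape[0] * matrix.shape[1] bytes.
    """
    prows = pattern.shape[0]
    out = np.empty(prows, dtype=float)
    for start in range(0, prows, chunk_rows):
        block = pattern[start:start + chunk_rows]          # at most chunk_rows rows
        eq = block[:, None, :] == matrix[None, :, :]       # (len(block), mrows, ncols)
        # Total matches per pattern row, summed over all matrix rows and columns
        out[start:start + chunk_rows] = eq.reshape(len(block), -1).sum(axis=1)
    return out / float(matrix.shape[0])
```

With `chunk_rows=1` this reduces to the loop above, and with `chunk_rows >= pattern.shape[0]` it reduces to the full broadcasting approach.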

Answer 2

Could you give this a try:

import scipy.ndimage.measurements

comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    m = scipy.ndimage.measurements.histogram(matrix,0,1,2,pattern[i],[0,1])
    comp_mean[i] = m[0][0]+m[1][1]
comp_mean /= matrix.shape[0]
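
The counting idea behind the histogram can also be sketched in plain NumPy, under the assumption that every entry of both arrays is 0 or 1 (as in the `randint(0,2,...)` example; it would give wrong results for other values). For each column, count once how many rows of matrix hold a 1 and how many hold a 0; a pattern entry then matches exactly that many rows in its column. The function name `match_mean_binary` is made up for illustration.

```python
import numpy as np

def match_mean_binary(pattern, matrix):
    """Mean number of matching columns per pattern row.

    Assumes every entry of pattern and matrix is 0 or 1.
    """
    mrows = matrix.shape[0]
    ones = matrix.sum(axis=0)        # per column: number of matrix rows holding a 1
    zeros = mrows - ones             # per column: number of matrix rows holding a 0
    # A 1 in the pattern matches `ones` rows in that column, a 0 matches `zeros`
    matches = np.where(pattern == 1, ones, zeros)    # shape (prows, ncols)
    return matches.sum(axis=1) / float(mrows)
```

This touches each array only once and never builds an intermediate larger than pattern itself, so the work grows with (rows of pattern + rows of matrix) * columns rather than with the product of the two row counts.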

Regards.

2 answers

Answer 1: score 3, accepted, 2015-06-12 09:34:51

Answer 2: score 0, 2015-06-12 13:07:50