
Numpy matrix row comparison

This question is mainly about the performance of the calculation.

I have two matrices with the same number of columns and a different number of rows. One matrix is the 'pattern': each of its rows has to be compared separately with all rows of the other matrix, so that I can then extract statistics such as the mean number of elements equal to the pattern, the std, and so on. So I have the following matrices, and the computation is:

import numpy as np

numCols = 10
pattern = np.random.randint(0,2,size=(7,numCols))
matrix = np.random.randint(0,2,size=(5,numCols))

# For each pattern row: count matching columns against every matrix row,
# then average those counts over the matrix rows
comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    comp_mean[i] = np.mean(np.sum(pattern[i,:] == matrix, axis=1))

print(comp_mean) # Output example: [ 1.6  1.   1.6  2.2  2.   2.   1.6]

This is clear. The problem is that the number of rows of both matrices is much bigger (~1,000,000), so this code runs very slowly. I tried to write it with vectorized numpy syntax, as that sometimes surprises me by improving the calculation time. So I wrote the following code (it may look strange, but it works!):

comp_mean = np.mean( np.sum( (pattern[np.repeat(np.arange(pattern.shape[0]), matrix.shape[0])].ravel() == np.tile(matrix.ravel(),pattern.shape[0])).reshape(pattern.shape[0],matrix.shape[0],matrix.shape[1]), axis=2 ),axis=1)
print(comp_mean)

However, this code is slower than the previous one, where the 'for' loop is used. So I would like to know if there is any way to speed up the calculation.

EDIT

I have checked the runtime of the different approaches on the real matrices, and the results are the following:

  • Me - Approach 1: 18.04 seconds
  • Me - Approach 2: 303.10 seconds
  • Divakar - Approach 1: 18.79 seconds
  • Divakar - Approach 2: 65.11 seconds
  • Divakar - Approach 3.1: 137.78 seconds
  • Divakar - Approach 3.2: 59.59 seconds
  • Divakar - Approach 4: 6.06 seconds

EDIT(2)

The previous runs were performed on a laptop. I have now run the code on a desktop, leaving out the slowest approaches, and the new runtimes are different:

  • Me - Approach 1: 6.25 seconds
  • Divakar - Approach 1: 4.01 seconds
  • Divakar - Approach 2: 3.66 seconds
  • Divakar - Approach 4: 3.12 seconds

A few approaches using broadcasting could be suggested here.

Approach #1

out = np.mean(np.sum(pattern[:,None,:] == matrix[None,:,:],2),1)
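To make the broadcasting in Approach #1 concrete, here is a minimal sketch that checks it against the loop from the question on small random inputs. The fixed seed and the names `rng`, `eq` and `ref` are only illustrative assumptions, not part of the original answer.

```python
import numpy as np

rng = np.random.RandomState(0)  # fixed seed, only so the check is repeatable
num_cols = 10
pattern = rng.randint(0, 2, size=(7, num_cols))
matrix = rng.randint(0, 2, size=(5, num_cols))

# (7, 1, 10) == (1, 5, 10) broadcasts to a (7, 5, 10) boolean array:
# eq[i, j, k] is True where pattern[i, k] equals matrix[j, k]
eq = pattern[:, None, :] == matrix[None, :, :]

# Count matches per (pattern row, matrix row), then average over matrix rows
out = np.mean(np.sum(eq, 2), 1)

# Reference result: the loop from the question
ref = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    ref[i] = np.mean(np.sum(pattern[i, :] == matrix, axis=1))

print(eq.shape)
print(np.allclose(out, ref))
```

Note that the intermediate boolean array holds one element for every (pattern row, matrix row, column) triple, so its size grows with the product of the two row counts.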

Approach #2

mrows = matrix.shape[0]
prows = pattern.shape[0]
# float() avoids integer (floor) division of the integer sums under Python 2
out = (pattern[:,None,:] == matrix[None,:,:]).reshape(prows,-1).sum(1)/float(mrows)

Approach #3

mrows = matrix.shape[0]
prows = pattern.shape[0]
# float() avoids integer (floor) division of the integer sums under Python 2
out = np.einsum('ijk->i',(pattern[:,None,:] == matrix[None,:,:]).astype(int))/float(mrows)
# OR out = np.einsum('ijk->i',(pattern[:,None,:] == matrix[None,:,:])+0)/float(mrows)
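As a quick sanity check, the sketch below computes Approaches #1, #2 and #3 on the same small random inputs and confirms that they agree. The seed and the names `out1`, `out2` and `out3` are illustrative assumptions, and the division uses `float(mrows)` so that it is a true division under both Python 2 and Python 3.

```python
import numpy as np

rng = np.random.RandomState(1)  # fixed seed, only so the check is repeatable
pattern = rng.randint(0, 2, size=(7, 10))
matrix = rng.randint(0, 2, size=(5, 10))
mrows = matrix.shape[0]
prows = pattern.shape[0]

# Shared broadcasted comparison, shape (prows, mrows, ncols)
eq = pattern[:, None, :] == matrix[None, :, :]

# Approach #1: sum over columns, then mean over matrix rows
out1 = np.mean(np.sum(eq, 2), 1)
# Approach #2: flatten the last two axes, sum once, divide by mrows
out2 = eq.reshape(prows, -1).sum(1) / float(mrows)
# Approach #3: let einsum sum over the last two axes of the integer array
out3 = np.einsum('ijk->i', eq.astype(int)) / float(mrows)

print(np.allclose(out1, out2) and np.allclose(out1, out3))
```

All three reduce the same broadcasted array, so any runtime difference between them comes from how the reduction is carried out rather than from the comparison itself.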

Approach #4

If the number of rows in matrix is huge, it could be better to stick to a for-loop and avoid the huge memory requirements of the broadcasting approaches, which might also lead to slow runtimes. Instead, we could do some optimizations within each loop iteration. Here is one such possible optimization:

mrows = matrix.shape[0]
comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    comp_mean[i] = (pattern[i,:] == matrix).sum()
comp_mean = comp_mean/mrows
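A possible middle ground between full broadcasting and the row-by-row loop is to broadcast a few pattern rows at a time, which bounds the size of the temporary boolean array. The following is only a sketch of that idea, not something from the original answer: the function name `comp_mean_chunked` and the `chunk` parameter are hypothetical, and whether it is faster than Approach #4 has not been measured here.

```python
import numpy as np

def comp_mean_chunked(pattern, matrix, chunk=8):
    """Mean number of matching columns per pattern row, computed in blocks.

    `chunk` (hypothetical parameter) is the number of pattern rows compared
    at once; the temporary boolean array has chunk * mrows * ncols elements.
    """
    mrows = matrix.shape[0]
    out = np.empty(pattern.shape[0])
    for start in range(0, pattern.shape[0], chunk):
        block = pattern[start:start + chunk]
        # Broadcast only this block against the whole matrix
        eq = block[:, None, :] == matrix[None, :, :]
        # Total matches per pattern row in the block, averaged over matrix rows
        out[start:start + chunk] = eq.reshape(len(block), -1).sum(1) / float(mrows)
    return out

rng = np.random.RandomState(2)  # fixed seed, only so the check is repeatable
pattern = rng.randint(0, 2, size=(7, 10))
matrix = rng.randint(0, 2, size=(5, 10))

# Reference: the loop from Approach #4
ref = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    ref[i] = (pattern[i, :] == matrix).sum()
ref = ref / matrix.shape[0]

print(np.allclose(comp_mean_chunked(pattern, matrix, chunk=3), ref))
```

With `chunk=1` this reduces to the loop above, and with `chunk` equal to the number of pattern rows it becomes the full broadcast, so a suitable value depends on the available memory.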

Could you give this a try:

import scipy.ndimage.measurements

comp_mean = np.zeros(pattern.shape[0])
for i in range(pattern.shape[0]):
    m = scipy.ndimage.measurements.histogram(matrix,0,1,2,pattern[i],[0,1])
    comp_mean[i] = m[0][0]+m[1][1]
comp_mean /= matrix.shape[0]

Regards.
