Hamming distance in numpy
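I am trying to find a faster way to compute the Hamming distance between two numpy arrays. The arrays can be assumed to have shapes A (N1 x D) and B (N2 x D).
My working attempt so far: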
def slow(A, B):
    result = np.zeros((A.shape[0], B.shape[0]))
    for i in range(A.shape[0]):
        for j in range(B.shape[0]):
            # A[i, :] != B[j, :] is a boolean array of length D
            result[i, j] = np.sum(A[i, :] != B[j, :])
    return result
This is not fast enough. I tried using numpy.count_nonzero instead of sum, but it raised the following exception:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
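Edit: I forgot to mention that the arrays contain only 1 and 0 values, in case that changes anything.
My question is: is it possible to make this work?
As a bonus question: why does numpy.count_nonzero in my code pass an array to __bool__(), rather than a single value?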
Following @Paul's suggestion, I compared the running time of the two methods on large numpy.ndarray inputs:
import numpy as np
import time

def binarize(FV):
    # Map positive entries to 1 and everything else to 0
    return np.where(FV > 0, 1, 0).astype(int)

def hammingDist():
    # Two random (3450 x 128) arrays with values in [-1, 1)
    a, b = -1, 1
    u = (b - a) * np.random.random_sample((3450, 128)) + a
    v = (b - a) * np.random.random_sample((3450, 128)) + a

    b_t = time.time()
    b_u, b_v = binarize(u), binarize(v)
    print('binarization time : {} s'.format(time.time()-b_t))

    # Slow method: explicit double loop
    h_slow_t = time.time()
    H = np.zeros((b_v.shape[0], b_u.shape[0]))
    for i in range(b_v.shape[0]):
        for j in range(b_u.shape[0]):
            H[i, j] = np.sum(b_v[i, :] != b_u[j, :])
    print('H =\n{}'.format(H))
    print('t: {} s'.format(time.time()-h_slow_t))

    # Fast method: broadcasting plus count_nonzero along the last axis
    h_f = time.time()
    H_fast = np.count_nonzero(b_v[:, None, :] != b_u, axis=2)
    print('H_fast =\n{}'.format(H_fast))
    print('t: {} s'.format(time.time()-h_f))

if __name__ == "__main__":
    hammingDist()
Results:
binarization time : 0.010922908783 s
H =
[[60. 75. 65. ... 66. 56. 66.]
[64. 57. 69. ... 78. 64. 58.]
[62. 63. 65. ... 60. 66. 68.]
...
[60. 63. 69. ... 66. 60. 64.]
[68. 59. 59. ... 52. 62. 74.]
[75. 70. 58. ... 59. 65. 65.]]
t: 53.5885431767 s
H_fast =
[[60 75 65 ... 66 56 66]
[64 57 69 ... 78 64 58]
[62 63 65 ... 60 66 68]
...
[60 63 69 ... 66 60 64]
[68 59 59 ... 52 62 74]
[75 70 58 ... 59 65 65]]
t: 2.6171131134 s
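To make the broadcasting expression behind H_fast easier to follow, here is a minimal sketch on tiny arrays. The values of `a` and `b` are made up for illustration and are not part of the benchmark above. Inserting a length-1 axis lets NumPy compare every row of one array with every row of the other, which yields an (N1, N2, D) boolean array; counting the non-zero entries along the last axis then gives the (N1, N2) distance matrix. Note that the `axis` argument of `np.count_nonzero` requires NumPy 1.12 or later.

```python
import numpy as np

# Tiny illustrative inputs (hypothetical values):
# 2 and 3 binary vectors of length 4.
a = np.array([[0, 1, 1, 0],
              [1, 1, 0, 0]])
b = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 1, 1]])

# a[:, None, :] has shape (2, 1, 4) and b has shape (3, 4), so the
# comparison broadcasts to shape (2, 3, 4): every row of a against
# every row of b.
diff = a[:, None, :] != b
print(diff.shape)  # (2, 3, 4)

# Count the mismatching positions along the last axis.
dist = np.count_nonzero(diff, axis=2)
print(dist)
# [[0 2 2]
#  [2 2 2]]
```

One design consideration: the intermediate boolean array holds N1 * N2 * D elements. For the 3450 x 3450 x 128 case above that is roughly 1.5 GB, so for larger inputs it may be necessary to process the rows in chunks.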
You can implement this yourself with NumPy broadcasting, or use scikit-learn. In the timings below, scikit-learn is the fastest.
import numpy as np
# Newer scikit-learn releases expose DistanceMetric in sklearn.metrics;
# adjust this import if sklearn.neighbors no longer provides it.
import sklearn.neighbors as sn

N1 = 345
N2 = 3450
D = 128

A = np.random.randint(0, 10, size=(N1, D))
B = np.random.randint(0, 10, size=(N2, D))

def slow(A, B):
    result = np.zeros((A.shape[0], B.shape[0]))
    for i in range(A.shape[0]):
        for j in range(B.shape[0]):
            # A[i, :] != B[j, :] is a boolean array of length D
            result[i, j] = np.sum(A[i, :] != B[j, :])
    return result

def fast(A, B):
    # Broadcast to shape (N1, N2, D), then count mismatches per pair
    return np.count_nonzero(A[:, None, :] != B[None, :, :], axis=-1)

def sklearn(A, B):
    # The "hamming" metric returns the fraction of differing positions,
    # so multiply by D to get the count
    return sn.DistanceMetric.get_metric("hamming").pairwise(A, B) * A.shape[-1]

%timeit -r1 -n1 slow(A, B)
# 7.86 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%timeit -r1 -n1 fast(A, B)
# 335 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%timeit -r1 -n1 sklearn(A, B)
# 51.1 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

np.allclose(slow(A, B), fast(A, B)) # True
np.allclose(fast(A, B), sklearn(A, B)) # True
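Since the question mentions that the arrays contain only 0 and 1, one further option is to express the count of mismatches as two matrix products, which avoids building the (N1, N2, D) intermediate array. The following is a minimal sketch under that assumption. The function name `hamming_binary` and the small test arrays are illustrative and do not come from the code above, and this variant is not valid for the `randint(0, 10)` arrays used in the timings. It has not been timed here, so whether it is faster will depend on the dtype and array sizes.

```python
import numpy as np

def hamming_binary(A, B):
    """Pairwise Hamming distances for arrays containing only 0 and 1.

    Hypothetical helper: assumes every entry of A and B is 0 or 1.
    """
    A = np.asarray(A, dtype=np.int64)
    B = np.asarray(B, dtype=np.int64)
    # A position differs when (a=1, b=0) or (a=0, b=1); each matrix
    # product counts one of those two cases for every pair of rows.
    return A @ (1 - B).T + (1 - A) @ B.T

# Check against the broadcasting version on small random binary arrays.
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(5, 16))
B = rng.integers(0, 2, size=(7, 16))
reference = np.count_nonzero(A[:, None, :] != B[None, :, :], axis=-1)
print(np.array_equal(hamming_binary(A, B), reference))  # True
```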