Testing similarity of several datasets by producing a cross-correlation matrix

I am trying to compare several datasets and basically test whether they show the same feature, although this feature might be shifted, reversed or attenuated. A very simple example below:

A = np.array([0., 0, 0, 1., 2., 3., 4., 3, 2, 1, 0, 0, 0])
B = np.array([0., 0, 0, 0, 0, 1, 2., 3., 4, 3, 2, 1, 0])
C = np.array([0., 0, 0, 1, 1.5, 2, 1.5, 1, 0, 0, 0, 0, 0])
D = np.array([0., 0, 0, 0, 0, -2, -4, -2, 0, 0, 0, 0, 0])
x = np.arange(0,len(A),1)

which look like this:
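A minimal matplotlib sketch to reproduce that plot (my addition; the original post showed an image here):

import matplotlib.pyplot as plt

# Plot the four raw signals against the shared x axis.
for sig, name in zip([A, B, C, D], ['A', 'B', 'C', 'D']):
    plt.plot(x, sig, label=name)
plt.legend()
plt.show()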

I thought the best way to do it would be to normalize these signals, take their absolute values (their attenuation is not important to me at this stage, I am interested in the position... but I might be wrong, so thoughts about this concept are welcome too), and calculate the area where they overlap. I am following up on this answer - the solution looked very elegant and simple, but I may be implementing it wrongly.

def normalize(sig):
    #ns = sig/max(np.abs(sig))
    ns = sig/sum(sig)
    return ns
a = normalize(A)
b = normalize(B)
c = normalize(C)
d = normalize(D)

which then look like this (plot of the normalized signals):

But then, when I try to implement the solution from the answer, I run into problems.

OLD

for c1,w1 in enumerate([a,b,c,d]):
    for c2,w2 in enumerate([a,b,c,d]):
        w1 = np.abs(w1)
        w2 = np.abs(w2)
        M[c1,c2] = integrate.trapz(min(np.abs(w2).any(),np.abs(w1).any()))
print M

Produces TypeError: 'numpy.bool_' object is not iterable or IndexError: list assignment index out of range. But I only included the .any() because without them, I was getting ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
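For reference, here is a minimal sketch of how the overlap-area idea could be written without those errors, using the element-wise np.minimum instead of the built-in min (this is my guess at what the linked answer intended, not its actual code):

import numpy as np
from scipy import integrate

M = np.zeros([4,4])
for c1,w1 in enumerate([a,b,c,d]):
    for c2,w2 in enumerate([a,b,c,d]):
        # The element-wise minimum of the two (absolute) signals is the
        # overlapping region; integrating it gives the overlap area.
        overlap = np.minimum(np.abs(w1), np.abs(w2))
        M[c1,c2] = integrate.trapz(overlap)
print(M)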

EDIT - NEW (thanks @Kody King)

The new code is now:

M = np.zeros([4,4])
SH = np.zeros([4,4])
for c1,w1 in enumerate([a,b,c,d]):
    for c2,w2 in enumerate([a,b,c,d]):
        crossCorrelation = np.correlate(w1,w2, 'full')
        bestShift = np.argmax(crossCorrelation)

        # This reverses the effect of the padding.
        actualShift = bestShift - len(w2) + 1
        similarity = crossCorrelation[bestShift]

        M[c1,c2] = similarity
        SH[c1,c2] = actualShift
M = M/M.max()
print M, '\n', SH

And the output:

[[ 1.          1.          0.95454545  0.63636364]
 [ 1.          1.          0.95454545  0.63636364]
 [ 0.95454545  0.95454545  0.95454545  0.63636364]
 [ 0.63636364  0.63636364  0.63636364  0.54545455]] 
[[ 0. -2.  1.  0.]
 [ 2.  0.  3.  2.]
 [-1. -3.  0. -1.]
 [ 0. -2.  1.  0.]]

The matrix of shifts looks ok now, but the actual correlation matrix does not. I am really puzzled by the fact that the lowest correlation value is for correlating d with itself. What I would like to achieve now is that:

  1. The correlation value should be highest for correlating a signal with itself (i.e. to have the highest values on the main diagonal).
  2. The correlation values should be in the range between 0 and 1, so as a result, I would have 1s on the main diagonal and other numbers (0.x) elsewhere.

I was hoping the M = M/M.max() would do the job, but only if condition no. 1 is fulfilled, which it currently isn't.

EDIT - UPDATE

Following the advice, I used the recommended normalization formula (dividing the signal by its sum), but the problem wasn't solved, just reversed. Now the correlation of d with d is 1, but all the other signals don't correlate with themselves.

New output:

[[ 0.45833333  0.45833333  0.5         0.58333333]
 [ 0.45833333  0.45833333  0.5         0.58333333]
 [ 0.5         0.5         0.57142857  0.66666667]
 [ 0.58333333  0.58333333  0.66666667  1.        ]] 
[[ 0. -2.  1.  0.]
 [ 2.  0.  3.  2.]
 [-1. -3.  0. -1.]
 [ 0. -2.  1.  0.]]

As ssm said, numpy's correlate function works well for this problem. You mentioned that you are interested in the position. The correlate function can also help you tell how far one sequence is shifted from another.

import numpy as np

def compare(a, b):
    # 'full' pads the sequences with 0's so they are correlated
    # with as little as 1 actual element overlapping.
    crossCorrelation = np.correlate(a,b, 'full')
    bestShift = np.argmax(crossCorrelation)

    # This reverses the effect of the padding.
    actualShift = bestShift - len(b) + 1
    similarity = crossCorrelation[bestShift]

    print('Shift: ' + str(actualShift))
    print('Similarity: ' + str(similarity))
    return {'shift': actualShift, 'similarity': similarity}

print('\nExpected shift: 0')
compare([0,0,1,0,0], [0,0,1,0,0])
print('\nExpected shift: 2')
compare([0,0,1,0,0], [1,0,0,0,0])
print('\nExpected shift: -2')
compare([1,0,0,0,0], [0,0,1,0,0])
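For context, the reason for subtracting len(b) - 1: with mode 'full', np.correlate returns len(a) + len(b) - 1 values, and index len(b) - 1 corresponds to zero shift, so the argmax index minus len(b) - 1 is the actual shift. A small check (my own illustration, not part of the original answer):

a = [0, 0, 1, 0, 0]
b = [1, 0, 0, 0, 0]
cc = np.correlate(a, b, 'full')       # 9 values for two length-5 inputs
print(cc)                             # the peak sits at index 6
print(np.argmax(cc) - (len(b) - 1))   # 6 - 4 = 2, the expected shift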

Edit:

You need to normalize each sequence before correlating them, or the larger sequences will have a very high correlation with all the other sequences.

A property of cross-correlation is that, for nonnegative sequences:

$$(f \star g)[n] \;=\; \sum_i f[i]\,g[i+n] \;\le\; \Big(\sum_i f[i]\Big)\Big(\sum_j g[j]\Big)$$

So if you normalize by dividing each sequence by its sum, the similarity will always be between 0 and 1.
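A quick numeric check of this on the question's A and B signals (my own sketch, using the sum normalization recommended above):

import numpy as np

A = np.array([0., 0, 0, 1., 2., 3., 4., 3, 2, 1, 0, 0, 0])
B = np.array([0., 0, 0, 0, 0, 1, 2., 3., 4, 3, 2, 1, 0])
cc = np.correlate(A / A.sum(), B / B.sum(), 'full')
print(cc.max())   # stays below 1, as the bound above guarantees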

I recommend you don't take the absolute value of a sequence. That changes the shape, not just the scale. For instance np.abs([1, -2]) == [1, 2]. Normalizing will already ensure that the sequence is mostly positive and adds up to 1.

Second Edit:

I had a realization. Think of the signals as vectors. Normalized vectors always have a maximal dot product with themselves. Cross-correlation is just a dot product calculated at various shifts. If you normalize the signals like you would a vector (divide s by sqrt(s dot s)), the self correlations will always be maximal and equal to 1.
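In other words (my restatement, via the Cauchy-Schwarz inequality):

$$\Big|\sum_i f[i]\,g[i+n]\Big| \;\le\; \|f\|\,\|g\| \;=\; 1 \qquad \text{when } \|f\| = \|g\| = 1,$$

with equality at zero shift when a signal is correlated with itself.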

import numpy as np

def normalize(s):
    magSquared = np.correlate(s, s) # s dot itself
    return s / np.sqrt(magSquared)

a = np.array([0., 0, 0, 1., 2., 3., 4., 3, 2, 1, 0, 0, 0])
b = np.array([0., 0, 0, 0, 0, 1, 2., 3., 4, 3, 2, 1, 0])
c = np.array([0., 0, 0, 1, 1.5, 2, 1.5, 1, 0, 0, 0, 0, 0])
d = np.array([0., 0, 0, 0, 0, -2, -4, -2, 0, 0, 0, 0, 0])

a = normalize(a)
b = normalize(b)
c = normalize(c)
d = normalize(d)

M = np.zeros([4,4])
SH = np.zeros([4,4])
for c1,w1 in enumerate([a,b,c,d]):
    for c2,w2 in enumerate([a,b,c,d]):
        # Taking the absolute value catches signals which are flipped.
        crossCorrelation = np.abs(np.correlate(w1, w2, 'full'))
        bestShift = np.argmax(crossCorrelation)

        # This reverses the effect of the padding.
        actualShift = bestShift - len(w2) + 1
        similarity = crossCorrelation[bestShift]

        M[c1,c2] = similarity
        SH[c1,c2] = actualShift
print(M, '\n', SH)

Outputs:

[[ 1.          1.          0.97700842  0.86164044]
[ 1.          1.          0.97700842  0.86164044]
[ 0.97700842  0.97700842  1.          0.8819171 ]
[ 0.86164044  0.86164044  0.8819171   1.        ]]
[[ 0. -2.  1.  0.]
[ 2.  0.  3.  2.]
[-1. -3.  0. -1.]
[ 0. -2.  1.  0.]]

You want to use a cross-correlation between the vectors:

For example:

>>> np.correlate(A,B)
array([ 31.])

>>> np.correlate(A,C)
array([ 19.])

>>> np.correlate(A,D)
array([-28.])

If you don't care about the sign, you can simply take the absolute value ...
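For instance (my own illustration of what that would look like):

>>> np.abs(np.correlate(A,D))
array([ 28.])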
