Measuring the similarity between two irregular plots

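I have two irregular lines, each given as a list of [x, y] coordinates, with peaks and troughs. The lists may differ slightly in length. I want to measure their similarity, so that I can check whether the peaks and troughs (of similar height or depth) occur at comparable intervals, and get a similarity measure out of it. I want to do this in Python. Is there a built-in function for this?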



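I can give you a list of functions in the Python ecosystem that you could use. This is by no means a complete list, and there are probably many methods I am not aware of.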


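scipy:

  1. Directed Hausdorff distance

similaritymeasures:

  1. Discrete Fréchet distance
  2. Dynamic Time Warping (DTW)
  3. Partial Curve Mapping (PCM)
  4. Curve length distance metric (uses the arc length distance from beginning to end)
  5. Area between two curves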
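
One detail about the Hausdorff entry is worth spelling out: the directed Hausdorff distance is not symmetric, so comparing P against Q can give a different value than comparing Q against P. Below is a minimal pure-Python sketch of the definition. It is not SciPy's implementation, and the function names are made up for illustration. The symmetric version is taken as the larger of the two directed values.

```python
import math

def directed_hausdorff_sketch(P, Q):
    """Largest distance from any point in P to its nearest point in Q."""
    return max(min(math.dist(p, q) for q in Q) for p in P)

def hausdorff_sketch(P, Q):
    """Symmetric Hausdorff distance: the larger of the two directed values."""
    return max(directed_hausdorff_sketch(P, Q), directed_hausdorff_sketch(Q, P))

P = [(0.0, 0.0), (1.0, 0.0)]
Q = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]

d_pq = directed_hausdorff_sketch(P, Q)  # every point of P also appears in Q
d_qp = directed_hausdorff_sketch(Q, P)  # (1, 3) is far from every point of P
d_sym = hausdorff_sketch(P, Q)
print(d_pq, d_qp, d_sym)
```

If the order of the two curves should not matter for your comparison, the symmetric value is probably the one you want.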



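First, let's assume we have two sets of exactly the same random X, Y data. Note that all of these methods will then return zero. If you don't have it, you can install similaritymeasures from pip (pip install similaritymeasures).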

import numpy as np
from scipy.spatial.distance import directed_hausdorff
import similaritymeasures
import matplotlib.pyplot as plt

# Generate random experimental data
x = np.random.random(100)
y = np.random.random(100)
P = np.array([x, y]).T

# Generate an exact copy of P, Q, which we will use to compare
Q = P.copy()

dh, ind1, ind2 = directed_hausdorff(P, Q)
df = similaritymeasures.frechet_dist(P, Q)
dtw, d = similaritymeasures.dtw(P, Q)
pcm = similaritymeasures.pcm(P, Q)
area = similaritymeasures.area_between_two_curves(P, Q)
cl = similaritymeasures.curve_length_measure(P, Q)

# all methods will return 0.0 when P and Q are the same
print(dh, df, dtw, pcm, cl, area)

Now suppose P and Q are different:

# Generate random experimental data
x = np.random.random(100)
y = np.random.random(100)
P = np.array([x, y]).T

# Generate random Q
x = np.random.random(100)
y = np.random.random(100)
Q = np.array([x, y]).T

dh, ind1, ind2 = directed_hausdorff(P, Q)
df = similaritymeasures.frechet_dist(P, Q)
dtw, d = similaritymeasures.dtw(P, Q)
pcm = similaritymeasures.pcm(P, Q)
area = similaritymeasures.area_between_two_curves(P, Q)
cl = similaritymeasures.curve_length_measure(P, Q)

# P and Q are different now, so the measures are no longer all 0.0
print(dh, df, dtw, pcm, cl, area)


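Now you have a number of methods for comparing the two curves. I would start with DTW, since it has been used in many time series applications that look like the data you uploaded. You can plot P and Q to see what they look like: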


plt.plot(P[:, 0], P[:, 1])
plt.plot(Q[:, 0], Q[:, 1])
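
To make the DTW suggestion above more concrete, here is a minimal pure-Python sketch of the classic dynamic-programming recurrence, using Euclidean distances between points. It is an illustration under those assumptions, not the code that `similaritymeasures.dtw` runs, and `dtw_sketch` is a hypothetical name. Note that it accepts curves with different numbers of points.

```python
import math

def dtw_sketch(P, Q):
    """Accumulated cost of the cheapest monotonic alignment between P and Q."""
    n, m = len(P), len(Q)
    inf = float("inf")
    # cost[i][j] = best accumulated cost of aligning P[:i] with Q[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(P[i - 1], Q[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # P advances, Q reuses a point
                                 cost[i][j - 1],      # Q advances, P reuses a point
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

P = [(0, 0), (1, 1), (2, 0)]
Q = [(0, 0), (1, 1), (1, 1), (2, 0)]  # same shape, one repeated sample
R = [(0, 0), (1, 2), (2, 0)]          # same positions, taller peak

dtw_same = dtw_sketch(P, Q)
dtw_diff = dtw_sketch(P, R)
print(dtw_same, dtw_diff)
```

Because one point may be matched to several points on the other curve, the repeated sample in Q costs nothing, while the taller peak in R shows up as a non-zero distance.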


I'm not aware of a built-in function, but it sounds like you could modify the Levenshtein distance. The following code is adapted from the Wikibooks code.

def point_distance(p1, p2):
    # Define the distance here; if the points are the same, it should be 0.
    # Placeholder (an assumption, not a recommendation): an exact match costs 0,
    # anything else costs 1. Replace it with a measure that suits your data.
    return 0 if tuple(p1) == tuple(p2) else 1

def levenshtein_point(l1, l2):
    # Make sure l1 is the longer list
    if len(l1) < len(l2):
        return levenshtein_point(l2, l1)

    # len(l1) >= len(l2)
    if len(l2) == 0:
        return len(l1)

    previous_row = list(range(len(l2) + 1))
    for i, p1 in enumerate(l1):
        current_row = [i + 1]
        for j, p2 in enumerate(l2):
            # j + 1 instead of j, since previous_row and current_row
            # are one element longer than l2
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + point_distance(p1, p2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]
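
As a sketch of how this idea might be used, the block below plugs in one possible `point_distance`: a cost of 0 when two points lie within a tolerance of each other and 1 otherwise. The `tol` parameter, its value, and the compact restatement of the edit-distance loop are all assumptions for illustration.

```python
import math

def point_distance(p1, p2, tol=0.1):
    """Substitution cost: 0 if the points are within tol of each other, else 1."""
    return 0 if math.dist(p1, p2) <= tol else 1

def levenshtein_point(l1, l2):
    """Edit distance between two point lists (insertions and deletions cost 1)."""
    previous_row = list(range(len(l2) + 1))
    for i, p1 in enumerate(l1):
        current_row = [i + 1]
        for j, p2 in enumerate(l2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + point_distance(p1, p2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]

a = [(0, 0.0), (1, 1.0), (2, 0.0), (3, 1.0)]
b = [(0, 0.05), (1, 1.02), (2, 0.0)]          # close to a, but one point shorter
c = [(0, 0.0), (1, 5.0), (2, 0.0), (3, 1.0)]  # one peak much higher than in a

dist_aa = levenshtein_point(a, a)
dist_ab = levenshtein_point(a, b)
dist_ac = levenshtein_point(a, c)
print(dist_aa, dist_ab, dist_ac)
```

With this choice the result counts how many points must be inserted, deleted, or replaced to turn one curve into the other, so the tolerance decides how strictly peak heights have to agree.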

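Since the arrays are not the same size (and I am assuming they cover the same real time), you need to interpolate them so that you compare related sets of points. The code below does this and computes correlation measures: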

import numpy as np
from scipy.interpolate import interp1d
import matplotlib.pyplot as plt
import scipy.spatial.distance as ssd 
import scipy.stats as ss

x = np.linspace(0, 10, num=11)
x2 = np.linspace(1, 11, num=13)

y = 2*np.cos( x) + 4 + np.random.random(len(x))
y2 = 2* np.cos(x2) + 5 + np.random.random(len(x2))

# Interpolating now, using linear, but you can do better based on your data
f = interp1d(x, y)
f2 = interp1d(x2,y2)

points = 15

xnew = np.linspace ( min(x), max(x), num = points) 
xnew2 = np.linspace ( min(x2), max(x2), num = points) 

ynew = f(xnew) 
ynew2 = f2(xnew2) 
plt.plot(x,y, 'r', x2, y2, 'g', xnew, ynew, 'r--', xnew2, ynew2, 'g--')

# Now compute correlations
print(ssd.correlation(ynew, ynew2))  # Distance measure based on the correlation between the two vectors
print(np.correlate(ynew, ynew2, mode='valid'))  # Cross-correlation of the two same-sized arrays
print(np.corrcoef(ynew, ynew2))  # Correlation matrix for the two arrays

print(ss.spearmanr(ynew, ynew2))  # Spearman correlation for the two arrays




[ 363.48984942]

[[ 1.          0.50097173]
 [ 0.50097173  1.        ]]

SpearmanrResult(correlation=0.45357142857142857, pvalue=0.089485900143027278)
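
One thing to be aware of in the interpolation step: each curve is resampled over its own x range, so `ynew[i]` and `ynew2[i]` belong to different x values. If the x values are measurements on a shared axis, one alternative is to resample both curves at the same x values inside the range where they overlap. The following is a minimal pure-Python sketch of that idea. The helper names are hypothetical, and the linear interpolation is written out by hand to keep the example self-contained (NumPy's `np.interp` could play the same role).

```python
from bisect import bisect_right

def interp_linear(x_new, xs, ys):
    """Linearly interpolate ys (sampled at increasing xs) at x_new."""
    # Index of the segment [xs[k], xs[k + 1]] that contains x_new
    k = min(max(bisect_right(xs, x_new) - 1, 0), len(xs) - 2)
    t = (x_new - xs[k]) / (xs[k + 1] - xs[k])
    return ys[k] + t * (ys[k + 1] - ys[k])

def resample_on_overlap(x1, y1, x2, y2, points=15):
    """Sample both curves at the same x values inside their shared x range."""
    lo, hi = max(x1[0], x2[0]), min(x1[-1], x2[-1])
    grid = [lo + (hi - lo) * i / (points - 1) for i in range(points)]
    return (grid,
            [interp_linear(g, x1, y1) for g in grid],
            [interp_linear(g, x2, y2) for g in grid])

# Two samplings of the same line y = 2x, on shifted grids of different lengths
x1 = [0.0, 1.0, 2.0, 3.0, 4.0]
y1 = [2 * v for v in x1]
x2 = [1.0, 1.5, 2.5, 3.5, 4.5, 5.0]
y2 = [2 * v for v in x2]

grid, y1_new, y2_new = resample_on_overlap(x1, y1, x2, y2, points=7)
print(grid[0], grid[-1])
```

After this step both resampled lists have the same length and refer to the same x positions, so they can be passed to the correlation functions used above.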

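Keep in mind that the first correlations here are parametric, Pearson-type measures, which capture linear relationships between the two arrays. If that does not fit your data, and you only expect the arrays to rise and fall together, you can use Spearman's rank correlation, as in the last example, which only assumes a monotonic relationship.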
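
To illustrate the difference, here is a small pure-Python sketch that computes both coefficients by hand on data that is monotonic but far from linear. The helper names are made up for this example, and the ranking helper assumes there are no tied values.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Rank of each value (0 = smallest); assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = list(range(10))
y = [math.exp(v) for v in x]  # always increasing with x, but not linear

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(r_pearson, r_spearman)
```

The Spearman value comes out at essentially 1 because the ordering of the two sequences agrees perfectly, while the Pearson value is noticeably lower because the relationship is not a straight line.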


