Fast algorithm to search for a pattern within a text file

Question

I have an array of doubles, roughly 200,000 rows by 100 columns, and I'm looking for a fast algorithm to find the rows that contain sequences most similar to a given pattern (the pattern can be anywhere from 10 to 100 elements long). I'm using Python, so the brute-force method (code below: looping over each row and starting column index, and computing the Euclidean distance at each point) takes around three minutes.

The numpy.correlate function promises to solve this problem much faster (running over the same dataset in less than 20 seconds). However, it simply computes a sliding dot product of the pattern over the full row, meaning that to compare similarity I'd have to normalize the results first. Normalizing the cross-correlation requires computing the standard deviation of each slice of the data, which instantly negates the speed improvement of using numpy.correlate in the first place.

Is it possible to compute normalized cross-correlation quickly in Python? Or will I have to resort to coding the brute-force method in C?

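import numpy as np

def norm_corr(x, y, mode='valid'):
    # Euclidean distance between y and every len(y)-long slice of x
    # (mode is accepted but not used)
    ya = np.array(y)
    slices = [x[pos:pos+len(y)] for pos in range(len(x)-len(y)+1)]
    return [np.linalg.norm(np.array(z)-ya) for z in slices]

# arraytable holds the 2D data, pointarray the pattern to look for
similarities = [norm_corr(arr, pointarray) for arr in arraytable]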

Answer 1

If your data is in a 2D NumPy array, you can take a 2D slice from it (200000 rows by len(pattern) columns) and compute the norm for all the rows at once. Then slide the window to the right in a for loop, so the Python-level loop runs only over the window start columns rather than over every row.

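import numpy as np

ROWS = 200000
COLS = 100
PATLEN = 20
# random data for example's sake
a = np.random.rand(ROWS, COLS)
pattern = np.random.rand(PATLEN)

# one output column per window start; the last valid start is COLS-PATLEN
nwin = COLS - PATLEN + 1
tmp = np.empty([ROWS, nwin])
for i in range(nwin):
    window = a[:, i:i+PATLEN]   # all rows, PATLEN columns
    tmp[:, i] = np.sum((window-pattern)**2, axis=1)

# Euclidean distance for every row and every window start
result = np.sqrt(tmp)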
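
The loop above gives Euclidean distances. For the normalized cross-correlation the question asks about, the same whole-array windowing idea may carry over. The following is only a sketch, not something from the original post: sliding_ncc is a hypothetical helper name, it assumes NumPy 1.20 or later (for sliding_window_view), and it gets the per-window standard deviations from cumulative sums rather than from a per-slice computation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20


def sliding_ncc(data, pattern):
    """Normalized cross-correlation of `pattern` with every window of every row.

    Hypothetical helper, not from the original post. Returns an array of
    shape (rows, cols - len(pattern) + 1).
    """
    data = np.asarray(data, dtype=float)
    pattern = np.asarray(pattern, dtype=float)
    m = pattern.size

    # Read-only view of shape (rows, n_windows, m); no data is copied here.
    windows = sliding_window_view(data, m, axis=1)

    # Centering the pattern makes sum(pc) == 0, so sum(w * pc) already equals
    # sum((w - mean(w)) * pc) and the windows need no explicit centering.
    pc = pattern - pattern.mean()
    numerator = np.einsum('rwm,m->rw', windows, pc)

    # Per-window sums and sums of squares, taken from cumulative sums.
    def window_sums(values):
        csum = np.cumsum(values, axis=1)
        csum = np.concatenate([np.zeros((values.shape[0], 1)), csum], axis=1)
        return csum[:, m:] - csum[:, :-m]

    s1 = window_sums(data)
    s2 = window_sums(data ** 2)
    # m times the variance of each window; clip tiny negatives from rounding.
    ss = np.clip(s2 - s1 ** 2 / m, 0.0, None)

    denominator = np.sqrt(ss) * np.linalg.norm(pc)
    out = np.zeros_like(numerator)
    # Zero-variance windows (or a constant pattern) are left at 0.
    np.divide(numerator, denominator, out=out, where=denominator > 0)
    return out


# Small made-up data set with one planted match in row 7, columns 30..49.
rng = np.random.default_rng(0)
a = rng.random((1000, 100))
pattern = rng.random(20)
a[7, 30:50] = 2.0 * pattern + 3.0   # scaled and shifted copy of the pattern

ncc = sliding_ncc(a, pattern)
best_per_row = ncc.max(axis=1)               # best window score in each row
top_rows = np.argsort(best_per_row)[::-1][:5]  # rows ranked by best score
```

Because the score is normalized, the planted window matches even though it is a scaled and shifted copy of the pattern, which plain Euclidean distance would penalize. The cumulative-sum variance can lose precision when values are large relative to their spread, so it may be worth checking a few windows against np.corrcoef on real data.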

Answer 1
1 accepted 2012-02-06 20:47:48
