
Fast imperfect match algorithm against multiple targets

Original post: 2016-11-23 01:53:23 3 1 algorithm

Let's say I have a set S whose elements are N-tuples, i.e. (xi1, xi2, ... , xin).

Given elements x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn), matches(x,y,M) is true if and only if at least M elements of x and y are equal.

Now, given a set S, matchSet(x,S,M) returns the elements y of S for which matches(x,y,M) is true.
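To make the definitions concrete, here is a minimal brute-force sketch in Python. It assumes that "equal elements" means equal values at the same position, which the question does not state explicitly, and the snake_case names are just illustrative spellings of matches, matchSet and matchManySet. This is the linear-time baseline that the question wants to beat, not a solution to it.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Elem = Tuple[Hashable, ...]


def matches(x: Elem, y: Elem, m: int) -> bool:
    """True if x and y agree in at least m positions.

    Assumption: elements are compared position by position.
    """
    if len(x) != len(y):
        raise ValueError("tuples must have the same length")
    return sum(a == b for a, b in zip(x, y)) >= m


def match_set(x: Elem, s: Iterable[Elem], m: int) -> List[Elem]:
    """All y in s with matches(x, y, m); one full scan of s per query."""
    return [y for y in s if matches(x, y, m)]


def match_many_set(s_prime: Iterable[Elem], s: Iterable[Elem], m: int) -> Dict[Elem, List[Elem]]:
    """Runs match_set for every x in s_prime; costs |s_prime| * |s| comparisons."""
    targets = list(s)  # materialize once so s can be any iterable
    return {x: match_set(x, targets, m) for x in s_prime}


if __name__ == "__main__":
    S = [(1, 2, 3, 4), (1, 2, 0, 0), (9, 9, 9, 9)]
    print(match_set((1, 2, 3, 5), S, 3))
```

Each call to match_set touches every element of S, so the cost per query is proportional to the size of S times the tuple length.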

Assuming that S has data such that matchSet will on average match only 0 or 1 elements (it will occasionally match more, but rarely), is there a way to write matchSet and structure S so that its running time is sublinear in the size of S, and its space is reasonable (i.e. not putting 2^L indexes on S, where L is the length of the elements)?

Alternatively, a fast-running matchManySet(S', S, M) would also be acceptable, which runs matchSet for every element of S', as long as it takes significantly less time than the size of S times the size of S'.

1 Solution

This task sounds very interesting to me. I have an idea, but unfortunately someone else would have to test it (I have no time to implement it). The data structure for storing such tuples reminds me of a suffix tree. (For more information see https://en.wikipedia.org/wiki/Suffix_tree ).

For example, you can store set Sx in one suffix tree and Sy in another. In that case, your task boils down to building a result suffix tree by merging the two given trees (of course, during the merge you should use a specific predicate, such as whether the tuples have K occurrences). The overall algorithm complexity will be O(N + M), where N and M are the sizes of the input suffix trees.
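Since the suffix-tree idea is untested, here is a sketch of a different and simpler technique for comparison: a pigeonhole block index. It is not the merge described in this answer. It assumes position-wise matching and a fixed M known when the index is built, and the class name BlockIndex is hypothetical. If two tuples of length N agree in at least M positions, they differ in at most K = N - M positions, so if the positions are split into K + 1 disjoint blocks, at least one block must agree exactly. Hashing each block gives candidate lists that are then verified.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Tuple

Elem = Tuple[Hashable, ...]


class BlockIndex:
    """Pigeonhole index over n-tuples for a fixed threshold m (illustrative sketch)."""

    def __init__(self, tuples: Iterable[Elem], n: int, m: int) -> None:
        if not 1 <= m <= n:
            raise ValueError("need 1 <= m <= n")
        self.n, self.m = n, m
        num_blocks = n - m + 1  # K + 1 blocks, where K = allowed mismatches
        # Contiguous, disjoint, non-empty position ranges covering 0..n-1.
        self.blocks = [
            range(i * n // num_blocks, (i + 1) * n // num_blocks)
            for i in range(num_blocks)
        ]
        # One hash table per block: block contents -> tuples with those contents.
        self.tables = [defaultdict(list) for _ in self.blocks]
        for y in tuples:
            if len(y) != n:
                raise ValueError("all tuples must have length n")
            for table, block in zip(self.tables, self.blocks):
                table[tuple(y[i] for i in block)].append(y)

    def match_set(self, x: Elem) -> List[Elem]:
        """Tuples in the index that agree with x in at least m positions."""
        seen = set()
        result = []
        for table, block in zip(self.tables, self.blocks):
            key = tuple(x[i] for i in block)
            for y in table.get(key, ()):  # candidates sharing this whole block
                if y in seen:
                    continue
                seen.add(y)
                # Verify the candidate against the real predicate.
                if sum(a == b for a, b in zip(x, y)) >= self.m:
                    result.append(y)
        return result


if __name__ == "__main__":
    S = [(1, 2, 3, 4, 5, 6), (1, 2, 3, 0, 0, 0), (7, 8, 9, 10, 11, 12)]
    index = BlockIndex(S, n=6, m=4)
    print(index.match_set((1, 2, 3, 4, 0, 0)))
```

The index stores each tuple N - M + 1 times, so space grows linearly with S rather than with 2^L. A query only inspects tuples that share a whole block with x, so it is sublinear only when blocks are selective, that is, when M is close to N and the values are varied enough that buckets stay small. With small M the blocks shrink toward single positions and the candidate lists can approach all of S.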