
Synchronous pattern matching algorithm for multiple concatenated strings

Original: 2012-10-09 08:06:29 · Tags: c++, string, algorithm

For a course assignment, I must implement a class that looks for a pattern in a sequence of chars that it receives in chronological order. Each character the class receives has a particular source (a planet, identified by an int ID).

We have to implement the data structures ourselves, so I implemented a String List in which I store all these characters in chronological order.

The problem is that the pattern must be matched against characters coming from the same planet (source), so pattern matching must be done per source.

I tried well-known pattern matching algorithms such as Rabin-Karp by walking the whole list while only taking into account the source currently being searched, and then repeating this for every source, but the performance is really poor, even worse than a naive (but synchronous) solution.

Do you have any idea which algorithm would be more efficient in this case? (One that lets me use each character as I browse it, even if this implies storing the current "search state" of each source somewhere, as we did for the naive implementation.)

PS: The IDs are finite (from 1 to 128), but the number of chars can go up to 10⁷.

EDIT: Here are some details that will hopefully clarify things.

IntlFinder, my class, can receive characters (or arrays of characters) through a method Add(char* pszData, int nSource); hence, each character is coupled with a source ID. Each pair (character, source) is stored in a StringList ComList (in chronological order of addition).

For the pattern to count as present in my class, it must be present within THE SAME SOURCE.

Example:

If I'm looking for the pattern SAYKOUK:

(S, 1); (A, 1); (Y, 1); (K, 1); (Z, 2); (S, 3); (O, 1); (U, 1); (K, 1) is OK!

(S, 1); (A, 1); (Y, 1); (K, 2); (O, 3); (U, 1); (K, 4) is not OK.

This is problematic because if I only consider one source at a time (ranging from 1 to 128) and browse the whole list each time, my pattern searching method is REALLY slow. And with none of these algorithms can I manage to take the characters of the different sources into account and know as soon as the pattern has been matched for any of them!

2 answers

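The solution is to store a separate list of characters for each source, and then look for the pattern in each of those lists separately.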
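As a rough illustration of that idea, here is a minimal sketch. It assumes source IDs from 1 to 128 as stated in the question, and it uses std::string as the per-source container purely for brevity (the assignment requires a hand-written structure, so treat that as a stand-in). The class name PerSourceBuffers and its methods are hypothetical and are not part of the asker's IntlFinder.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper, not the asker's IntlFinder: one character buffer per source ID.
class PerSourceBuffers {
public:
    static constexpr int kMaxSource = 128;  // IDs run from 1 to 128 per the question

    // Append one character to the buffer of its source; out-of-range IDs are ignored.
    void add(char c, int source) {
        if (source >= 1 && source <= kMaxSource) buffers_[source].push_back(c);
    }

    // Return the first source ID whose buffer contains the pattern, or 0 if none does.
    int findSource(const std::string& pattern) const {
        for (int s = 1; s <= kMaxSource; ++s) {
            if (buffers_[s].find(pattern) != std::string::npos) return s;
        }
        return 0;
    }

private:
    std::array<std::string, kMaxSource + 1> buffers_;  // index 0 is unused
};
```

Each character is stored once, and each search only touches the characters of one source instead of re-walking the whole interleaved list once per source. The chronological order across sources is lost in this sketch, so it would have to be kept separately if it is still needed.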

I ended up using a linked list with the classical "next" and "previous" pointers plus "nextSource" and "previousSource" pointers that point to the neighboring characters of the same source. That way, I was able to use classical pattern-matching algorithms.
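For readers who want to picture that structure, the following is only a sketch under assumptions: the four pointer names come from the answer, but the Node layout, the ThreadedList class, and the per-source head/tail arrays are hypothetical. A naive scan along the nextSource chain stands in for whichever classical algorithm was actually used.

```cpp
#include <array>
#include <cstddef>
#include <string>

// One received character. Pointer names follow the answer; the layout is assumed.
struct Node {
    char ch;
    int source;
    Node* next = nullptr;            // next character in arrival order
    Node* previous = nullptr;        // previous character in arrival order
    Node* nextSource = nullptr;      // next character from the same source
    Node* previousSource = nullptr;  // previous character from the same source
    Node(char c, int s) : ch(c), source(s) {}
};

// Hypothetical container: one chronological list threaded with per-source chains.
class ThreadedList {
public:
    static constexpr int kMaxSource = 128;  // IDs run from 1 to 128 per the question

    ThreadedList() = default;
    ThreadedList(const ThreadedList&) = delete;
    ThreadedList& operator=(const ThreadedList&) = delete;
    ~ThreadedList() {
        while (head_) { Node* n = head_->next; delete head_; head_ = n; }
    }

    // Append in arrival order and link into the chain of its source, both in O(1).
    void add(char c, int source) {
        if (source < 1 || source > kMaxSource) return;
        Node* n = new Node(c, source);
        n->previous = tail_;
        if (tail_) tail_->next = n; else head_ = n;
        tail_ = n;
        n->previousSource = lastOf_[source];
        if (lastOf_[source]) lastOf_[source]->nextSource = n; else firstOf_[source] = n;
        lastOf_[source] = n;
    }

    // Naive search that walks only the characters of one source.
    bool contains(const std::string& pattern, int source) const {
        if (source < 1 || source > kMaxSource) return false;
        if (pattern.empty()) return true;
        for (const Node* start = firstOf_[source]; start; start = start->nextSource) {
            const Node* n = start;
            std::size_t i = 0;
            while (n && i < pattern.size() && n->ch == pattern[i]) {
                n = n->nextSource;
                ++i;
            }
            if (i == pattern.size()) return true;
        }
        return false;
    }

private:
    Node* head_ = nullptr;
    Node* tail_ = nullptr;
    std::array<Node*, kMaxSource + 1> firstOf_{};  // first node of each source, nullptr if none
    std::array<Node*, kMaxSource + 1> lastOf_{};   // last node of each source, nullptr if none
};
```

Because each source's chain looks like an ordinary sequence, the inner loop could be swapped for Rabin-Karp or another single-string algorithm without touching the rest of the list.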