
Approximate string matching for hierarchical strings composed of words and separators

Original post: 2016-02-11 08:31:13. Tags: string, algorithm

I'm looking for a data structure that supports matching strings against a set of patterns, where the strings represent MQTT topics. The strings are composed of words ("topic levels") separated by a slash character. Examples are "topic1/topic2" or "//topic1/topic2", which contains empty topic levels. The character set is UTF-8 excluding '#' and '+'.

Patterns are topic strings that can additionally contain two wildcards. The first wildcard, "#", can only be used at the end of a pattern and matches an arbitrary number of following topic levels, i.e. "a/#" matches any string that has "a/" as a prefix. The second wildcard, "+", matches a single arbitrary topic level. For example, "sport/tennis/+" matches "sport/tennis/player1" and "sport/tennis/player2", but not "sport/tennis/player1/ranking". Also, because the single-level wildcard matches exactly one level, "sport/+" does not match "sport", but it does match "sport/".
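To make the wildcard rules above concrete, here is a minimal sketch of a matcher for a single pattern against a single topic. The function name `topic_matches` is made up for illustration. It follows the wording above, where "a/#" requires "a/" as a prefix; note that the MQTT specification additionally lets "a/#" match the parent level "a", which this sketch does not do.

```python
def topic_matches(pattern: str, topic: str) -> bool:
    """Return True if `topic` matches `pattern` under the rules described above."""
    p_levels = pattern.split("/")
    t_levels = topic.split("/")
    for i, p in enumerate(p_levels):
        if p == "#":
            # "#" is only valid as the last level; it matches every remaining
            # level, and (assumption from the question) needs at least one.
            return i == len(p_levels) - 1 and len(t_levels) > i
        if i >= len(t_levels):
            return False  # the pattern has more levels than the topic
        if p != "+" and p != t_levels[i]:
            return False  # literal level differs
    # No "#" seen: pattern and topic must have the same number of levels.
    return len(p_levels) == len(t_levels)
```

Checking every registered pattern this way is linear in the number of patterns per published message, which is the cost the data structure asked about is meant to avoid.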

The use case is that clients register for topics of interest by providing a pattern. When a message is sent, it is published with a topic string. That string has to be matched against the registered subscribers, so I am looking for a data structure that efficiently (in terms of space and time) selects the subscribers whose registered patterns match the published topic.

I was thinking about using a suffix tree or a trie, because this would allow fast prefix matches when "#" is used. Each node in the trie would contain the subscribers for its string, plus the set of all subscribers of its sub-strings. This should allow quick look-ups for exact and prefix queries, but I don't know whether it can support the "+" wildcard.

Another approach I am thinking of is to create a directed graph in which each node contains one topic level and there is an edge topic1 -> topic2 if some pattern contains the sub-string "topic1/topic2". With this graph, I could traverse the nodes level by level. A "+" wildcard would simply mean traversing to all children.

An obvious alternative is regular expressions, which would result in a finite-state machine that is probably similar to the graph approach. However, I was hoping to find something faster.
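For comparison, the regular-expression alternative could look roughly like the sketch below, which translates one pattern into a compiled Python regex. The helper name `pattern_to_regex` is hypothetical, and the translation again assumes that "a/#" requires the prefix "a/".

```python
import re


def pattern_to_regex(pattern):
    """Translate a topic pattern into a compiled regex (use with fullmatch)."""
    levels = pattern.split("/")
    parts = []
    for i, level in enumerate(levels):
        if level == "#" and i == len(levels) - 1:
            parts.append(".*")      # trailing "#": any remaining levels
        elif level == "+":
            parts.append("[^/]*")   # "+": exactly one level, possibly empty
        else:
            parts.append(re.escape(level))  # literal topic level
    # DOTALL so that ".*" also spans newline characters inside a topic.
    return re.compile("/".join(parts), re.DOTALL)
```

A usage example would be `pattern_to_regex("sport/+").fullmatch("sport/")`. With one regex per subscription, every published topic still has to be tried against every pattern unless the expressions are combined into a single automaton.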

The algorithm is meant to be used in an MQTT broker where subscribers can register and deregister topics at any time, so it must also support updating the search data structure by adding or removing patterns.
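One way to combine these requirements is a trie keyed by whole topic levels rather than characters, with "+" stored as an ordinary child key and "#" subscribers kept in a separate set on the parent node. The following is only a sketch under those assumptions, not a reference implementation; the class name `TopicTrie` and its method names are invented for illustration, and "a/#" is again taken to require the prefix "a/".

```python
class TopicTrie:
    """Trie over topic levels; supports subscribe, unsubscribe, and match."""

    def __init__(self):
        self.children = {}             # topic level or "+" -> TopicTrie
        self.subscribers = set()       # patterns that end exactly at this node
        self.hash_subscribers = set()  # patterns of the form "<this node>/#"

    def subscribe(self, pattern, subscriber):
        node = self
        for level in pattern.split("/"):
            if level == "#":
                node.hash_subscribers.add(subscriber)
                return
            node = node.children.setdefault(level, TopicTrie())
        node.subscribers.add(subscriber)

    def unsubscribe(self, pattern, subscriber):
        levels = pattern.split("/")
        ends_with_hash = levels[-1] == "#"
        if ends_with_hash:
            levels = levels[:-1]
        path = [self]  # nodes along the pattern, root first
        for level in levels:
            child = path[-1].children.get(level)
            if child is None:
                return  # pattern was never registered
            path.append(child)
        last = path[-1]
        (last.hash_subscribers if ends_with_hash else last.subscribers).discard(subscriber)
        # Prune nodes that became empty, from the leaf towards the root.
        for depth in range(len(levels), 0, -1):
            node = path[depth]
            if node.children or node.subscribers or node.hash_subscribers:
                break
            del path[depth - 1].children[levels[depth - 1]]

    def match(self, topic):
        found = set()
        frontier = [self]
        for level in topic.split("/"):
            next_frontier = []
            for node in frontier:
                # A "#" below this node covers this level and everything deeper.
                found |= node.hash_subscribers
                for key in (level, "+"):
                    child = node.children.get(key)
                    if child is not None:
                        next_frontier.append(child)
            frontier = next_frontier
        for node in frontier:
            found |= node.subscribers
        return found
```

In this sketch a lookup follows at most two children per node per level (the literal level and "+"), so the work depends on the number of levels in the topic and the number of "+" branches rather than on the total number of registered patterns.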

1 Answer

The Aho-Corasick finite-state machine supports wildcards. You can also reverse a trie and search for wildcards: http://phpir.com/tries-and-wildcards/