
Approximate string matching for hierarchical strings composed of words and separators

I'm looking for a data structure that supports matching strings against a set of patterns, where the strings represent MQTT topics. The strings are defined to be composed of words ("topic levels") separated by a slash character. Examples would be "topic1/topic2" or "//topic1/topic2", the latter containing empty topic levels. The character set is UTF-8 excluding '#' and '+'.

Patterns are topic strings that can additionally contain two wildcards. The first wildcard, "#", can only be used at the end of a pattern and matches an arbitrary number of following topic levels, i.e. "a/#" matches any string that has "a/" as a prefix. The second wildcard, "+", matches a single arbitrary topic level. For example, "sport/tennis/+" matches "sport/tennis/player1" and "sport/tennis/player2", but not "sport/tennis/player1/ranking". Also, because the single-level wildcard matches only a single level, "sport/+" does not match "sport" but it does match "sport/".
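To pin down these rules, here is a minimal sketch of a matcher for one pattern against one topic. The function name `matches` is made up for illustration. It follows the prefix rule as stated above; the MQTT specification additionally lets "a/#" match "a" itself, which this sketch deliberately does not do.

```python
def matches(pattern: str, topic: str) -> bool:
    """Return True if `topic` matches `pattern` under the rules stated above."""
    p_levels = pattern.split("/")
    t_levels = topic.split("/")
    for i, p in enumerate(p_levels):
        if p == "#":
            # "#" is only valid as the last level. It matches everything that
            # follows, and something must follow because "a/" has to be a prefix.
            return i == len(p_levels) - 1 and len(t_levels) > i
        if i >= len(t_levels):
            return False  # the topic has fewer levels than the pattern
        if p != "+" and p != t_levels[i]:
            return False  # literal level differs
    # Without a trailing "#", both must have the same number of levels.
    return len(p_levels) == len(t_levels)
```

Checking every registered pattern this way costs time linear in the number of subscriptions per published message, which is the cost the data structure asked for here is meant to avoid.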

The use case is that clients subscribe to topics of interest by providing a pattern. When a message is sent, it is published with a topic string. That string has to be matched against the registered subscribers, so I am looking for a data structure that efficiently (in terms of space and time) selects the subscribers whose registered patterns match the published topic.

I was thinking about using a suffix tree or trie, because this would allow fast prefix matches when "#" is used. Each node in the trie would contain the subscribers for its string, plus a set of all subscribers of sub-strings. This should allow quick look-ups for exact and prefix queries, but I don't know whether it supports the "+" wildcard.
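One way the "+" wildcard could fit into a trie is to key the trie by topic level rather than by character, and to store "+" and "#" as ordinary child keys. A lookup then follows both the literal child and the "+" child at every level. This is only a sketch under that assumption; `Node`, `TopicTrie`, `subscribe` and `match` are hypothetical names, not part of any MQTT library.

```python
class Node:
    def __init__(self):
        self.children = {}        # topic level (or "+" / "#") -> Node
        self.subscribers = set()  # subscribers whose pattern ends at this node


class TopicTrie:
    def __init__(self):
        self.root = Node()

    def subscribe(self, pattern, subscriber):
        node = self.root
        for level in pattern.split("/"):
            node = node.children.setdefault(level, Node())
        node.subscribers.add(subscriber)

    def match(self, topic):
        """Return all subscribers whose pattern matches `topic`."""
        levels = topic.split("/")
        found = set()
        stack = [(self.root, 0)]  # (node, index of the next topic level)
        while stack:
            node, i = stack.pop()
            if i == len(levels):
                found |= node.subscribers  # pattern and topic end together
                continue
            # A "#" child matches all remaining levels; at least one is left here.
            hash_child = node.children.get("#")
            if hash_child is not None:
                found |= hash_child.subscribers
            # Follow the literal child and the single-level wildcard child.
            for key in (levels[i], "+"):
                child = node.children.get(key)
                if child is not None:
                    stack.append((child, i + 1))
        return found
```

For example, with the patterns "sport/tennis/+" and "sport/#" registered, `match("sport/tennis/player1")` returns the subscribers of both. Each visited node contributes at most two branches per level, so the work depends on the depth of the topic and the number of "+" branches, not directly on the total number of subscriptions.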

Another approach I am thinking of is to create a directed graph where each node contains one topic level, with an edge topic1 -> topic2 if some pattern contains the sub-string "topic1/topic2". With this graph, I could traverse the nodes level by level. A "+" wildcard would just mean traversing to all children.

An obvious alternative is regular expressions, which would result in a finite-state machine that is probably similar to the graph approach. However, I was hoping to find something faster.
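For comparison, a rough sketch of what the regular-expression route might look like for a single pattern, using Python's standard `re` module. The helper name `pattern_to_regex` is invented for this illustration, and the translation assumes the matching rules described above.

```python
import re


def pattern_to_regex(pattern: str):
    """Translate one topic pattern into a compiled regular expression."""
    levels = pattern.split("/")
    parts = []
    for i, level in enumerate(levels):
        if level == "+":
            parts.append("[^/]*")  # exactly one level, possibly empty
        elif level == "#" and i == len(levels) - 1:
            parts.append(".*")     # everything after the prefix
        else:
            parts.append(re.escape(level))  # literal topic level
    # DOTALL lets ".*" also cover newline characters inside a topic.
    return re.compile("/".join(parts), re.DOTALL)


# Use fullmatch so the whole topic string has to match the pattern.
rx = pattern_to_regex("sport/tennis/+")
print(bool(rx.fullmatch("sport/tennis/player1")))
print(bool(rx.fullmatch("sport/tennis/player1/ranking")))
```

With one compiled expression per subscription, every published topic still has to be tested against each expression in turn.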

The algorithm is meant for an MQTT broker where subscribers can register and deregister patterns at any time, so it must also support updating the search data structure by adding or removing patterns.
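To illustrate the update requirement, here is a small sketch of adding and removing patterns in a trie keyed by topic level, where removal prunes branches that no longer carry any subscriber. The names `Node`, `add_pattern` and `remove_pattern` are hypothetical, and the level-keyed trie is an assumed layout rather than a given.

```python
class Node:
    def __init__(self):
        self.children = {}        # topic level (or "+" / "#") -> Node
        self.subscribers = set()  # subscribers whose pattern ends at this node


def add_pattern(root, pattern, subscriber):
    node = root
    for level in pattern.split("/"):
        node = node.children.setdefault(level, Node())
    node.subscribers.add(subscriber)


def remove_pattern(root, pattern, subscriber):
    """Remove one registration; return False if it was not present."""
    levels = pattern.split("/")
    path = [root]
    for level in levels:
        child = path[-1].children.get(level)
        if child is None:
            return False  # the pattern was never registered
        path.append(child)
    if subscriber not in path[-1].subscribers:
        return False
    path[-1].subscribers.discard(subscriber)
    # Walk back towards the root and delete nodes that are now empty.
    for depth in range(len(levels), 0, -1):
        node = path[depth]
        if node.subscribers or node.children:
            break  # still needed by another pattern
        del path[depth - 1].children[levels[depth - 1]]
    return True
```

Both operations touch one node per level of the pattern, so registering and deregistering stay cheap regardless of how many other patterns are stored.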

An Aho-Corasick finite-state machine supports wildcards. You can also reverse a trie and search for wildcards: http://phpir.com/tries-and-wildcards/


 