
Using finite automata as keys to a container

I have a problem where I really need to be able to use finite automata as the keys to an associative container. Each key should actually represent an equivalence class of automata, so that when I search, I will find an equivalent automaton (if such a key exists), even if that automaton isn't structurally identical.

An obvious last-resort approach is of course to use linear search with an equivalence test for each key checked. I'm hoping it's possible to do a lot better than this.

I've been thinking in terms of trying to impose an arbitrary but consistent ordering, and deriving an ordered comparison algorithm. First principles involve the sets of strings that the automata represent. Evaluate the set of possible first tokens for each automaton, and apply an ordering based on those two sets. If necessary, continue to the sets of possible second tokens, third tokens etc. The obvious problem with doing this naively is that there's an infinite number of token-sets to check before you can prove equivalence.

I've been considering a few vague ideas - minimising the input automata first and using some kind of closure algorithm, or converting back to a regular grammar, some ideas involving spanning trees. I've come to the conclusion that I need to abandon the set-of-tokens lexical ordering, but the most significant conclusion I've reached so far is that this isn't trivial, and I'm probably better off reading up on someone else's solution.

I've downloaded a paper from CiteSeerX - Total Ordering on Subgroups and Cosets - but my abstract algebra isn't even good enough to know if this is relevant yet.

It also occurred to me that there might be some way to derive a hash from an automaton, but I haven't given this much thought yet.

Can anyone suggest a good paper to read? Or at least let me know whether the one I've downloaded is a red herring?

I believe that you can obtain a canonical form from minimized automata. For any two equivalent automata, their minimized forms are isomorphic (I believe this follows from the Myhill-Nerode theorem). This isomorphism respects edge labels and of course node classes (start, accepting, non-accepting). This makes it easier than unlabeled graph isomorphism.

I think that if you build a spanning tree of the minimized automaton starting from the start state and ordering output edges by their labels, then you'll get a canonical form for the automaton which can then be hashed.

Edit: Non-tree edges should be taken into account too, but they can also be ordered canonically by their labels.
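The label-ordered traversal described above can be sketched in a few lines. This is my own illustration, not code from any cited source, and the automaton representation (a dict of dicts) is an assumption; the point is that states are numbered by first appearance when edges are always taken in label order, so isomorphic automata get matching numberings, and non-tree edges simply refer back to already-assigned numbers.

```python
def canonical_numbering(start, transitions):
    """Assign canonical numbers to the states of a DFA.

    transitions: dict mapping state -> {symbol: next_state}.
    Returns a dict mapping each reachable state to its canonical number,
    determined by order of first discovery in a label-ordered walk.
    """
    numbering = {start: 0}
    stack = [start]
    while stack:
        state = stack.pop()
        # Visit outgoing edges in a fixed (sorted) label order so the
        # numbering depends only on the automaton's structure.
        for symbol in sorted(transitions.get(state, {})):
            target = transitions[state][symbol]
            if target not in numbering:
                numbering[target] = len(numbering)
                stack.append(target)
    return numbering
```

Two automata that differ only in how their states are named come out with the same numbering shape, which is what makes the subsequent hashing step well-defined.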

Here is a thesis from 1992 where they produce canonical minimized automata: Minimization of Nondeterministic Finite Automata

Once you have the canonical form, you can easily hash it, for example by performing a depth-first enumeration of the states and transitions, and hashing a string obtained by encoding the state numbers (counting them in the order of their first appearance) and the transitions as tuples

<from_state, symbol, to_state, is_accepting_final_state>

This should solve the problem.
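A sketch of that hashing step, under the same assumptions as before (dict-of-dicts DFA, my own function names): enumerate states depth-first in label order, emit each transition as a `(from, symbol, to, to_is_accepting)` tuple over canonical state numbers, and hash the encoded result. Sorting the tuples makes the encoding independent of the traversal order itself.

```python
import hashlib

def canonical_hash(start, transitions, accepting):
    """Hash a DFA so that isomorphic automata hash identically.

    transitions: dict mapping state -> {symbol: next_state}.
    accepting: set of accepting states.
    """
    numbering = {start: 0}
    stack = [start]
    tuples = []
    while stack:
        state = stack.pop()
        for symbol in sorted(transitions.get(state, {})):
            target = transitions[state][symbol]
            if target not in numbering:
                # Number states in order of first appearance.
                numbering[target] = len(numbering)
                stack.append(target)
            tuples.append((numbering[state], symbol, numbering[target],
                           target in accepting))
    # Canonical encoding: sorted tuples over canonical state numbers.
    encoded = ";".join(map(str, sorted(tuples)))
    return hashlib.sha256(encoded.encode()).hexdigest()
```

Note this only equates isomorphic automata; for it to equate all *equivalent* automata, the inputs must already be minimized, so that equivalence implies isomorphism.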

When a problem seems insurmountable, the solution is often to publicly announce how difficult you think the problem is. Then, you will immediately realise that the problem is trivial and that you've just made yourself look like an idiot - and that's basically where I am now ;-)

As suggested in the question, to lexically order the two automata, I need to consider two things: the two sets of possible first tokens, and the two sets of possible everything-else tails. The tails can be represented as finite automata, and can be derived from the original automata.

So the comparison algorithm is recursive - compare the heads; if they differ, you have your result, and if they are the same, recursively compare the tails.

The problem is the infinite sequence needed to prove equivalence for regular grammars in general. If, during a comparison, a pair of automata recurs - equivalent to a pair that you checked previously - you have proven equivalence and you can stop checking. It is in the nature of finite automata that this must happen in a finite number of steps.
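The recurrence-based termination is easiest to see in the standard state-pairwise equivalence test, where the recurring pairs are just pairs of states rather than pairs of derived automata. A sketch (my own illustration, dict-of-dicts DFAs as before): walk the two automata in lockstep, memoize every state pair seen, and skip pairs that recur; since both state sets are finite, the walk must terminate.

```python
def equivalent(start1, trans1, acc1, start2, trans2, acc2, alphabet):
    """Test two DFAs for language equivalence by lockstep traversal."""
    seen = set()
    stack = [(start1, start2)]
    while stack:
        s1, s2 = stack.pop()
        if (s1, s2) in seen:
            continue  # recurring pair: nothing new to check, so skip it
        seen.add((s1, s2))
        if (s1 in acc1) != (s2 in acc2):
            return False  # a distinguishing string reaches this pair
        for symbol in alphabet:
            t1 = trans1.get(s1, {}).get(symbol)
            t2 = trans2.get(s2, {}).get(symbol)
            if (t1 is None) != (t2 is None):
                return False  # one automaton dies where the other continues
            if t1 is not None:
                stack.append((t1, t2))
    return True
```

The termination argument is exactly the one above: there are only finitely many `(s1, s2)` pairs, so the `seen` set eventually absorbs every reachable pair.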

The problem is that I still have a problem in the same form. To spot my termination criteria, I need to compare my pair of current automata with all the past automata pairs that occurred during the comparison so far. That's what has been giving me a headache.

It also turns out that that paper is relevant, but probably only takes me this far. Regular languages can form a group using the concatenation operator, and the left coset is related to the head:tail things I've been considering.

The reason I'm an idiot is because I've been imposing a far too strict termination condition, and I should have known it, because it's not that unusual an issue WRT automata algorithms.

I don't need to stop at the first recurrence of an automata pair. I can continue until I find a more easily detected recurrence - one that has some structural equivalence as well as logical equivalence. So long as my derive-a-tail-automaton algorithm is sane (and especially if I minimise and do other cleanups at each step) I will not generate an infinite sequence of equivalent-but-different-looking automata pairs during the comparison. The only sources of variation in structure are the original two automata and the tail automaton algorithm, both of which are finite.

The point is that it doesn't matter that much if I compare too many lexical terms - I will still get the correct result, and while I will terminate a little later, I will still terminate in finite time.

This should mean that I can use an unreliable recurrence detection (allowing some false negatives) using a hash or ordered comparison that is sensitive to the structure of the automata. That's a simpler problem than the structure-insensitive comparison, and I think it's the key that I need.

Of course there's still the issue of performance. A linear search using a standard equivalence algorithm might be a faster approach, based on the issues involved here. Certainly I would expect this comparison to be a less efficient equivalence test than existing algorithms, as it is doing more work - lexical ordering of the non-equivalent cases. The real issue is the overall efficiency of a key-based search, and that is likely to need some headache-inducing analysis. I'm hoping that the fact that non-equivalent automata will tend to compare quickly (detecting a difference in the first few steps, like traditional string comparisons) will make this a practical approach.

Also, if I reach a point where I suspect equivalence, I could use a standard equivalence algorithm to check. If that check fails, I just continue comparing for the ordering where I left off, without needing to check for the tail language recurring - I know that I will find a difference in a finite number of steps.

If all you can do is == or !=, then I think you have to check every set member before adding another one. This is slow. (Edit: I guess you already know this, given the title of your question, even though you go on about comparison functions to directly compare two finite automata.)

I tried to do that with phylogenetic trees, and it quickly runs into performance problems. If you want to build large sets without duplicates, you need a way to transform to a canonical form. Then you can check a hash, or insert into a binary tree with the string representation as a key.
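To make the canonical-form approach concrete, here is a small sketch (my own, with a plain Python dict standing in for the associative container): use the canonical string encoding as the key, so that isomorphic automata land on the same entry. As with the hashing answer, this only collapses full equivalence classes if the automata are minimized first.

```python
def canonical_key(start, transitions, accepting):
    """Encode a DFA as a canonical string usable as a dictionary key."""
    numbering = {start: 0}
    stack = [start]
    tuples = []
    while stack:
        state = stack.pop()
        for symbol in sorted(transitions.get(state, {})):
            target = transitions[state][symbol]
            if target not in numbering:
                numbering[target] = len(numbering)
                stack.append(target)
            tuples.append((numbering[state], symbol, numbering[target],
                           target in accepting))
    return ";".join(map(str, sorted(tuples)))

table = {}
table[canonical_key('A', {'A': {'a': 'B'}, 'B': {}}, {'B'})] = "matches 'a'"
# A structurally renamed but isomorphic automaton finds the same entry:
value = table[canonical_key('X', {'X': {'a': 'Y'}, 'Y': {}}, {'Y'})]
```

Lookups are then ordinary O(1) hash-table lookups, avoiding the linear scan with a per-key equivalence test.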

Another researcher who did come up with a way to transform a tree to a canonical rep used Patricia trees to store unique trees for duplicate-checking.
