
Incremental Pattern (RegEx) matching in Java?

Original post: 2012-10-09 16:32:49 | Tags: java, regex

Is there a way, or an efficient library, that allows incremental regular expression matching in Java?

What I mean is that I would like to have an OutputStream that I can send a few bytes at a time to, and that keeps track of how the data received so far matches a regular expression. If a byte arrives that means the regular expression can definitely no longer match, I would like the stream to tell me so. Otherwise it should keep me informed about the current best match, if any.

I realize that this is likely to be an extremely difficult and not well-defined problem, since one can imagine regular expressions that can match a whole expression or any part of it, or that cannot reach a decision until the stream is closed anyway. Even something as trivial as .* can match H, He, Hel, Hell, Hello, and so forth. In such a case, I would like the stream to say: "Yes, this expression would match if the input ended now, and here are the groups it would return."

But if Pattern internally steps through the string it matches character by character, it might not be so hard?

1 solution

Incremental matching can be nicely achieved by computing the finite state automaton corresponding to a regular expression, and performing state transitions on it while processing the characters of the input. Most lexers work this way. This approach won't work well for groups, though, because a plain automaton only tracks its current state and does not record where each capturing group started and ended.
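
To make the automaton idea concrete, here is a minimal sketch of a hand-built DFA for the fixed regular expression ab*c, fed one character at a time. The class name IncrementalDfa, the Status enum, and the state numbering are made up for illustration; a real implementation would generate the transition table from the regular expression instead of hard-coding it.

```java
/** Hand-built DFA for the regex ab*c, advanced one input character at a time. */
class IncrementalDfa {
    /** MATCH: input so far matches; POSSIBLE: more input could match; DEAD: can never match. */
    enum Status { MATCH, POSSIBLE, DEAD }

    private static final int DEAD_STATE = -1; // no path to an accepting state
    private static final int ACCEPT_STATE = 2; // reached after the final 'c'
    private int state = 0;                     // 0 = start state

    /** Performs one state transition and reports the status of the input seen so far. */
    Status feed(char c) {
        state = step(state, c);
        return status();
    }

    Status status() {
        if (state == DEAD_STATE) return Status.DEAD;
        return state == ACCEPT_STATE ? Status.MATCH : Status.POSSIBLE;
    }

    /** Transition function for ab*c. */
    private static int step(int s, char c) {
        switch (s) {
            case 0:  return c == 'a' ? 1 : DEAD_STATE;
            case 1:  return c == 'b' ? 1 : (c == 'c' ? ACCEPT_STATE : DEAD_STATE);
            default: return DEAD_STATE; // nothing may follow 'c'; a dead state stays dead
        }
    }

    public static void main(String[] args) {
        IncrementalDfa dfa = new IncrementalDfa();
        for (char c : "abbc".toCharArray()) {
            System.out.println(c + " -> " + dfa.feed(c));
        }
    }
}
```

Each call to feed costs constant time and needs no buffering of earlier input, which is what makes the automaton attractive for streams. Note that for other expressions (such as .*) an accepting state can still have outgoing transitions, so MATCH only means "would match if the input ended now".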

So perhaps you could split this into two parts: first, have one matcher which figures out whether there is any match at all, or any chance of a match in the future. You can use that to give you a quick reply after every input character. Second, once you have a complete match, you can execute a backtracking, grouping regular expression engine to identify your matching groups. In some cases, it might be feasible to encode the grouping information into the automaton as well, but I can't think of a generic way to accomplish this.
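
As a rough illustration of the "match, might still match, can never match" reply followed by group extraction, the sketch below uses only java.util.regex. Be aware that it swaps in a different technique for the first part: instead of an automaton, it re-runs the standard backtracking engine on the buffered input after every chunk and consults Matcher.hitEnd(), which the JDK documents as indicating that more input could change the result of the last match. The class name IncrementalRegex and the Status enum are hypothetical, and rescanning the whole buffer costs time proportional to its length on every chunk.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Buffers input and re-checks it against a pattern after every chunk. */
class IncrementalRegex {
    enum Status { MATCH, POSSIBLE, DEAD }

    private final Pattern pattern;
    private final StringBuilder buffer = new StringBuilder();
    private Matcher last; // matcher from the most recent feed()

    IncrementalRegex(String regex) {
        pattern = Pattern.compile(regex);
    }

    /** Appends a chunk and reports how the whole buffered input relates to the pattern. */
    Status feed(CharSequence chunk) {
        buffer.append(chunk);
        last = pattern.matcher(buffer.toString()); // snapshot, so groups stay stable
        if (last.matches()) return Status.MATCH;
        // hitEnd() == true: the engine ran out of input, so more input might still match.
        return last.hitEnd() ? Status.POSSIBLE : Status.DEAD;
    }

    /** Returns a capturing group; only valid if the last feed() returned MATCH. */
    String group(int i) {
        return last.group(i);
    }

    public static void main(String[] args) {
        IncrementalRegex m = new IncrementalRegex("(\\d+)-(\\d+)");
        for (String chunk : new String[] {"12", "-", "3"}) {
            System.out.println("\"" + chunk + "\" -> " + m.feed(chunk));
        }
        System.out.println("groups: " + m.group(1) + ", " + m.group(2));
    }
}
```

A stream wrapper along the lines of the question could call feed from its write method and stop accepting data once DEAD is returned. After a MATCH, Matcher.hitEnd() can be consulted again to see whether further input might still extend or invalidate the match.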