What Java regular expression do I need to match this text?

I'm trying to match the following using a regular expression in Java. I have some data separated by the two characters 'ZZ'. Each record starts with 'ZZ' and finishes with 'ZZ'. I want to match a record that has no ending 'ZZ'; for example, I want to match the trailing 'ZZanychars' below (note: the *'s are not included in the string, they just mark the bit I want to match).

ZZanycharsZZZZanycharsZZ*ZZanychars*

But I don't want the following to match, because the record has ended:

ZZanycharsZZZZanycharsZZZZanycharsZZ

EDIT: To clarify things, here are the 2 test cases I am using:

// This should match and in one of the groups should be 'ZZthree'
String testString1 = "ZZoneZZZZtwoZZZZthree";

// This should not match
String testString2 = "ZZoneZZZZtwoZZZZthreeZZ";

EDIT: Adding a third test:

// This should match and in one of the groups should be 'threeZee'
String testString3 = "ZZoneZZZZtwoZZZZthreeZee";

(Edited after the third example was posted)

Try:

(?!ZZZ)ZZ((?!ZZ).)++$

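Demo:

import java.util.regex.*;

public class Main {
    public static void main(String[] args) {
        // The three test strings from the question.
        String[] tests = {
            "ZZoneZZZZtwoZZZZthree",
            "ZZoneZZZZtwoZZZZthreeZZ",
            "ZZoneZZZZtwoZZZZthreeZee"
        };
        // (?!ZZZ)ZZ   : a ZZ that is not immediately followed by a third Z
        // ((?!ZZ).)++ : one or more characters, none of which starts a ZZ (possessive)
        // $           : end of input
        Pattern p = Pattern.compile("(?!ZZZ)ZZ((?!ZZ).)++$");
        for (String tst : tests) {
            Matcher m = p.matcher(tst);
            // Print the whole match, or "no!" when nothing matches.
            System.out.println(tst + " -> " + (m.find() ? m.group() : "no!"));
        }
    }
}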
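The question asks for the record text to land in a group, whereas in the pattern above group 1 is overwritten on every repetition and ends up holding only the last character. A minimal sketch of one possible variant follows; the extra non-capturing group and the names `TrailingRecordBody` and `trailingBody` are illustrative assumptions, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingRecordBody {
    // Variant of the pattern above (an assumption, not the original answer):
    // the repetition is wrapped in one capturing group, so group 1 holds the
    // whole record body instead of only its last character.
    private static final Pattern TRAILING =
            Pattern.compile("(?!ZZZ)ZZ((?:(?!ZZ).)++)$");

    // Returns the body of the trailing unterminated record, or null if there is none.
    static String trailingBody(String input) {
        Matcher m = TRAILING.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(trailingBody("ZZoneZZZZtwoZZZZthree"));    // three
        System.out.println(trailingBody("ZZoneZZZZtwoZZZZthreeZZ"));  // null
        System.out.println(trailingBody("ZZoneZZZZtwoZZZZthreeZee")); // threeZee
    }
}
```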

To match only the final, unterminated record:

(?<=[^Z]ZZ|^)ZZ(?:(?!ZZ).)++$

The starting delimiter is two Z's, but there can be a third Z that's considered part of the data. The lookbehind ensures that you don't match a Z that's part of the previous record's ending delimiter (since an ending delimiter cannot be preceded by a non-delimiter Z). However, this assumes there will never be empty records (or records containing only a single Z), which could lead to eight or more Z's in a row:

ZZabcZZZZdefZZZZZZZZxyz
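As a rough check of the lookbehind pattern and of the limitation just described, the sketch below runs it against the question's three test strings and the eight-Z string. The class and method names (`LookbehindCheck`, `lastRecord`) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindCheck {
    // The lookbehind pattern from this answer, unchanged.
    private static final Pattern LAST_RECORD =
            Pattern.compile("(?<=[^Z]ZZ|^)ZZ(?:(?!ZZ).)++$");

    // Returns the trailing unterminated record (including its leading ZZ), or null.
    static String lastRecord(String input) {
        Matcher m = LAST_RECORD.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(lastRecord("ZZoneZZZZtwoZZZZthree"));    // ZZthree
        System.out.println(lastRecord("ZZoneZZZZtwoZZZZthreeZZ"));  // null
        System.out.println(lastRecord("ZZoneZZZZtwoZZZZthreeZee")); // ZZthreeZee
        // An empty record puts eight Z's in a row; the lookbehind then rejects
        // the final ZZ, so the trailing record is not found.
        System.out.println(lastRecord("ZZabcZZZZdefZZZZZZZZxyz"));  // null
    }
}
```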

If that were possible, I would forget about trying to match the final record by itself, and instead match all of them from the beginning:

(?:ZZ(?:(?!ZZ).)*+ZZ)*+(ZZ(?:(?!ZZ).)++$)

The final, unterminated record is now captured in group #1.

I'd suggest something like...
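A minimal sketch of how the match-everything pattern might be used. Calling `matches()` so that the pattern must span the whole input is an assumption made here, and `FullRecordScan` and `lastRecord` are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FullRecordScan {
    // Consumes every terminated record from the start, then captures the
    // final unterminated record in group 1.
    private static final Pattern ALL_RECORDS =
            Pattern.compile("(?:ZZ(?:(?!ZZ).)*+ZZ)*+(ZZ(?:(?!ZZ).)++$)");

    // Returns the trailing unterminated record, or null if the input does not fit.
    static String lastRecord(String input) {
        Matcher m = ALL_RECORDS.matcher(input);
        // matches() anchors the pattern at both ends (an assumption of this sketch).
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(lastRecord("ZZoneZZZZtwoZZZZthree"));    // ZZthree
        System.out.println(lastRecord("ZZoneZZZZtwoZZZZthreeZZ"));  // null
        System.out.println(lastRecord("ZZoneZZZZtwoZZZZthreeZee")); // ZZthreeZee
        System.out.println(lastRecord("ZZabcZZZZdefZZZZZZZZxyz"));  // ZZxyz
    }
}
```

Unlike the lookbehind version, this one copes with the empty record in the last input because it walks the records in order rather than guessing where the last one starts.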

/ZZ(.*?)(ZZ|$)/

This will match:

  1. ZZ: the literal string ZZ
  2. (.*?): any characters, matched lazily
  3. (ZZ|$): either another ZZ literal, or the end of the string
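On its own this pattern matches terminated records as well, so a caller still has to tell the two cases apart. One possible approach, sketched below as an assumption rather than as part of the answer, is to walk the matches and pick the one whose second group is empty, which means the record ran to the end of the string. The surrounding slashes are delimiters and are not part of the Java pattern; `RecordWalker` and `unterminatedBody` are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordWalker {
    // Group 1 is the record body; group 2 is the closing ZZ, or empty when
    // the record was closed by the end of the string instead.
    private static final Pattern RECORD = Pattern.compile("ZZ(.*?)(ZZ|$)");

    // Returns the body of the first record that has no closing ZZ, or null.
    static String unterminatedBody(String input) {
        Matcher m = RECORD.matcher(input);
        while (m.find()) {
            if (m.group(2).isEmpty()) {
                return m.group(1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(unterminatedBody("ZZoneZZZZtwoZZZZthree"));    // three
        System.out.println(unterminatedBody("ZZoneZZZZtwoZZZZthreeZZ"));  // null
        System.out.println(unterminatedBody("ZZoneZZZZtwoZZZZthreeZee")); // threeZee
    }
}
```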

^ZZ.*(?<!ZZ)$

Assert position at the beginning of the string «^»
Match the characters “ZZ” literally «ZZ»
Match any single character that is not a line break character «.*»
   Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!ZZ)»
   Match the characters “ZZ” literally «ZZ»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»


Created with RegexBuddy
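Read this way, `^ZZ.*(?<!ZZ)$` acts as a whole-string test: it succeeds when the input starts with ZZ and does not end with ZZ, and the match is the entire input rather than just the last record. A small sketch under that reading, with `UnterminatedCheck` and `endsUnterminated` as illustrative names:

```java
import java.util.regex.Pattern;

class UnterminatedCheck {
    // Starts with ZZ, and the two characters before the end are not ZZ.
    private static final Pattern UNTERMINATED = Pattern.compile("^ZZ.*(?<!ZZ)$");

    // True when the input begins with ZZ and its final record is not closed.
    static boolean endsUnterminated(String input) {
        return UNTERMINATED.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(endsUnterminated("ZZoneZZZZtwoZZZZthree"));    // true
        System.out.println(endsUnterminated("ZZoneZZZZtwoZZZZthreeZZ"));  // false
        System.out.println(endsUnterminated("ZZoneZZZZtwoZZZZthreeZee")); // true
    }
}
```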

There's one tricky part to this: ZZ is both the start token and the end token.

There's one start case (ZZ, not followed by another ZZ, which would signify that the first ZZ was actually an end token), and two end cases (ZZ followed by the end of the string, and ZZ followed by ZZ). The goal is to match the start case and NOT either of the end cases.

To that end, here's what I suggest:

/ZZ(?!ZZ)(.*?)(ZZ(?!(ZZ|$))|$)/

For the string ZZfooZZZZbarZZbazZZ:

  • This will NOT match ZZfooZZ, a legitimate record: ZZ, not followed by ZZ, followed by any combination of characters (here "foo"), followed by ZZ, but that ZZ is followed by ZZ, which opens the next record.
  • The next part examined is the ZZ after foo. This fails because the ZZ cannot be followed by another ZZ, yet in this case it is. This is what we want, because the ZZ right after foo does not start a new record anyway.
  • The ZZ right before bar is not followed by another ZZ, so it's a legitimate start of record. "bar" is consumed by the .*?. Then there is a ZZ, but it is NOT followed by another ZZ or the end of the string, which means that the ZZbar token is no good.
    • (It COULD be interpreted by a human as ZZbarZZ with bazZZ not being valid, but in either case there's something wrong, so I just wrote the regex to treat the wrongly formatted record as occurring here.)
    • So ZZbar will be caught/matched by the regex, as illegitimate.
  • The ZZ after bar isn't followed by ZZ; it is followed by baz, followed by a ZZ that fails the lookahead assertion stating it can't be followed by the end of the string. So ZZbazZZ is a legitimate record and is not captured by the regex.

One more case: for ZZfoo, the beginning ZZ is okay, the foo is captured, then the regex notes that it's the end of the string and no ZZ has occurred. Thus, ZZfoo is captured as an illegitimate match.

Let me know if this doesn't make sense, so I can make it clearer.

How about removing all matches of ZZallcharsZZ? Whatever is left is what you want.
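The walkthrough describes the intended behaviour, and it is worth checking against what the engine does. The sketch below simply lists every match `find()` reports; `LookaheadProbe` and `allMatches` are illustrative names, and the slashes are dropped because they are delimiters rather than part of the Java pattern. For ZZfoo the result agrees with the description. For the longer string, the lazy `.*?` is free to absorb a Z or a whole ZZ when the lookahead rejects the first candidate, so the reported matches come out differently from the ones intended above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadProbe {
    // The suggested pattern, unchanged apart from removing the / delimiters.
    private static final Pattern SUGGESTED =
            Pattern.compile("ZZ(?!ZZ)(.*?)(ZZ(?!(ZZ|$))|$)");

    // Collects every successive match that find() reports, left to right.
    static List<String> allMatches(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = SUGGESTED.matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("ZZfoo"));               // [ZZfoo]
        System.out.println(allMatches("ZZfooZZZZbarZZbazZZ")); // [ZZfooZZZ, ZZbazZZ]
    }
}
```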

ZZ.*?ZZ
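A minimal sketch of that idea using `String.replaceAll`; the class and method names (`StripTerminated`, `leftover`) are illustrative. Whatever survives the removal is the unterminated tail, and an empty result means every record was closed.

```java
class StripTerminated {
    // Removes every terminated record (ZZ ... ZZ, matched lazily) and
    // returns what is left over.
    static String leftover(String input) {
        return input.replaceAll("ZZ.*?ZZ", "");
    }

    public static void main(String[] args) {
        System.out.println(leftover("ZZoneZZZZtwoZZZZthree"));           // ZZthree
        System.out.println(leftover("ZZoneZZZZtwoZZZZthreeZZ").isEmpty()); // true
        System.out.println(leftover("ZZoneZZZZtwoZZZZthreeZee"));        // ZZthreeZee
    }
}
```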
