
Regular expression to extract substrings in groups

I want to extract names from the following input using a regular expression.

Student Names:
    Name1
    Name2
    Name3

Parent Names:
    Name1
    Name2
    Name3

I am using the following method to match the data, and I am not supposed to modify it. I have to come up with a regular expression that works with this method.

public void parseName(String patternRegX){

        Pattern patternDomainStatus = Pattern.compile(patternRegX);
        Matcher matcherName = patternName.matcher(inputString);
        List<String> tmp=new ArrayList<String>();

        while (matcherName.find()){
            if (!matcherName.group(2).isEmpty())
                tmp.add(matcherName.group(2));
        }
}

I came up with a regular expression that I thought would give me the desired result, but I found that grouping doesn't work inside square brackets ([]).

private String studentRegX="(Student Names:\\n[ + (\\S+) \\n]+\\n)";

I am now using the following regular expressions, but they give me only the last name in each set.

private String studentRegX="Student Names:\\n( +(\\S+)\\n)+\\n";
private String parentRegX="Parent Names:\\n( +(\\S+)\\n)+\\n";

Thank you in advance for the help.

First of all, I hope you can change the parseName method a little bit, because it doesn't compile. patternDomainStatus and patternName are probably supposed to refer to the same object:

    Pattern pattern = Pattern.compile(patternRegX);
    Matcher matcherName = pattern.matcher(inputString);

Secondly, you need to think about your regex a little differently.

Right now, your regexes are trying to match entire chunks with multiple names in them. But matcherName.find() finds "the next subsequence of the input sequence that matches the pattern" (per the javadoc).

So what you want is a regex that matches a single name. Each call to matcherName.find() then moves on to the next part of your string that matches that regex, so the while loop visits every name.

If you're not already familiar with the difference between repeating a capturing group and capturing a repeating group, that's worth reading up on. One resource for that is http://www.regular-expressions.info/captureall.html, but others would be fine too.
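To make that difference concrete, here is a minimal sketch. The class and method names are hypothetical, and the input is assumed to use "\n" line breaks with space-indented names. A capturing group inside a repetition keeps only its last iteration, while a pattern that matches one name lets find() report each name separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class RepeatedGroupDemo {
    // Assumed input: "\n" line breaks, names indented with spaces.
    static final String INPUT =
            "Student Names:\n    Name1\n    Name2\n    Name3\n\n";

    // Collects group(2) of every match, like the loop in parseName.
    static List<String> collectGroup2(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Repeated capturing group: one match for the whole block, and
        // group(2) is overwritten on each repetition of the outer group.
        System.out.println(collectGroup2("Student Names:\\n( +(\\S+)\\n)+\\n"));
        // One name per match: find() visits each indented line in turn.
        System.out.println(collectGroup2("\\n( +)(\\S+)"));
    }
}
```

The first call prints [Name3] and the second prints [Name1, Name2, Name3], which is the "only the last name" symptom from the question next to the per-match behaviour.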

If you already knew about that difference and were deliberately trying to capture a repeating group with what you've written above, then please edit your post to explain what you're trying to do. A letter-by-letter explanation would be ideal, so we can see what you understand and what you don't, and help you with whatever you're stuck on.

I see what I believe is the solution, but since this is clearly homework, I'm not willing to simply give it to you. But I'd be happy to help you figure it out.

--- Edit: ---

You're only getting one match because the regex requires "Student Names:" or "Parent Names:" to be in each match, so it can only match once. For your regex to match multiple times in a row (as required by the while (matcherName.find()) loop), you need to get "Student Names:" and "Parent Names:" out of the consumed part of the regex, so the regex can match repeatedly.

It's easy to get all of the names (both students and parents) with just a regex that looks for a newline followed by one or more spaces and then text. The challenge is to differentiate the student names (which come before the "Parent Names:" line) from the parent names (which come after the "Parent Names:" line). The key concept for differentiating between them is lookaheads, which can be positive or negative. Take a look at them and see if you can figure out how to implement this using lookaheads.
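As a generic illustration of the mechanism (on a toy string, deliberately not the solution to the exercise), the sketch below shows a positive and a negative lookahead. The class name, helper and input are all made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the input string is a toy example.
class LookaheadDemo {
    // Returns the full text of every match of regex in input.
    static List<String> findAll(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "key: value other: thing";
        // Positive lookahead: words immediately followed by ':'.
        // The ':' is checked but not consumed, so it is not in the match.
        System.out.println(findAll("\\w+(?=:)", input));
        // Negative lookahead: whole words NOT immediately followed by ':'.
        System.out.println(findAll("\\b\\w+\\b(?!:)", input));
    }
}
```

This prints [key, other] and then [value, thing]. The same idea carries over to the names: a lookahead can test what comes later in the input without making it part of the match.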

Also, you may find that group #2 isn't the group you really want to use. It's unfortunate that the group number is hard-coded, but since it is, you can tweak your regex to make groups non-capturing with the (?:stuff) syntax. That will let you reduce the number of groups and ensure that the group you actually want is #2.
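A small sketch of how (?:...) shifts group numbers follows. The class name and the two patterns are hypothetical and only illustrate the numbering; they are not meant as the regex to hand to parseName.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing how (?:...) changes group numbering.
class NonCapturingDemo {
    static final String INPUT = "Student Names:\n    Name1\n";

    // Returns groups 1..groupCount() of the first match, or an empty list.
    static List<String> groupsOf(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        List<String> groups = new ArrayList<>();
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                groups.add(m.group(i));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Three capturing groups: the name is group 3.
        System.out.println(groupsOf("(Student|Parent) Names:\\n( +)(\\S+)"));
        // First group made non-capturing: the name moves up to group 2.
        System.out.println(groupsOf("(?:Student|Parent) Names:\\n( +)(\\S+)"));
    }
}
```

With the first pattern the list has three entries and the name is last; with the second it has two, so a loop hard-coded to group(2) would now pick up the name.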

Because regex has little to do with algorithmic prowess, here is an answer:

  • On Windows the line break is unfortunately "\r\n".
  • I check that a newline precedes the line and that there is at least some whitespace before the name.
  • The name may contain a space.
  • With a look-ahead I check that "Parent Names" follows.

Then:

Pattern.compile("(?s)(?<=\n)[ \t]+([^\r\n]*)\r?\n(?=.*Parent Names)");
//               ~~~~ '.' also matches newline
//                   ~~~~~~~ look-behind must be newline
//                          ~~~~~~ whitespace (spaces/tabs)
//                                ~~~~~~~~~~ group 1, name
//                                               ~~~~~~~~~~~~~~~~~~~~ look-ahead

Needless to say, a slightly different algorithm would be more robust and easier to understand.

To make the name group(2) instead of group(1) as above, you could wrap the leading whitespace in an extra pair of parentheses: ([ \t]+)
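Put together, this might look like the sketch below. It is only an illustration under some assumptions: the class name and sample names are made up, parseName is adapted to take the input as a parameter and return the list so it can be run standalone, and PARENT_REGEX (a negative look-ahead counterpart) is my own guess rather than something given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; sample names are invented so the sections differ.
class StudentNameDemo {
    static final String INPUT =
            "Student Names:\n    Alice\n    Bob\n\nParent Names:\n    Carol\n    Dave\n";

    // The pattern above with the leading whitespace wrapped in group 1,
    // so that the name lands in group 2.
    static final String STUDENT_REGEX =
            "(?s)(?<=\n)([ \t]+)([^\r\n]*)\r?\n(?=.*Parent Names)";

    // Assumption, not from the answer: the same pattern with a negative
    // look-ahead, matching only names with no "Parent Names" after them.
    static final String PARENT_REGEX =
            "(?s)(?<=\n)([ \t]+)([^\r\n]*)\r?\n(?!.*Parent Names)";

    // Adapted from parseName: same loop, but the input is a parameter
    // and the collected list is returned.
    static List<String> parseName(String patternRegX, String inputString) {
        Pattern pattern = Pattern.compile(patternRegX);
        Matcher matcherName = pattern.matcher(inputString);
        List<String> tmp = new ArrayList<String>();
        while (matcherName.find()) {
            if (!matcherName.group(2).isEmpty())
                tmp.add(matcherName.group(2));
        }
        return tmp;
    }

    public static void main(String[] args) {
        System.out.println(parseName(STUDENT_REGEX, INPUT));
        System.out.println(parseName(PARENT_REGEX, INPUT));
    }
}
```

On this sample input the first call yields [Alice, Bob] and the second [Carol, Dave]. The look-ahead rescans the rest of the input for every candidate line, which is fine for short inputs but is one reason a different algorithm may be preferable.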

It can all be done in a single regex using the \G anchor.
This opens it up for a little regex algorithmic prowess. For each match:

  • Group 1 is not NULL/empty: a new student group starts, and group 3 contains the first student name.
  • Group 2 is not NULL/empty: a new parent group starts, and group 3 contains the first parent name.
  • Group 3 is never NULL/empty: it holds the first or next name, a student or a parent depending on which of group 1 or 2 matched last.

In all cases, group 3 contains a name that has been trimmed and is ready to put into an array.


 # "~(?mi-)(?:(?!\\A)\\G|^(?:(Student)|(Parent))[ ]Names:)\\s*^(?!(?:Student|Parent)[ ]Names:)[^\\S\\r\\n]*(.+?)[^\\S\\r\\n]*$~"

 (?xmi-)                     # Inline 'Expanded, multiline, case insensitive' modifiers
 (?:
      (?! \A )                    # Here, matched before, give Name a first chance
      \G                          # to match again.
   |  
      ^                           # BOL
      (?:
           ( Student )                 # (1), New 'Student' group
        |  ( Parent )                  # (2), New 'Parent' group
      )
      [ ] Names: 
 )
                             # Name section
 \s*                         # Consume all whitespace up until the start of a Name line
 ^                           # BOL 
 (?!
      (?: Student | Parent )      # Names only, Not the start of Student/Parent group here
      [ ] Names:
 )
 [^\S\r\n]*                  # Trim leading whitespace ( can use \h if supported )
 ( .+? )                     # (3), the Name
 [^\S\r\n]*                  # Trim trailing whitespace ( can use \h if supported )
 $                           # EOL
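For what it's worth, here is a rough Java sketch of driving this regex. The class name, the split method and the sample names are hypothetical, the flags are written as (?mi), and the code assumes the input starts with one of the two headers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the \G-based regex.
class GAnchorDemo {
    static final Pattern NAME = Pattern.compile(
            "(?mi)(?:(?!\\A)\\G|^(?:(Student)|(Parent))[ ]Names:)"
          + "\\s*^(?!(?:Student|Parent)[ ]Names:)"
          + "[^\\S\\r\\n]*(.+?)[^\\S\\r\\n]*$");

    // Sorts every name into the section whose header matched last.
    static Map<String, List<String>> split(String input) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        result.put("Student", new ArrayList<>());
        result.put("Parent", new ArrayList<>());
        String current = null;
        Matcher m = NAME.matcher(input);
        while (m.find()) {
            // Groups 1 and 2 are set only on the match that contains a header.
            if (m.group(1) != null) {
                current = "Student";
            } else if (m.group(2) != null) {
                current = "Parent";
            }
            // Group 3 is always the trimmed name.
            result.get(current).add(m.group(3));
        }
        return result;
    }

    public static void main(String[] args) {
        String input =
                "Student Names:\n    Alice\n    Bob\n\nParent Names:\n    Carol\n    Dave\n";
        System.out.println(split(input));
    }
}
```

For this sample the result is {Student=[Alice, Bob], Parent=[Carol, Dave]}. Because the first successful match must go through the header alternative, current is always set before a name is added.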
