Regex nested brackets (ignoring inside brackets and whitespace)

I am trying to create a regex pattern that reads through a BibTeX citation file and matches everything inside the curly braces. For those who don't know, a BibTeX citation looks like the following:

@INPROCEEDINGS{Fogel95,
  AUTHOR =       {L. J. Fogel and P. J. Angeline and D. B. Fogel},
  TITLE =        {An evolutionary programming approach to self-adaptation
                    on finite state machines},
  BOOKTITLE =    {Proceedings of the Fourth International Conference on
                    Evolutionary Programming},
  YEAR =         {1995},
  pages =        {355--365}
}

@ARTICLE{Goldberg91,
  AUTHOR =       {D. Goldberg},
  TITLE =        {Real-coded genetic algorithms, virtual alphabets, and blocking},
  JOURNAL =      {Complex Systems},
  YEAR =         {1991},
  pages =        {139--167}
}

@INPROCEEDINGS{Yao96,
  AUTHOR =       {X. Yao and Y. Liu},
  TITLE =        {Fast evolutionary programming},
  BOOKTITLE =    {Proceedings of the 6$^{th}$ Annual Conference on Evolutionary
                    Programming},
  YEAR =         {1996},
  pages =        {451--460}
}

The pattern I currently have is as follows:

@(\\w+)\{(\\w+),\\s*((\\w+)\\s*=\\s*(\\"|\\{)?(.+)(\\"|\\})?,?\\s*)+\\}

This pattern matches the second citation but only parts of the first and third. I know the reason it doesn't match the third citation is the nested braces on the right-hand side of one of its fields (6$^{th}$), and I have figured out that it won't match citations whose field values contain whitespace or newlines, for example:

BOOKTITLE =    {Proceedings of the Fourth International Conference on
                Evolutionary Programming},
//This part of the citation has a newline in the middle of it.

Now I have been slaving away trying to fix my pattern, but what I have found with regular expressions is that the longer I try to fix the expression or add new conditions to it, the more confusing it gets. I am just wondering how to capture the whole citation regardless of inner braces or parentheses. Some citations have no braces or parentheses after the "=" sign at all. Any help, along with an explanation, would be greatly appreciated. I have looked at similar examples, which have only confused me more because it is hard to decipher a regular expression by simply glancing at it. Thank you.

The simplest way to capture everything between curly braces is:

\{([^}]+)}

The negated character class [^}] matches any character that is not a closing curly brace, including newlines.
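As a small illustration of that pattern, the sketch below compiles it in Java and applies it to a few field lines. The class name BraceCapture and the helper firstBraced are hypothetical names chosen for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BraceCapture {
    // The pattern \{([^}]+)} written as a Java string literal.
    private static final Pattern BRACED = Pattern.compile("\\{([^}]+)}");

    /** Returns the text between the first '{' and the next '}', or null if nothing matches. */
    static String firstBraced(String text) {
        Matcher m = BRACED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // A simple value on one line.
        System.out.println(firstBraced("YEAR = {1995},"));
        // A value wrapped over two lines: the newline is captured as well.
        System.out.println(firstBraced("TITLE = {Fast evolutionary\n   programming},"));
        // A value with a nested group: the capture ends at the first '}'.
        System.out.println(firstBraced("BOOKTITLE = {Proceedings of the 6$^{th}$ Annual Conference},"));
    }
}
```

The last call shows the limit of this pattern: the capture stops at the first closing brace, so a nested group such as {th} cuts the value short.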

Regex is not a good parser for text with nested blocks.
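One common alternative for nested blocks is to scan the text and count brace depth. The following is a minimal sketch of that idea, not a full BibTeX parser. The class name BalancedBraces and the method balancedBody are hypothetical, and the sketch assumes braces are never escaped or quoted.

```java
public class BalancedBraces {
    /**
     * Returns the text between the '{' at index open and its matching '}',
     * or null if open does not point at '{' or the braces are unbalanced.
     * Assumption: every '{' and '}' counts; escaped braces are not handled.
     */
    static String balancedBody(String text, int open) {
        if (open < 0 || open >= text.length() || text.charAt(open) != '{') {
            return null;
        }
        int depth = 0;
        for (int i = open; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '{') {
                depth++;                      // entering a nested group
            } else if (c == '}') {
                depth--;                      // leaving a group
                if (depth == 0) {
                    return text.substring(open + 1, i);
                }
            }
        }
        return null;                          // ran out of text before the braces closed
    }

    public static void main(String[] args) {
        String field = "BOOKTITLE = {Proceedings of the 6$^{th}$ Annual Conference},";
        System.out.println(balancedBody(field, field.indexOf('{')));
    }
}
```

Because the depth counter tracks every opening and closing brace, the nested {th} group stays inside the returned value, whatever the nesting depth.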

If you insist on using regex, you should match the outer part first:

@INPROCEEDINGS{Fogel95,
  ???
}

Capture the ??? part, so you can match against it in a nested loop.

The outer regex would be something like @(\\w+)\\{(\\w+),([^{}]*(?:\\{[^{}]*\\}[^{}]*)*)\\}

The inner regex would be something like (\\w+)\\s*=\\s*\\{([^}]*)\\}

Since a field value may be wrapped across multiple lines, you need to unwrap it by replacing each line break and the whitespace around it with a single space.

Code

Pattern pTag = Pattern.compile("@(\\w+)" + // tag
                               "\\{" +
                                  "(\\w+)" + // name
                                  "," +
                                  "([^{}]*(?:\\{[^{}]*\\}[^{}]*)*)" + // content
                               "\\}");
Pattern pField = Pattern.compile("(\\w+)" + // field
                                 "\\s*=\\s*" +
                                 "\\{" +
                                    "([^}]*)" + // value
                                 "\\}");
// One or more line breaks with the whitespace around them (\R requires Java 8 or later)
Pattern pNewline = Pattern.compile("\\s*(?:\\R\\s*)+");
for (Matcher mTag = pTag.matcher(input); mTag.find(); ) {
    String tag = mTag.group(1);
    String name = mTag.group(2);
    String content = mTag.group(3);
    for (Matcher mField = pField.matcher(content); mField.find(); ) {
        String field = mField.group(1);
        String value = mField.group(2);
        value = pNewline.matcher(value).replaceAll(" ");
        System.out.printf("%-15s %-12s %-11s %s%n", tag, name, field, value);
    }
}

Test input

String input = "@INPROCEEDINGS{Fogel95,\n" +
               "  AUTHOR =       {L. J. Fogel and P. J. Angeline and D. B. Fogel},\n" +
               "  TITLE =        {An evolutionary programming approach to self-adaptation\n" +
               "                    on finite state machines},\n" +
               "  BOOKTITLE =    {Proceedings of the Fourth International Conference on\n" +
               "                    Evolutionary Programming},\n" +
               "  YEAR =         {1995},\n" +
               "  pages =        {355--365}\n" +
               "}\n" +
               "\n" +
               "@ARTICLE{Goldberg91,\n" +
               "  AUTHOR =       {D. Goldberg},\n" +
               "  TITLE =        {Real-coded genetic algorithms, virtual alphabets, and blocking},\n" +
               "  JOURNAL =      {Complex Systems},\n" +
               "  YEAR =         {1991},\n" +
               "  pages =        {139--167}\n" +
               "}\n" +
               "\n" +
               "@INPROCEEDINGS{Yao96,\n" +
               "  AUTHOR =       {X. Yao and Y. Liu},\n" +
               "  TITLE =        {Fast evolutionary programming},\n" +
               "  BOOKTITLE =    {Proceedings of the 6$^{th}$ Annual Conference on Evolutionary\n" +
               "                    Programming},\n" +
               "  YEAR =         {1996},\n" +
               "  pages =        {451--460}\n" +
               "}";

Output

INPROCEEDINGS   Fogel95      AUTHOR      L. J. Fogel and P. J. Angeline and D. B. Fogel
INPROCEEDINGS   Fogel95      TITLE       An evolutionary programming approach to self-adaptation on finite state machines
INPROCEEDINGS   Fogel95      BOOKTITLE   Proceedings of the Fourth International Conference on Evolutionary Programming
INPROCEEDINGS   Fogel95      YEAR        1995
INPROCEEDINGS   Fogel95      pages       355--365
ARTICLE         Goldberg91   AUTHOR      D. Goldberg
ARTICLE         Goldberg91   TITLE       Real-coded genetic algorithms, virtual alphabets, and blocking
ARTICLE         Goldberg91   JOURNAL     Complex Systems
ARTICLE         Goldberg91   YEAR        1991
ARTICLE         Goldberg91   pages       139--167
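The outer pattern above allows only one level of braces inside an entry, so a value with a nested group such as 6$^{th}$ is beyond what it can match. As a hedged sketch, the same unrolled building block can be repeated once more to allow one extra level. The names TwoLevelBibtex, ONE_LEVEL, TWO_LEVEL and parse are hypothetical, and deeper nesting would still fail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoLevelBibtex {
    // Text that may contain brace groups, with no braces inside those groups.
    private static final String ONE_LEVEL = "[^{}]*(?:\\{[^{}]*\\}[^{}]*)*";
    // Text whose brace groups may themselves contain ONE_LEVEL text.
    private static final String TWO_LEVEL = "[^{}]*(?:\\{" + ONE_LEVEL + "\\}[^{}]*)*";

    private static final Pattern TAG =
            Pattern.compile("@(\\w+)\\{(\\w+),(" + TWO_LEVEL + ")\\}");
    private static final Pattern FIELD =
            Pattern.compile("(\\w+)\\s*=\\s*\\{(" + ONE_LEVEL + ")\\}");
    private static final Pattern NEWLINE = Pattern.compile("\\s*(?:\\R\\s*)+");

    /** Returns one "tag | name | field | value" row per field found in the input. */
    static List<String> parse(String input) {
        List<String> rows = new ArrayList<>();
        for (Matcher mTag = TAG.matcher(input); mTag.find(); ) {
            for (Matcher mField = FIELD.matcher(mTag.group(3)); mField.find(); ) {
                // Unwrap values that span several lines.
                String value = NEWLINE.matcher(mField.group(2)).replaceAll(" ");
                rows.add(mTag.group(1) + " | " + mTag.group(2) + " | "
                        + mField.group(1) + " | " + value);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String input = "@INPROCEEDINGS{Yao96,\n"
                + "  BOOKTITLE =    {Proceedings of the 6$^{th}$ Annual Conference on Evolutionary\n"
                + "                    Programming},\n"
                + "  YEAR =         {1996}\n"
                + "}";
        parse(input).forEach(System.out::println);
    }
}
```

Each additional nesting level needs another layer of the pattern, which is one reason a depth-counting scanner or a dedicated BibTeX parser tends to be easier to maintain.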

As best I can tell, Andreas's solution is probably better, but if you just want to use a regex to break the whole string into an array of strings, you can use this: @(.*){(.*),\\s*(.*?)\\s*=\\s*{(.*?)},(?:\\s*(.*) =\\s*{([\\s\\S]*?)},)*?(?:\\s*?(.*?) =\\s*?{(.*?)})*?\\s*?}
