
Java Regex to extract any characters between two tokens

I am trying to parse the following text:

### __Description of the report__
Lorem ipsum dolor sit amet,  & mauris elit, blandit a turpis vel nibh, 
consectetuer aliquam. Nec sem. Venenatis quam etiam donec consequat 
sagittis, luctus porttitor odit sollicitudin <> vestibulum ultrices erat,
sed eleifend 
* amet, sollicitudin sit egestas 
* quis eros nulla. Sed donec

### __Notable filters__
* Lorem ipsum dolor sit amet, mauris elit, blandit a turpis vel
* consectetuer aliquam. Nec sem. Venenatis quam etiam donec consequat 
* sagittis, luctus porttitor odit sollicitudin vestibulum ultrices 

I want to capture all text between ### __Description of the report__ and ### __Notable filters__, which could be numbers, letters, or any combination of special characters.

I thought using ### __Description of the report__(.*?)### __Notable filters__ would work, but it doesn't return any results. How can I extract the text between the two headings?

You can use String's split method with both headers as the regex, joined with the '|' (alternation) operator.

With this approach, because the text starts with the first header, the first element of the array is an empty string. The content of the first section ends up in the second element (index 1) and the content of the second section in the third element (index 2).

Please check this code:

public class Test {
    private String testString = "### __Description of the report__\n" +
"Lorem ipsum dolor sit amet,  & mauris elit, blandit a turpis vel nibh, \n" +
"consectetuer aliquam. Nec sem. Venenatis quam etiam donec consequat \n" +
"sagittis, luctus porttitor odit sollicitudin <> vestibulum ultrices erat,\n" +
"sed eleifend \n" +
"* amet, sollicitudin sit egestas \n" +
"* quis eros nulla. Sed donec\n" +
"\n" +
"### __Notable filters__\n" +
"* Lorem ipsum dolor sit amet, mauris elit, blandit a turpis vel\n" +
"* consectetuer aliquam. Nec sem. Venenatis quam etiam donec consequat \n" +
"* sagittis, luctus porttitor odit sollicitudin vestibulum ultrices ";

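    public static void main (String[] args)
    {
        Test t = new Test();
        // Split on either header line. The text begins with a header, so parts[0] is empty.
        String[] parts = t.testString.split("### __Description of the report__\n|### __Notable filters__\n");
        System.out.println(parts[1]); // text under "Description of the report"
        System.out.println(parts[2]); // text under "Notable filters"
    }
}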
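As a quick check of how split behaves here, the following is a minimal sketch on a shortened version of the text. The class name SplitSections and the helper splitOnHeaders are hypothetical, and Pattern.quote is added only so the headers are treated as literal text; it is not part of the answer above.

```java
import java.util.regex.Pattern;

class SplitSections {
    // Hypothetical helper: splits the text on either header line.
    // Pattern.quote makes sure the headers are matched literally.
    static String[] splitOnHeaders(String text) {
        String first = Pattern.quote("### __Description of the report__\n");
        String second = Pattern.quote("### __Notable filters__\n");
        return text.split(first + "|" + second);
    }

    public static void main(String[] args) {
        String text = "### __Description of the report__\n"
                + "Lorem ipsum\n"
                + "* amet\n"
                + "\n"
                + "### __Notable filters__\n"
                + "* Lorem ipsum dolor";

        String[] parts = splitOnHeaders(text);
        System.out.println(parts.length);       // 3
        System.out.println(parts[0].isEmpty()); // true: the text starts with a header
        System.out.println(parts[1].trim());    // body of the first section
        System.out.println(parts[2].trim());    // body of the second section
    }
}
```

Note that split gives no guarantee about which header produced which boundary, so this only works when the headers appear in the expected order.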

Use Pattern.DOTALL:

Pattern p = Pattern.compile("### __Description of the report__(.*?)### __Notable filters__", Pattern.DOTALL);

Pattern.MULTILINE only changes ^ and $ so that they match at the start and end of EVERY LINE; it has no effect on . and so it does not help here. With Pattern.DOTALL, . matches every character, including \n, which does not happen unless you specify Pattern.DOTALL.
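To see the difference between the flags, here is a small sketch that compiles the same expression with no flags, with MULTILINE, and with DOTALL. The class name FlagComparison and the helper matches are made up for illustration.

```java
import java.util.regex.Pattern;

class FlagComparison {
    static final String REGEX =
            "### __Description of the report__(.*?)### __Notable filters__";

    // Returns true when REGEX, compiled with the given flags, finds a match in text.
    static boolean matches(String text, int flags) {
        return Pattern.compile(REGEX, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "### __Description of the report__\n"
                + "Lorem ipsum\n"
                + "\n"
                + "### __Notable filters__\n"
                + "* item";

        System.out.println(matches(text, 0));                 // false: . stops at \n
        System.out.println(matches(text, Pattern.MULTILINE)); // false: only ^ and $ change
        System.out.println(matches(text, Pattern.DOTALL));    // true: . also matches \n
    }
}
```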

To store it, do this:

Matcher m = p.matcher(str); // 'str' is the string with the text
while(m.find())
{
    YourString = m.group(1);
}

Later, you can collapse the extra whitespace (including the newlines) into single spaces like this:

YourString = YourString.replaceAll("\\s+", " ");
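Putting the pieces of this answer together, a self-contained sketch might look like the following. The class name SectionExtractor and the methods extract and normalize are hypothetical; returning null when nothing matches and adding a final trim() are assumptions of this sketch, not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionExtractor {
    private static final Pattern SECTION = Pattern.compile(
            "### __Description of the report__(.*?)### __Notable filters__",
            Pattern.DOTALL);

    // Returns the text between the two headings, or null when there is no match.
    static String extract(String text) {
        Matcher m = SECTION.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Collapses every run of whitespace (newlines included) into a single space.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String text = "### __Description of the report__\n"
                + "Lorem  ipsum\n"
                + "* amet\n"
                + "\n"
                + "### __Notable filters__\n"
                + "* item";

        String section = extract(text);
        if (section != null) {
            System.out.println(normalize(section)); // Lorem ipsum * amet
        }
    }
}
```

Because normalize also replaces newlines, the list items end up on one line; skip that step if the line structure matters.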

Trying out your regex seems to return nothing because of your choice of expression:

... report__(.*?)### __N ...

The . character matches any character except a newline by default, so you need to either remove the newlines from your string before parsing or change your expression to account for the newlines in your input.


@CoffeehouseCoder's answer suggests using Pattern.DOTALL, which fixes this issue by allowing . to match newlines.


Alternatively, you can update your regex to match either a character or a newline, like so:

... report__((.|\\n)*?)### ...
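One caveat worth checking: in Java, . also excludes \r by default, so the (.|\\n) form may fail on text with Windows line endings. The sketch below compares it with two other common spellings, [\\s\\S] and the inline (?s) flag. Neither of those appears in the answers here; they are shown only for comparison, and the class name NewlineAlternatives and the helper firstGroup are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NewlineAlternatives {
    static final String START = "### __Description of the report__";
    static final String END = "### __Notable filters__";

    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String unix = START + "\nLorem ipsum\n" + END;
        String windows = START + "\r\nLorem ipsum\r\n" + END;

        String alternation = START + "((.|\\n)*?)" + END;  // form from the answer
        String charClass = START + "([\\s\\S]*?)" + END;   // any character at all
        String inlineFlag = "(?s)" + START + "(.*?)" + END; // DOTALL as an inline flag

        System.out.println(firstGroup(alternation, unix) != null);    // true
        System.out.println(firstGroup(alternation, windows) != null); // false: \r is not matched
        System.out.println(firstGroup(charClass, windows) != null);   // true
        System.out.println(firstGroup(inlineFlag, windows) != null);  // true
    }
}
```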
