Extract a substring between two certain words using regex in Java

I would like to extract the substring between two certain words using Java.

For example:

This is an important example about regex for my work.

I would like to extract everything between " an " and " for ".

What I did so far is:

// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
String sentence = "This is an important example about regex for my work and for me";
Pattern pattern = Pattern.compile("(?<=an).*.(?=for)");
Matcher matcher = pattern.matcher(sentence);

boolean found = false;
while (matcher.find()) {
    // group() already returns a String, so no toString() is needed
    System.out.println("I found the text: " + matcher.group());
    found = true;
}
if (!found) {
    System.out.println("I didn't find the text");
}

It works well.

But I want to do two additional things:

  1. If the sentence is "This is an important example about regex for my work and for me", I want to extract only up to the first " for ", i.e. "important example about regex".

  2. Sometimes I want to limit the number of words between the two delimiters to 3, i.e. "important example about".

Any ideas please?

For your first question, make it lazy. You can put a question mark after the quantifier, and the quantifier will then match as little as possible.

(?<=an).*?(?=for)

I have no idea what the additional . at the end of .*. is good for; it's unnecessary.

For your second question, you have to define what a "word" is. Here I would say it is probably just a sequence of non-whitespace characters followed by a whitespace character. Something like this:
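
To make the greedy/lazy difference concrete, here is a small sketch that runs both variants against the sentence from the question. The class name LazyVsGreedy and the helper firstMatch are made up for this illustration; only Pattern and Matcher come from the JDK.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedy {

    // Hypothetical helper: returns the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";

        // Greedy: .* runs to the end of the input and backtracks to the LAST "for".
        System.out.println("[" + firstMatch("(?<=an).*(?=for)", sentence) + "]");
        // prints "[ important example about regex for my work and ]"

        // Lazy: .*? grows one character at a time and stops before the FIRST "for".
        System.out.println("[" + firstMatch("(?<=an).*?(?=for)", sentence) + "]");
        // prints "[ important example about regex ]"
    }
}
```

Note that both results still carry the surrounding spaces, because the lookarounds only assert "an" and "for" and consume nothing.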

\S+\s

and repeat this 3 times, like this:

(?<=an)\s(\S+\s){3}(?=for)

To ensure that the pattern matches only on whole words, use word boundaries:

(?<=\ban\b)\s(\S+\s){1,5}(?=\bfor\b)

See it online here on Regexr.

{3} matches exactly 3 repetitions. For a minimum of 1 and a maximum of 3, use {1,3}.
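
One thing worth seeing in practice: because of the trailing lookahead, the counted pattern only matches when the whole gap between "an" and "for" fits the word count; it does not cut a longer gap down to its first 3 words. A rough sketch, assuming the "word" definition above (the class WordLimit and the method between are hypothetical names for this example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordLimit {

    // Hypothetical helper: returns the trimmed text between the words "an" and "for"
    // if it consists of minWords..maxWords whitespace-separated words, otherwise null.
    static String between(String input, int minWords, int maxWords) {
        String regex = "(?<=\\ban\\b)\\s(\\S+\\s){" + minWords + "," + maxWords + "}(?=\\bfor\\b)";
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group().trim() : null;
    }

    public static void main(String[] args) {
        String fourWords = "This is an important example about regex for my work";
        String threeWords = "This is an important example about for my work";

        System.out.println(between(fourWords, 1, 5));  // important example about regex
        System.out.println(between(fourWords, 3, 3));  // null: four words lie between "an" and "for"
        System.out.println(between(threeWords, 3, 3)); // important example about
    }
}
```

If the goal is instead to take just the first 3 words after "an" regardless of where "for" appears, dropping the lookahead would be one way to do it, but that is a different requirement from the one this pattern encodes.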

Alternative:

As dma_k correctly stated, in your case it's not necessary to use lookbehind and lookahead. See the Matcher documentation about groups.

You can use capturing groups instead. Just put the part you want to extract in parentheses and it will be put into a capturing group.

\ban\b(.*?)\bfor\b

See it online here on Regexr.

You can then access this group like this:

System.out.println("I found the text: " + matcher.group(1).toString());
                                                        ^

You have only one pair of parentheses, so it's simple: just pass 1 to matcher.group(1) to access the first capturing group.
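
Put together, the capturing-group variant could look like the sketch below. The class name CaptureGroupDemo and the method extract are invented for the example; the trim() call is my addition to drop the spaces that sit next to "an" and "for".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureGroupDemo {

    private static final Pattern BETWEEN = Pattern.compile("\\ban\\b(.*?)\\bfor\\b");

    // Hypothetical helper: returns the trimmed content of group 1, or null if nothing matches.
    static String extract(String input) {
        Matcher matcher = BETWEEN.matcher(input);
        return matcher.find() ? matcher.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";

        Matcher matcher = BETWEEN.matcher(sentence);
        if (matcher.find()) {
            // group() / group(0) is the whole match, delimiters included.
            System.out.println(matcher.group());  // an important example about regex for
            // group(1) is only what the parentheses captured.
            System.out.println(matcher.group(1)); // " important example about regex " (with spaces)
        }
        System.out.println(extract(sentence));    // important example about regex
    }
}
```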

Your regex is "an\\s+(.*?)\\s+for". It extracts all characters between "an" and "for", ignoring the surrounding whitespace ( \s+ ). The question mark makes the quantifier reluctant (non-greedy). It is needed to prevent the pattern .* from eating everything up to the last "for", including the first word "for".
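
A quick sketch of that regex in use. One caveat I would flag: without word boundaries, "an" can also match the tail of a longer word such as "man". The \b variant shown next to it is my own addition for comparison, and the names NoLookaround and extract are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NoLookaround {

    // Hypothetical helper: returns capturing group 1 of the first match, or null.
    static String extract(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String plain = "an\\s+(.*?)\\s+for";
        String bounded = "\\ban\\s+(.*?)\\s+for\\b"; // assumption: whole words are wanted

        String sentence = "This is an important example about regex for my work and for me";
        // \s+ on both sides keeps the surrounding spaces out of the group.
        System.out.println(extract(plain, sentence)); // important example about regex

        String tricky = "The man is an important example for me";
        System.out.println(extract(plain, tricky));   // is an important example ("an" matched inside "man")
        System.out.println(extract(bounded, tricky)); // important example
    }
}
```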

public class SubStringBetween {

public static String subStringBetween(String sentence, String before, String after) {

    int startSub = SubStringBetween.subStringStartIndex(sentence, before);
    int stopSub = SubStringBetween.subStringEndIndex(sentence, after);

    String newWord = sentence.substring(startSub, stopSub);
    return newWord;
}

public static int subStringStartIndex(String sentence, String delimiterBeforeWord) {

    int startIndex = 0;
    String newWord = "";
    int x = 0, y = 0;

    for (int i = 0; i < sentence.length(); i++) {
        newWord = "";

        if (sentence.charAt(i) == delimiterBeforeWord.charAt(0)) {
            startIndex = i;
            for (int j = 0; j < delimiterBeforeWord.length(); j++) {
                try {
                    if (sentence.charAt(startIndex) == delimiterBeforeWord.charAt(j)) {
                        newWord = newWord + sentence.charAt(startIndex);
                    }
                    startIndex++;
                } catch (Exception e) {
                }

            }
            if (newWord.equals(delimiterBeforeWord)) {
                x = startIndex;
            }
        }
    }
    return x;
}

public static int subStringEndIndex(String sentence, String delimiterAfterWord) {

    int startIndex = 0;
    String newWord = "";
    int x = 0;

    for (int i = 0; i < sentence.length(); i++) {
        newWord = "";

        if (sentence.charAt(i) == delimiterAfterWord.charAt(0)) {
            startIndex = i;
            for (int j = 0; j < delimiterAfterWord.length(); j++) {
                try {
                    if (sentence.charAt(startIndex) == delimiterAfterWord.charAt(j)) {
                        newWord = newWord + sentence.charAt(startIndex);
                    }
                    startIndex++;
                } catch (Exception e) {
                }

            }
            if (newWord.equals(delimiterAfterWord)) {
                x = startIndex;
                x = x - delimiterAfterWord.length();
            }
        }
    }
    return x;
}

}
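
As far as I can tell, the loops in the class above keep overwriting x, so they end up using the last occurrence of each delimiter in the sentence. If the first occurrence is what you want (as in the question), a shorter regex-free sketch can lean on String.indexOf. The class name SubStringBetweenIndexOf and returning null for a missing delimiter are my own choices here, not part of the answer above.

```java
class SubStringBetweenIndexOf {

    // Returns the text between the first occurrence of 'before' and the first
    // occurrence of 'after' that follows it, or null if either one is missing.
    static String between(String sentence, String before, String after) {
        int start = sentence.indexOf(before);
        if (start < 0) {
            return null;
        }
        start += before.length();                // skip past the opening delimiter
        int end = sentence.indexOf(after, start); // search only after the opening delimiter
        if (end < 0) {
            return null;
        }
        return sentence.substring(start, end);
    }

    public static void main(String[] args) {
        String sentence = "This is an important example about regex for my work and for me";
        // Passing the delimiters with their spaces keeps "an" from matching inside "important".
        System.out.println(between(sentence, " an ", " for ")); // important example about regex
    }
}
```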
