Regex to find sentence containing specific word (java) from paragraph

I have a list of words: dog, cat, leopard.

I'm trying to come up with a regex in Java to pull out every sentence that contains any one of the words (case insensitive) from a long paragraph. A sentence ends in ., ? or !. Could anyone help? Thank you!

The following assumes a sentence starts with a capital letter, and that there are no ., ! or ? in the sentence, apart from at the end of it.

String str = "Hello. It's a leopard I think. How are you? It's just a dog or a cat. Are you sure?";
Pattern p = Pattern.compile("[A-Z](?i)[^.?!]*?\\b(dog|cat|leopard)\\b[^.?!]*[.?!]");
Matcher m = p.matcher(str);

while (m.find()) {
    System.out.println(m.group());
}
// It's a leopard I think.
// It's just a dog or a cat.
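
Where (?i) sits in the pattern matters: an embedded flag only affects what comes after it, which is why [A-Z] in the snippet above still requires an upper-case letter while the keywords match in any case. A minimal sketch of that difference (the class name and the found helper are hypothetical, used only for illustration):

```java
import java.util.regex.Pattern;

public class FlagScopeDemo {
    // Hypothetical helper: true if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // (?i) placed after [A-Z]: the first letter must be upper case,
        // the keyword may be in any case.
        String scoped = "[A-Z](?i)[^.?!]*?\\b(dog|cat|leopard)\\b[^.?!]*[.?!]";
        // (?i) placed at the very start: [A-Z] also accepts lower-case letters.
        String global = "(?i)[A-Z][^.?!]*?\\b(dog|cat|leopard)\\b[^.?!]*[.?!]";

        System.out.println(found(scoped, "It's a LEOPARD.")); // true
        System.out.println(found(scoped, "it's a leopard.")); // false
        System.out.println(found(global, "it's a leopard.")); // true
    }
}
```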

Assumptions

  • A sentence must start with a capital letter, with no sentence terminators [.?!] in between.
  • Keyword match is case insensitive. A sub-string match is not valid though.
  • Keywords may appear anywhere in the sentence (start, middle or end).
  • Supports quotations and informal double punctuation. Use the second regex if this is not required.
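
The "no sub-string match" assumption comes from the lookarounds placed around the keyword group. A small sketch that checks this part in isolation, assuming only the keyword guard is of interest (the class name and the hasKeyword helper are hypothetical):

```java
import java.util.regex.Pattern;

public class WholeWordDemo {
    // Keyword guard only: no word character directly before or after the keyword.
    static final Pattern KEYWORD =
            Pattern.compile("(?<!\\w)(?i)(dog|cat|leopard)(?!\\w)");

    // Hypothetical helper for illustration.
    static boolean hasKeyword(String text) {
        return KEYWORD.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(hasKeyword("But blackdog or catwoman shouldn't match.")); // false
        System.out.println(hasKeyword("Is that meow at the end my cat?"));           // true
        System.out.println(hasKeyword("DOG, cat, leopard"));                         // true
    }
}
```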

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceFinder {

    public static void main(String[] args) {
        String paragraph = "I have a list of words to match: dog, cat, leopard. But blackdog or catwoman shouldn't match. Dog may bark at the start! Is that meow at the end my cat? Some bonus sentence matches shouldn't hurt. My dog gets jumpy at times and behaves super excited!! My cat sees my goofy dog and thinks WTF?! Leopard likes to quote, \"I'm telling you these Lions suck bro!\" Sometimes the dog asks too, \"Cat got your tongue?!\"";
        Pattern p = Pattern.compile("([A-Z][^.?!]*?)?(?<!\\w)(?i)(dog|cat|leopard)(?!\\w)[^.?!]*?[.?!]{1,2}\"?");
        Matcher m = p.matcher(paragraph);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
    /* Output:
       I have a list of words to match: dog, cat, leopard.
       Dog may bark at the start!
       Is that meow at the end my cat?
       My dog gets jumpy at times and behaves super excited!!
       My cat sees my goofy dog and thinks WTF?!
       Leopard likes to quote, "I'm telling you these Lions suck bro!"
       Sometimes the dog asks too, "Cat got your tongue?!"
    */
}
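
If the word list is not known in advance, the alternation can be assembled at run time instead of being typed into the pattern. The following is a sketch under that assumption, reusing the regex from the class above; findSentences and KeywordSentences are hypothetical names, and Pattern.quote is used so that a word containing regex metacharacters is still treated literally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class KeywordSentences {
    // Hypothetical helper: returns each match that contains one of the words as a whole word.
    static List<String> findSentences(String paragraph, List<String> words) {
        // Quote every word, then join them into (w1|w2|...).
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        Pattern p = Pattern.compile(
                "([A-Z][^.?!]*?)?(?<!\\w)(?i)(" + alternation + ")(?!\\w)[^.?!]*?[.?!]{1,2}\"?");
        List<String> sentences = new ArrayList<>();
        Matcher m = p.matcher(paragraph);
        while (m.find()) {
            sentences.add(m.group());
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Hello. It's a leopard I think. How are you? "
                + "It's just a dog or a cat. Are you sure?";
        // Prints the two sentences that mention a listed word.
        findSentences(text, List.of("dog", "cat", "leopard")).forEach(System.out::println);
    }
}
```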


Simplified regex if "Quotes?!" (or informal punctuation) isn't required:
"([A-Z][^.?!]*?)?(?<!\\w)(?i)(dog|cat|leopard)(?!\\w)[^.?!]*?[.?!]"

To also fetch sentences that don't start with a capital letter (if the input may have such typos):
"(?i)([a-z][^.?!]*?)?(?<!\\w)(dog|cat|leopard)(?!\\w)[^.?!]*?[.?!]"

This should do it. You just have to populate the words you want in the middle. Example:

hello there i am a dog and i love to do things? Don't take my weakness for kindness. My bark is better than the bite of a leopard! So adopt me over another animal. Like a cat.

Matches:

hello there i am a dog and i love to do things? My bark is better than the bite of a leopard! Like a cat.

Also add (?i) to ignore case. I didn't put it in because I don't really remember the syntax, but someone else wrote it.

"(?=.*?\\.)[^ .?!][^.?!]*?(dog|cat|leopard).*?[.?!]"

Try this regex:

   str.matches("(?i)(^|\\s+)(dog|cat|leopard)(\\s+|[.?!]$)");

(?i) is a special construct that means case insensitive.

.*(cat|dog|leopard).*(\\.|\\?|\\!)$ and you should use the CASE_INSENSITIVE option of java.util.regex.Pattern.
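
The CASE_INSENSITIVE option is passed as the second argument to Pattern.compile and has the same effect as a leading (?i). A minimal sketch of one way to use it, assuming the paragraph is first split into sentences after each ., ? or ! and each sentence is then tested on its own; the splitting step, the \b word boundaries and all names here are assumptions for illustration, not part of the answer above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PerSentenceCheck {
    // Flag-based alternative to writing (?i) inside the pattern.
    static final Pattern KEYWORD =
            Pattern.compile("\\b(cat|dog|leopard)\\b", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: keeps the sentences that mention a keyword.
    static List<String> matchingSentences(String paragraph) {
        List<String> result = new ArrayList<>();
        // Split after each terminator, dropping the whitespace that follows it.
        for (String sentence : paragraph.split("(?<=[.?!])\\s+")) {
            if (KEYWORD.matcher(sentence).find()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Hello. It's a LEOPARD I think. How are you? "
                + "It's just a Dog or a cat. Are you sure?";
        matchingSentences(text).forEach(System.out::println);
    }
}
```

Splitting first keeps the keyword pattern short, at the cost of treating every ., ? or ! followed by whitespace as a sentence end.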
