
Extracting sentences that contain a particular word

I want to extract the sentences in a text file that contain a particular keyword. I have tried a lot but cannot get the proper sentences that contain the keyword. I have more than one set of keywords; if any of them matches in the paragraph, that sentence should be taken. For example, if my text file contains words like robbery, robbed, etc., then that sentence should be extracted. Below is the code I tried. Is there any way to solve this using regex? Any help will be appreciated.

BufferedReader br1 = new BufferedReader(new FileReader("/home/pgrms/Documents/test/one.txt"));
String str = "";

while (br1.ready())
{
    str += br1.readLine() + "\n";
}

Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher match = re.matcher(str);
String sentenceString = "";
while (match.find())
{
    sentenceString = match.group(0);
    System.out.println(sentenceString);
}

Here is an example for when you have a list of predefined keywords:

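import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tester {

    public static void main(String[] args) {
        String[] words = {"robbery", "robbed", "robbers"};

        // Join the keywords into one alternation: robbery|robbed|robbers
        String wordRe = String.join("|", words);
        // A "sentence" here is a run of non-period characters that contains
        // one of the keywords as a whole word and ends with a period.
        wordRe = "[^.]*\\b(" + wordRe + ")\\b[^.]*[.]";

        StringBuilder text = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br1 = new BufferedReader(new FileReader("input"))) {
            String line;
            while ((line = br1.readLine()) != null) {
                // keep a space between lines so words at line ends do not merge
                text.append(line).append(' ');
            }
        } catch (IOException e) {
            // report the failure instead of silently swallowing it
            e.printStackTrace();
            return;
        }

        // MULTILINE and COMMENTS are not needed: the pattern has no ^ or $,
        // and COMMENTS would drop any spaces inside a keyword.
        Pattern re = Pattern.compile(wordRe, Pattern.CASE_INSENSITIVE);
        Matcher match = re.matcher(text);
        while (match.find()) {
            System.out.println(match.group(0).trim());
        }
    }

}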

This creates a regex of the form:

[^.]*\b(robbery|robbed|robbers)\b[^.]*[.]
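If the keyword list comes from user input or a file, it may be safer to quote each keyword before joining, so that characters such as `.` or `+` inside a keyword are not read as regex syntax. The sketch below builds a pattern of the same form from a list and runs it on an in-memory string. The class name `KeywordSentenceRegex` and the methods `build` and `extract` are illustrative names, not part of the original answer, and the group is made non-capturing because the match itself is all that is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class KeywordSentenceRegex {

    // Builds [^.]*\b(?:kw1|kw2|...)\b[^.]*[.] with every keyword quoted literally.
    static Pattern build(List<String> keywords) {
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile("[^.]*\\b(?:" + alternation + ")\\b[^.]*[.]",
                Pattern.CASE_INSENSITIVE);
    }

    // Returns every period-terminated sentence that contains one of the keywords.
    static List<String> extract(String text, List<String> keywords) {
        List<String> sentences = new ArrayList<>();
        Matcher m = build(keywords).matcher(text);
        while (m.find()) {
            sentences.add(m.group().trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The bank was robbed at noon. Nobody was hurt. "
                + "Police are looking for the robbers.";
        for (String s : extract(text, List.of("robbery", "robbed", "robbers"))) {
            System.out.println(s);
        }
    }
}
```

As in the answer above, only a period ends a sentence here, so text that ends in `!` or `?` would be merged into the following sentence.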

In general, to check whether a sentence contains rob, robbery, or robbed, you can add a lookahead after the beginning-of-string anchor, before the rest of your regex pattern:

(?=.*(?:rob|robbery|robbed))

In this case, it is more efficient to group the common prefix rob and then check for the potential suffixes. Note that because the suffix group is optional, the lookahead succeeds wherever the letters rob occur; wrap the group in \b if only whole words should count:

(?=.*(?:rob(?:bery|bed)?))

In your Java code, we can (for instance) modify your loop like this:

while (match.find())
{
    sentenceString = match.group(0);
    // String.matches() must match the whole string, so the lookahead is
    // followed by .* and (?s) lets . cross the newlines kept from the file.
    if (sentenceString.matches("(?s)(?=.*(?:rob(?:bery|bed)?)).*")) {
        System.out.println(sentenceString);
    }
}

Regex explanation

(?=                      # look ahead to see if there is:
  .*                     #   any character except \n (0 or more times
                         #   (matching the most amount possible))
  (?:                    #   group, but do not capture:
    rob                  #     'rob'
    (?:                  #     group, but do not capture (optional
                         #     (matching the most amount possible)):
      bery               #       'bery'
     |                   #      OR
      bed                #       'bed'
    )?                   #     end of grouping
  )                      #   end of grouping
)                        # end of look-ahead
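When filtering sentences in Java, keep in mind that `String.matches` only succeeds if the pattern matches the entire string, whereas `Matcher.find` looks for a match anywhere inside it. A minimal sketch of the same keyword check using a precompiled pattern and `find` follows. The class name `KeywordCheck` and the `\b` word boundaries are additions for illustration; the boundaries are there so that a word like "problem", which happens to contain the letters rob, is not reported.

```java
import java.util.regex.Pattern;

class KeywordCheck {

    // rob, robbed or robbery as a whole word, in any letter case.
    private static final Pattern KEYWORD =
            Pattern.compile("\\brob(?:bery|bed)?\\b", Pattern.CASE_INSENSITIVE);

    // find() searches inside the sentence, so no leading or trailing .* is needed.
    static boolean containsKeyword(String sentence) {
        return KEYWORD.matcher(sentence).find();
    }

    public static void main(String[] args) {
        String[] sentences = {
            "He was robbed.",
            "That is a problem.",
            "Robbery is a crime\nin every state."
        };
        for (String s : sentences) {
            if (containsKeyword(s)) {
                System.out.println(s);
            }
        }
    }
}
```

Compiling the pattern once and reusing it also avoids recompiling the regex for every sentence, which is what `String.matches` does on each call.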

Take a look at the ICU Project and ICU4J. It does boundary analysis, so it splits sentences and words for you, and it does so for different languages.
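To give an idea of what boundary analysis looks like in code, here is a minimal sketch that splits text into sentences. It uses the JDK's own `java.text.BreakIterator` as a stand-in so that it runs without extra dependencies; ICU4J provides a class of the same name in `com.ibm.icu.text` with, as far as I know, the same basic methods, but that swap is an assumption and is not shown here. The class name `SentenceSplitter` is illustrative.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {

    // Splits text at the sentence boundaries reported for the given locale.
    static List<String> split(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The bank was robbed! Was anyone hurt? No.";
        for (String s : split(text, Locale.ENGLISH)) {
            System.out.println(s);
        }
    }
}
```

Unlike the `[^.]*` patterns above, this approach also treats `!` and `?` as sentence terminators.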

For the rest, you can either match the words against a Pattern (as others have suggested) or check them against a Set of the words you're interested in.
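The Set variant could look roughly like the sketch below: the sentence is lowercased, split into words on anything that is not a letter or digit, and each word is looked up in the set. The class name `KeywordSet` and the method `mentionsAny` are illustrative, and the sketch assumes the keywords in the set are already lowercase.

```java
import java.util.Locale;
import java.util.Set;

class KeywordSet {

    // True if any whole word of the sentence is one of the (lowercase) keywords.
    static boolean mentionsAny(String sentence, Set<String> keywords) {
        String[] tokens = sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (keywords.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> keywords = Set.of("robbery", "robbed", "robbers");
        System.out.println(mentionsAny("Police caught the robbers.", keywords));
        System.out.println(mentionsAny("There is a problem.", keywords));
    }
}
```

A set lookup only finds exact words, so every inflected form that should count (robbed, robbers, and so on) has to be listed explicitly.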
