Extract sentences that contain a specific word

Question

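I want to get the sentences in a text file that contain particular keywords. I have tried a lot, but I cannot get the correct sentences containing the keyword. I have more than one set of keywords, and if a sentence matches any one of them it should be taken. For example: if my text file contains words like robbery, robbed, etc., then the sentence containing that word should be extracted. Below is the code I tried. Is there any way to solve this using a regular expression? Any help would be appreciated.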

BufferedReader br1 = new BufferedReader(new FileReader("/home/pgrms/Documents/test/one.txt"));
String str = "";

while (br1.ready())
{
    str += br1.readLine() + "\n";
}

Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher match = re.matcher(str);
String sentenceString = "";
while (match.find())
{
    sentenceString = match.group(0);
    System.out.println(sentenceString);
}

Answer 1

Here is an example for when you have a predefined list of keywords:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.*;

public class Tester {

    public static void main(String[] args) {
        String[] words = {"robbery", "robbed", "robbers"};

        // Join the keywords into one alternation: robbery|robbed|robbers
        String word_re = words[0];
        for (int i = 1; i < words.length; i++)
            word_re += "|" + words[i];
        // A sentence is taken to be a run of non-period characters that
        // contains a keyword as a whole word and ends with a period.
        word_re = "[^.]*\\b(" + word_re + ")\\b[^.]*[.]";

        StringBuilder str = new StringBuilder();
        try (BufferedReader br1 = new BufferedReader(new FileReader("input"))) {
            String line;
            // Append a space so words on adjacent lines do not run together
            while ((line = br1.readLine()) != null) { str.append(line).append(' '); }
        } catch (IOException e) {
            e.printStackTrace();
            return;
        }

        Pattern re = Pattern.compile(word_re, Pattern.CASE_INSENSITIVE);
        Matcher match = re.matcher(str);
        while (match.find()) {
            // trim() drops the space left over after the previous period
            System.out.println(match.group(0).trim());
        }
    }

}

This creates a regular expression of the following form:

[^.]*\b(robbery|robbed|robbers)\b[^.]*[.]
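As a variation on the pattern above, the following is a minimal sketch that builds the alternation from a list and also treats `!` and `?` as sentence terminators. The class name `KeywordSentences` and the helpers `buildPattern` and `extract` are hypothetical names chosen for illustration, and the rule that a sentence ends at `.`, `!` or `?` is an assumption that will misfire on abbreviations such as "Mr.".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class KeywordSentences {

    // Builds one pattern from a keyword list. Pattern.quote keeps any regex
    // metacharacters inside a keyword from being interpreted.
    static Pattern buildPattern(List<String> keywords) {
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // Assumption: a sentence ends at '.', '!' or '?'.
        return Pattern.compile(
                "[^.!?]*\\b(?:" + alternation + ")\\b[^.!?]*[.!?]",
                Pattern.CASE_INSENSITIVE);
    }

    // Returns every sentence of the text that contains at least one keyword.
    static List<String> extract(String text, List<String> keywords) {
        List<String> sentences = new ArrayList<>();
        Matcher m = buildPattern(keywords).matcher(text);
        while (m.find()) {
            sentences.add(m.group().trim());
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The bank was robbed at noon. Nobody was hurt! "
                + "Were the robbers caught? Police are investigating.";
        List<String> keywords = Arrays.asList("robbery", "robbed", "robbers");
        // Prints the first and third sentences, one per line
        for (String s : extract(text, keywords)) {
            System.out.println(s);
        }
    }
}
```

Quoting each keyword is a defensive choice: with plain string concatenation, a keyword containing a character such as `+` or `(` would change the meaning of the pattern.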

Answer 2

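In general, to check whether a sentence contains rob, robbery or robbed, you can add a lookahead after the beginning-of-string anchor and before the rest of your regex pattern: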

(?=.*(?:rob|robbery|robbed))

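In this case, it is more efficient to factor out rob and then check for the possible suffixes:

(?=.*(?:rob(?:bery|bed)?))

In your Java code we could, for example, modify your loop as follows. Because String.matches has to match the entire string and a lookahead consumes no characters, the lookahead is followed by .* here:

while (match.find())
{
    sentenceString = match.group(0);
    // (?s) lets . also match the newlines kept in str
    if (sentenceString.matches("(?s)(?=.*(?:rob(?:bery|bed)?)).*")) {
        System.out.println(sentenceString);
    }
}

Regular expression explained

(?=                      # look ahead to see if there is:
  .*                     #   any character except \n (0 or more times
                         #   (matching the most amount possible))
  (?:                    #   group, but do not capture:
    rob                  #     'rob'
    (?:                  #     group, but do not capture (optional
                         #     (matching the most amount possible)):
      bery               #       'bery'
     |                   #      OR
      bed                #       'bed'
    )?                   #     end of grouping
  )                      #   end of grouping
)                        # end of look-ahead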
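To make the behaviour of a bare lookahead concrete, here is a small sketch comparing `String.matches` with `Matcher.find`. The class name `LookaheadCheck` and the method `containsKeyword` are hypothetical, and the suffix group is written as `bery|bed` here so that the full words robbery and robbed are covered.

```java
import java.util.regex.Pattern;

class LookaheadCheck {

    // Lookahead for rob, robbery or robbed. DOTALL lets '.' cross line
    // breaks that may occur inside a sentence.
    static final Pattern CONTAINS_ROB =
            Pattern.compile("(?=.*(?:rob(?:bery|bed)?))", Pattern.DOTALL);

    static boolean containsKeyword(String sentence) {
        // find() succeeds as soon as the zero-width lookahead holds somewhere
        return CONTAINS_ROB.matcher(sentence).find();
    }

    public static void main(String[] args) {
        String s = "He was robbed yesterday.";
        // A lookahead alone consumes nothing, so it cannot match the whole string
        System.out.println(s.matches("(?=.*rob)"));      // false
        // Followed by .* the pattern can consume the whole string
        System.out.println(s.matches("(?=.*rob).*"));    // true
        System.out.println(containsKeyword(s));                    // true
        System.out.println(containsKeyword("Nothing happened."));  // false
        // No word boundaries: "rob" is also found inside "problem"
        System.out.println(containsKeyword("No problem."));        // true
    }
}
```

The last line shows a limitation of the pattern as given: without `\b` around the alternation it matches substrings, so wrapping the group as `\b(?:rob(?:bery|bed)?)\b` may be closer to what the question asks for.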

Answer 3

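Take a look at the ICU project and icu4j. It does boundary analysis, so it will split sentences and words for you, and it does this for different languages.

For the rest, you can either match the words against a pattern (as others have suggested) or check them against a set of words you are interested in.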
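The boundary-analysis approach can be sketched without an extra dependency by using the JDK's own `java.text.BreakIterator`, which is swapped in here for icu4j; ICU provides `com.ibm.icu.text.BreakIterator` with a similar interface. The class name `BoundarySentences` and its methods are hypothetical, and the keyword set is assumed to hold lower-case words.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class BoundarySentences {

    // Splits the text into sentences and keeps those containing a keyword.
    static List<String> extract(String text, Set<String> keywords) {
        List<String> hits = new ArrayList<>();
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        sentences.setText(text);
        int start = sentences.first();
        for (int end = sentences.next(); end != BreakIterator.DONE;
                start = end, end = sentences.next()) {
            String sentence = text.substring(start, end).trim();
            if (containsAny(sentence, keywords)) {
                hits.add(sentence);
            }
        }
        return hits;
    }

    // Splits the sentence into word tokens and checks each against the set.
    // Assumption: keywords are stored in lower case.
    static boolean containsAny(String sentence, Set<String> keywords) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        words.setText(sentence);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE;
                start = end, end = words.next()) {
            String token = sentence.substring(start, end).toLowerCase(Locale.ENGLISH);
            if (keywords.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> keywords =
                new HashSet<>(Arrays.asList("robbery", "robbed", "robbers"));
        String text = "The bank was robbed at noon. Nobody was hurt. "
                + "Were the robbers caught?";
        extract(text, keywords).forEach(System.out::println);
    }
}
```

Checking tokens against a set matches whole words only, so "problem" does not count as a hit for "rob", and the set lookup stays cheap as the keyword list grows.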


Answer 1: score 2, accepted, 2014-06-06 05:58:29

Answer 2: score 1, 2014-06-06 05:38:05

Answer 3: score 0, 2014-06-06 06:03:49
