
Java regex to split a sentence into words, keeping a value and its unit as a single word

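I am trying to split a sentence into a set of words. In addition, when a number is chunked, I want its unit of measure to stay attached to it.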

E.g. (made-up):
 document= The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs.

What I need is a set of words:

the
root
cause
problem
...
40 degrees
30 percent
1.67 gpm
1-19666 tablet
3 hrs

What I have tried:

List<String> bagOfWords = new ArrayList<>();
String[] words = StringUtils.normalizeSpace(document.replaceAll("[^0-9a-zA-Z_.-]", " ")).split(" ");
for (String word : words) {
    bagOfWords.add(StringUtils.normalizeSpace(word.replaceAll("\\.(?!\\d)", " ")));
}
System.out.println("NEW 2 :: " + bagOfWords.toString());

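This assumes that a word containing a digit is always followed by another word that does not contain a digit. Under that assumption, the following code works: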

    private static final String DOC = "The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs";

   // ...

    Pattern pattern = Pattern.compile("(\\b\\S*\\d\\S*\\b\\s+)?\\b\\S+\\b");
    Matcher matcher = pattern.matcher(DOC);
    List<String> words = new ArrayList<>();
    while (matcher.find()) {
        words.add(matcher.group());
    }
    for (String word : words) {
        System.out.println(word);
    }

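Explanation:

  • \\b matches a word boundary.
  • \\S is any non-whitespace character, so a word may contain anything, such as dots or commas.
  • (...)? is the optional first part. It captures the word containing a digit, if there is one: some characters (\\S*), then a digit (\\d), then some more characters (\\S*), followed by whitespace (\\s+).
  • The second word is simple: at least one non-whitespace character. That is why it has + after the S rather than *.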
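To see the pattern in action on a shorter input, here is a minimal self-contained sketch. The class name NumberUnitTokenizer, the method tokenize, and the sample sentence are made up for illustration; only the regex itself comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class, not part of the original answer.
public class NumberUnitTokenizer {

    // Optional "word containing a digit + whitespace", followed by one more word.
    private static final Pattern TOKEN =
            Pattern.compile("(\\b\\S*\\d\\S*\\b\\s+)?\\b\\S+\\b");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            // group() is the whole match: either "word" or "number word".
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Made-up sample sentence with trailing punctuation.
        String sample = "it is currently 40 degrees, take 1.67 gpm every 3 hrs.";
        for (String token : tokenize(sample)) {
            System.out.println(token);
        }
        // Prints: it, is, currently, 40 degrees, take, 1.67 gpm, every, 3 hrs
        // (one token per line)
    }
}
```

The trailing \\b makes the match stop before a closing comma or full stop, so "degrees," is reported as "degrees". Keep in mind that the pattern pairs a number with whatever word follows it, whether or not that word is actually a unit.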

Your question is a bit broad, but here is a hack that should work for most sentences in this format.

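First, create a list of keywords for your units, such as hrs, tablet, gpm, and so on. Once you have that list, it is easy to pick out what you need: whenever a token is one of the keywords, append it to the previous item.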

    String document= "The root cause of the problem is the temperature, it is currently 40 degrees which is 30 percent likely to turn into an infection doctor has prescribed 1-19666 tablet which contains 1.67 gpm and has advised to consume them every 3 hrs.";
    if(document.endsWith(".")){
        document = document.substring(0, document.length() -1 );
    }
    System.out.println(document);
    String[] splitted = document.split(" ");
    List<String> keywords = new ArrayList<>();
    keywords.add("degrees");
    keywords.add("percent");
    keywords.add("gpm");
    keywords.add("tablet");
    keywords.add("hrs");

    List<String> words = new ArrayList<>();

    for(String s : splitted){
        if(!s.equals(",")){
            //if s is not a comma;
            if(keywords.contains(s) && words.size()!=0){
                //if s is a keyword append to last item in list
                int lastIndex = words.size()-1;
                words.set(lastIndex, words.get(lastIndex)+" "+s);
            }
            else{
                words.add(s);
            }
        }
    }
    for(String s : words){
        System.out.println(s);
    }
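One thing to watch in the code above: splitting on a single space leaves punctuation glued to the words (for example "temperature,"), so the check against "," on its own never fires. The following is a rough sketch of the same keyword idea that strips leading and trailing punctuation from each token first. The class name KeywordUnitMerger, the method mergeUnits, and the sample sentence are hypothetical, and Set.of assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper class illustrating the keyword approach.
public class KeywordUnitMerger {

    // Same unit keywords as in the answer; extend this set for your own data.
    private static final Set<String> UNITS =
            Set.of("degrees", "percent", "gpm", "tablet", "hrs");

    public static List<String> mergeUnits(String document) {
        List<String> words = new ArrayList<>();
        for (String raw : document.trim().split("\\s+")) {
            // Drop punctuation at either end ("temperature," -> "temperature")
            // but keep inner characters such as the "." in 1.67 or "-" in 1-19666.
            String token = raw.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (token.isEmpty()) {
                continue; // the token was punctuation only
            }
            if (UNITS.contains(token) && !words.isEmpty()) {
                // A unit keyword is appended to the previous item.
                int last = words.size() - 1;
                words.set(last, words.get(last) + " " + token);
            } else {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up sample sentence.
        String sample = "it is currently 40 degrees, contains 1.67 gpm every 3 hrs.";
        for (String word : mergeUnits(sample)) {
            System.out.println(word);
        }
    }
}
```

As in the original loop, a keyword is attached to whatever precedes it, even when that is not a number. If that matters for your data, you could additionally check that the previous item contains a digit before merging.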
