How to split a paragraph into sentences in Java

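I am trying to split a paragraph into sentences. The paragraph can contain a word like F.C.B, and it also includes HTML tags such as anchors and other tags. I tried the expression below, but it does not split the paragraph into sentences correctly while leaving the HTML tags as they are.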

String.split("(?<!\\.[a-zA-Z])\\.(?![a-zA-Z]\\.)(?![<[^>]*>])");  

Could anyone suggest a better regular expression, or any other idea?

You can try the following:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String par = "In 2004, Obama received national attention during his campaign to represent Illinois in the United States Senate with his victory in the March Democratic Party primary, his keynote address at the Democratic National Convention in July, and his election to the Senate in November. He began his presidential campaign in 2007 and, after a close primary campaign against Hillary Clinton in 2008, he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination.";
// Match each sentence rather than splitting: a '.', '!' or '?' ends a sentence
// only when it is followed by whitespace or the end of the input.
Pattern pattern = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher matcher = pattern.matcher(par);
while (matcher.find()) {
    System.out.println(matcher.group());
}

Let me know if it works.
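
To reuse this pattern, the loop above can be wrapped in a small helper that returns the sentences as a list. This is only a sketch: the names `SentenceFinder` and `findSentences` are made up for the example, and it passes only `Pattern.MULTILINE`, on the assumption that the pattern contains no free-spacing whitespace or comments that would need `Pattern.COMMENTS`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceFinder {
    // Same expression as in the snippet above.
    private static final Pattern SENTENCE = Pattern.compile(
        "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)",
        Pattern.MULTILINE);

    // Hypothetical helper: returns every match of SENTENCE, in order.
    static List<String> findSentences(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(text);
        while (matcher.find()) {
            sentences.add(matcher.group());
        }
        return sentences;
    }

    public static void main(String[] args) {
        // A period followed by a non-space character (as in F.C.B)
        // does not end a sentence.
        for (String s : findSentences("The F.C.B team won. Really? Yes!")) {
            System.out.println(s);
        }
    }
}
```

Note that this pattern knows nothing about HTML, so punctuation followed by a space inside an attribute value would still be treated as a sentence end.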

Description

Rather than splitting on a character, it is easier to match and capture each sentence substring.

(?:<(?:(?:[a-z]+\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?|\/[a-z]+)>)|(?:(?!<)(?:[^.?!]|[.?!](?=\S)))*)+[.?!]

This regular expression will do the following:

  • match each sentence
  • allow substrings like F.C.B
  • ignore HTML tags, but include them in the capture

Note: in a Java string literal you need to escape every \ so that it is written \\.
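
As a small illustration of that escaping rule, the sketch below takes one fragment of the expression, `[^'"\s]*`, and writes it as a Java string literal. The class name `EscapeDemo` and the constant `FRAGMENT` are made up for the example. The backslash is doubled, and because the literal is delimited by double quotes, the `"` inside it needs a backslash as well.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Raw regex fragment:      [^'"\s]*
    // As a Java string literal the backslash is doubled and the quote is escaped.
    static final String FRAGMENT = "[^'\"\\s]*";

    public static void main(String[] args) {
        // The regex engine receives the raw form with a single backslash.
        System.out.println(FRAGMENT);                         // prints [^'"\s]*
        // Pattern.matches requires the whole input to match the pattern.
        System.out.println(Pattern.matches(FRAGMENT, "abc")); // prints true
        System.out.println(Pattern.matches(FRAGMENT, "a b")); // prints false (contains a space)
    }
}
```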

Live demo

https://regex101.com/r/fJ9zS0/3

Sample text

I am was trying to split paragraph to sentences. The paragraph can have a word like F.C.B also it includes some html tag like anchor and other tags. I was trying to use like below but it was not perfect separating my paragraph to the specific sentence by living the html tag as it is.

In 2004, he <a href="http://test.pic.org/jpeg."> received </a> national attention during his Party primary, his keynote address July, <a onmouseover=" fnRotator('I like droids. '); "> and </a> his election to the Senate in November. He began his presidential campaign in he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination.

Sample matches

Java Code Example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class Module1 {
  public static void main(String[] asd) {
    String sourcestring = " ----your source string goes here----- ";
    Pattern re = Pattern.compile("(?:<(?:(?:[a-z]+\\s(?:[^>=]|='[^']*'|=\"[^\"]*\"|=[^'\"\\s]*)*\"\\s?\\/?|\\/[a-z]+)>)|(?:(?!<)(?:[^.?!]|[.?!](?=\\S)))*)+[.?!]", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()) {
      for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

Sample output

$matches Array:
(
    [0] => Array
        (
            [0] => I am was trying to split paragraph to sentences.
            [1] =>  The paragraph can have a word like F.C.B also it includes some html tag like anchor and other tags.
            [2] =>  I was trying to use like below but it was not perfect separating my paragraph to the specific sentence by living the html tag as it is.
            [3] => 

In 2004, he <a href="http://test.pic.org/jpeg."> received </a> national attention during his Party primary, his keynote address July, <a onmouseover=" fnRotator('I like droids. '); "> and </a> his election to the Senate in November.
            [4] =>  He began his presidential campaign in he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination.
        )
    )
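
In the output above, every match after the first starts with the whitespace that separated it from the previous sentence (a space for [1], [2] and [4], line breaks for [3]). If that is unwanted, each match can be trimmed before it is stored. The sketch below is one possible way to do this, not part of the original answer: the names `HtmlSentenceSplitter` and `splitSentences` are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlSentenceSplitter {
    // Same expression and flags as in the Java code example above.
    private static final Pattern SENTENCE = Pattern.compile(
        "(?:<(?:(?:[a-z]+\\s(?:[^>=]|='[^']*'|=\"[^\"]*\"|=[^'\"\\s]*)*\"\\s?\\/?|\\/[a-z]+)>)|(?:(?!<)(?:[^.?!]|[.?!](?=\\S)))*)+[.?!]",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Hypothetical helper: collects each match with surrounding whitespace removed.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            String sentence = m.group().trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Visit <a href=\"http://example.org/a.\"> this </a> page. F.C.B won!";
        // The period inside the href value stays within the first sentence.
        for (String s : splitSentences(text)) {
            System.out.println(s);
        }
    }
}
```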

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    <                        '<'
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        [a-z]+                   any character of: 'a' to 'z' (1 or
                                 more times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
        (?:                      group, but do not capture (0 or more
                                 times (matching the most amount
                                 possible)):
----------------------------------------------------------------------
          [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
         |                        OR
----------------------------------------------------------------------
          ='                       '=\''
----------------------------------------------------------------------
          [^']*                    any character except: ''' (0 or
                                   more times (matching the most
                                   amount possible))
----------------------------------------------------------------------
          '                        '\''
----------------------------------------------------------------------
         |                        OR
----------------------------------------------------------------------
          ="                       '="'
----------------------------------------------------------------------
          [^"]*                    any character except: '"' (0 or
                                   more times (matching the most
                                   amount possible))
----------------------------------------------------------------------
          "                        '"'
----------------------------------------------------------------------
         |                        OR
----------------------------------------------------------------------
          =                        '='
----------------------------------------------------------------------
          [^'"\s]*                 any character except: ''', '"',
                                   whitespace (\n, \r, \t, \f, and "
                                   ") (0 or more times (matching the
                                   most amount possible))
----------------------------------------------------------------------
        )*                       end of grouping
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
        \s?                      whitespace (\n, \r, \t, \f, and " ")
                                 (optional (matching the most amount
                                 possible))
----------------------------------------------------------------------
        \/?                      '/' (optional (matching the most
                                 amount possible))
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        \/                       '/'
----------------------------------------------------------------------
        [a-z]+                   any character of: 'a' to 'z' (1 or
                                 more times (matching the most amount
                                 possible))
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
      >                        '>'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
        <                        '<'
----------------------------------------------------------------------
      )                        end of look-ahead
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        [^.?!]                   any character except: '.', '?', '!'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        [.?!]                    any character of: '.', '?', '!'
----------------------------------------------------------------------
        (?=                      look ahead to see if there is:
----------------------------------------------------------------------
          \S                       non-whitespace (all but \n, \r,
                                   \t, \f, and " ")
----------------------------------------------------------------------
        )                        end of look-ahead
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
  [.?!]                    any character of: '.', '?', '!'
----------------------------------------------------------------------
