How to split a paragraph into sentences in Java

Question

I am trying to split a paragraph into sentences. The paragraph can contain a word like FCB, and it also includes HTML tags such as anchors and other tags. I tried the expression below, but it does not split the paragraph into individual sentences cleanly while leaving the HTML tags as they are.

String.split("(?<!\\.[a-zA-Z])\\.(?![a-zA-Z]\\.)(?![<[^>]*>])");

Can anyone help me with a better regular expression, or suggest another approach?

Answer 1

You can try this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String par = "In 2004, Obama received national attention during his campaign to represent Illinois in the United States Senate with his victory in the March Democratic Party primary, his keynote address at the Democratic National Convention in July, and his election to the Senate in November. He began his presidential campaign in 2007 and, after a close primary campaign against Hillary Clinton in 2008, he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination.";
// Match each sentence: text up to a '.', '!' or '?' that is followed by whitespace or the end of input.
Pattern pattern = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
Matcher matcher = pattern.matcher(par);
// Print one sentence per line.
while (matcher.find()) {
    System.out.println(matcher.group());
}

Let me know if it works.
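For reuse, the loop above could be wrapped in a small helper that returns the sentences as a list. This is only a minimal sketch: the class name `SentenceSplitter` and the method `split` are hypothetical names chosen for illustration, and the `Pattern.COMMENTS` flag is left out on the assumption that it is not needed, since the pattern contains no literal whitespace or `#`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not part of the original answer.
class SentenceSplitter {
    // Same pattern as in the answer: a sentence runs up to a '.', '!' or '?'
    // that is followed by whitespace or the end of the input.
    private static final Pattern SENTENCE = Pattern.compile(
        "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)",
        Pattern.MULTILINE);

    // Collect every match of the pattern, in order of appearance.
    static List<String> split(String paragraph) {
        List<String> sentences = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(paragraph);
        while (matcher.find()) {
            sentences.add(matcher.group());
        }
        return sentences;
    }

    public static void main(String[] args) {
        // A period inside "F.C.B" is not followed by whitespace, so it does not end a sentence.
        for (String s : split("He plays for F.C.B now. Is that right? Yes!")) {
            System.out.println(s);
        }
    }
}
```

Note that this pattern knows nothing about HTML, so punctuation inside a tag attribute may still be treated as a sentence boundary.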

Answer 2

Description

Rather than splitting on characters, it is easier to match and capture each sentence as a substring.

(?:<(?:(?:[a-z]+\s(?:[^>=]|='[^']*'|="[^"]*"|=[^'"\s]*)*"\s?\/?|\/[a-z]+)>)|(?:(?!<)(?:[^.?!]|[.?!](?=\S)))*)+[.?!]


This regular expression will do the following:

Match each sentence
Allow substrings like FCB
Ignore HTML tags, but include them in the capture

Note: In a Java string literal you'll need to escape every \ so that it looks like \\
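The escaping rule in the note can be illustrated with a much smaller expression. The sketch below is an assumption-laden illustration rather than part of the original answer: the class `EscapeDemo` and the method `countTerminators` are hypothetical names, and the raw regex `[.?!](?=\s)` is just a fragment picked to show one backslash being doubled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class EscapeDemo {
    // Raw regex:    [.?!](?=\s)
    // Java literal: "[.?!](?=\\s)"  (the single backslash is written twice)
    private static final Pattern TERMINATOR = Pattern.compile("[.?!](?=\\s)");

    // Count '.', '?' or '!' characters that are directly followed by whitespace.
    static int countTerminators(String text) {
        Matcher m = TERMINATOR.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countTerminators("One. Two! F.C.B three? "));
    }
}
```

Writing `"\s"` with a single backslash would not pass the regex escape through to `Pattern.compile`, which is why the doubled form is used.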

Example

Live Demo

https://regex101.com/r/fJ9zS0/3

Sample text

I am was trying to split paragraph to sentences. The paragraph can have a word like F.C.B also it includes some html tag like anchor and other tags. I was trying to use like below but it was not perfect separating my paragraph to the specific sentence by living the html tag as it is.

In 2004, he <a href="http://test.pic.org/jpeg."> received </a> national attention during his Party primary, his keynote address July, <a onmouseover=" fnRotator('I like droids. '); "> and </a> his election to the Senate in November. He began his presidential campaign in he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination.

Sample Matches

Java Code Example:
import java.util.regex.Pattern;
import java.util.regex.Matcher;

class Module1 {
  public static void main(String[] asd) {
    String sourcestring = " ----your source string goes here----- ";
    Pattern re = Pattern.compile("(?:<(?:(?:[a-z]+\\s(?:[^>=]|='[^']*'|=\"[^\"]*\"|=[^'\"\\s]*)*\"\\s?\\/?|\\/[a-z]+)>)|(?:(?!<)(?:[^.?!]|[.?!](?=\\S)))*)+[.?!]", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
    Matcher m = re.matcher(sourcestring);
    int mIdx = 0;
    while (m.find()) {
      for (int groupIdx = 0; groupIdx < m.groupCount() + 1; groupIdx++) {
        System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
      }
      mIdx++;
    }
  }
}

Sample Output

$matches Array:
(
    [0] => Array
        (
            [0] => I am was trying to split paragraph to sentences.
            [1] =>  The paragraph can have a word like F.C.B also it includes some html tag like anchor and other tags.
            [2] =>  I was trying to use like below but it was not perfect separating my paragraph to the specific sentence by living the html tag as it is.
            [3] => 

In 2004, he <a href="http://test.pic.org/jpeg."> received </a> national attention during his Party primary, his keynote address July, <a onmouseover=" fnRotator('I like droids. '); "> and </a> his election to the Senate in November.
            [4] =>  He began his presidential campaign in he won sufficient delegates in the Democratic Party primaries to receive the presidential nomination.
        )
    )
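As the sample output shows, each match after the first begins with the whitespace that separated it from the previous sentence. One possible way to get clean sentences is to collect the matches into a list and trim them. The sketch below assumes that trimming is acceptable for your use case; the class `HtmlSentenceMatcher` and the method `sentences` are hypothetical names, and the input string is a shortened stand-in for the sample text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper; the names are illustrative, not from the original answer.
class HtmlSentenceMatcher {
    // Same expression as in the answer, split over two literals for readability.
    private static final Pattern SENTENCE = Pattern.compile(
        "(?:<(?:(?:[a-z]+\\s(?:[^>=]|='[^']*'|=\"[^\"]*\"|=[^'\"\\s]*)*\"\\s?\\/?|\\/[a-z]+)>)"
        + "|(?:(?!<)(?:[^.?!]|[.?!](?=\\S)))*)+[.?!]",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);

    // Return every matched sentence with surrounding whitespace removed.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            result.add(m.group().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "I like F.C.B a lot. See <a href=\"x.html\"> this </a> page. Done!";
        for (String s : sentences(text)) {
            System.out.println(s);
        }
    }
}
```

Text after the last terminator is not returned, because the pattern requires a closing `.`, `?` or `!`.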

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  (?:                      group, but do not capture (1 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    <                        '<'
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        [a-z]+                   any character of: 'a' to 'z' (1 or
                                 more times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        \s                       whitespace (\n, \r, \t, \f, and " ")
----------------------------------------------------------------------
        (?:                      group, but do not capture (0 or more
                                 times (matching the most amount
                                 possible)):
----------------------------------------------------------------------
          [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
         |                        OR
----------------------------------------------------------------------
          ='                       '=\''
----------------------------------------------------------------------
          [^']*                    any character except: ''' (0 or
                                   more times (matching the most
                                   amount possible))
----------------------------------------------------------------------
          '                        '\''
----------------------------------------------------------------------
         |                        OR
----------------------------------------------------------------------
          ="                       '="'
----------------------------------------------------------------------
          [^"]*                    any character except: '"' (0 or
                                   more times (matching the most
                                   amount possible))
----------------------------------------------------------------------
          "                        '"'
----------------------------------------------------------------------
         |                        OR
----------------------------------------------------------------------
          =                        '='
----------------------------------------------------------------------
          [^'"\s]*                 any character except: ''', '"',
                                   whitespace (\n, \r, \t, \f, and "
                                   ") (0 or more times (matching the
                                   most amount possible))
----------------------------------------------------------------------
        )*                       end of grouping
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
        \s?                      whitespace (\n, \r, \t, \f, and " ")
                                 (optional (matching the most amount
                                 possible))
----------------------------------------------------------------------
        \/?                      '/' (optional (matching the most
                                 amount possible))
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        \/                       '/'
----------------------------------------------------------------------
        [a-z]+                   any character of: 'a' to 'z' (1 or
                                 more times (matching the most amount
                                 possible))
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
      >                        '>'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
        <                        '<'
----------------------------------------------------------------------
      )                        end of look-ahead
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        [^.?!]                   any character except: '.', '?', '!'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        [.?!]                    any character of: '.', '?', '!'
----------------------------------------------------------------------
        (?=                      look ahead to see if there is:
----------------------------------------------------------------------
          \S                       non-whitespace (all but \n, \r,
                                   \t, \f, and " ")
----------------------------------------------------------------------
        )                        end of look-ahead
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
  )+                       end of grouping
----------------------------------------------------------------------
  [.?!]                    any character of: '.', '?', '!'
----------------------------------------------------------------------
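The breakdown above can also be read as two building blocks, a tag branch and a text branch, joined by an alternation that repeats until a sentence terminator. The sketch below rebuilds the expression from those pieces so each one can be tried in isolation; the class `PatternParts` and the constant names `TAG`, `TEXT` and `SENTENCE` are hypothetical and exist only for this illustration.

```java
import java.util.regex.Pattern;

// Hypothetical class; the constant names are illustrative only.
class PatternParts {
    // Tag branch: an opening tag with attributes, or a closing tag.
    static final String TAG =
        "<(?:(?:[a-z]+\\s(?:[^>=]|='[^']*'|=\"[^\"]*\"|=[^'\"\\s]*)*\"\\s?\\/?|\\/[a-z]+)>)";

    // Text branch: any run of characters outside a tag, where '.', '?' or '!'
    // is allowed only when a non-whitespace character follows it.
    static final String TEXT = "(?:(?!<)(?:[^.?!]|[.?!](?=\\S)))*";

    // Full expression: tags and text repeated, ended by a terminator.
    static final String SENTENCE = "(?:" + TAG + "|" + TEXT + ")+[.?!]";

    public static void main(String[] args) {
        System.out.println(Pattern.matches(TAG, "<a href=\"x.html\">"));
        System.out.println(Pattern.matches(TAG, "</a>"));
        System.out.println(Pattern.matches(TEXT, "a word like F.C.B"));
        System.out.println(Pattern.matches(SENTENCE, "Plain text."));
    }
}
```

Keeping the pieces as separate constants makes it easier to adjust one branch, for example the tag branch, without re-reading the whole expression.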

Answer 1
1 2016-06-08 21:42:13

Answer 2
1 2016-06-09 02:12:36
