
Splitting a paragraph into sentences in Java

I'm working on a task that requires splitting a paragraph into sentences. For example, given this paragraph:

"This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool."

I need the following 4 sentences:

This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence.

Sometimes there are problems, i.e. in this one.

here and abbr at the end x.y..

cool

This is very similar to this task, which was solved in JavaScript:

var re = /\b(\w\.\w\.)|([.?!])\s+(?=[A-Za-z])/g; 
var str = 'This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn\'t split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.';
var result = str.replace(re, function(m, g1, g2){
  return g1 ? g1 : g2+"\r";
});
var arr = result.split("\r");
document.body.innerHTML = "<pre>" + JSON.stringify(arr, 0, 4) + "</pre>";

I tried to implement this in Java with the help of this link, but I'm stuck on how to reproduce the replace callback from the snippet above in my Java code.

public static void main(String[] args) {
    String content = "This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.";
    Pattern p = Pattern.compile("/\\b(\\w\\.\\w\\.)|([.?!])\\s+(?=[A-Za-z])/g");
    Matcher m = p.matcher(content);
    List<String> tokens = new LinkedList<String>();
    while (m.find()) {
        String token = m.group(1); // group 0 is always the entire match
        tokens.add(token);
    }

    System.out.println(tokens);
}

How can I do the same thing in Java? And for this sample text, is there a better way to split a paragraph into sentences in Java?

import java.text.BreakIterator;

public class SentenceSplitter {
    public static void main(String[] args) {
        String content = "This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.";
        // Sentence-boundary iterator for the default locale
        BreakIterator bi = BreakIterator.getSentenceInstance();
        bi.setText(content);
        int index = 0;
        // Each call to next() advances to the following sentence boundary
        while (bi.next() != BreakIterator.DONE) {
            String sentence = content.substring(index, bi.current());
            System.out.println(sentence);
            index = bi.current();
        }
    }
}
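
If you want to stay close to the JavaScript regex instead, here is a minimal sketch of one way to port it. Two things differ from the attempt in the question: the `/.../g` delimiters are JavaScript literal syntax and must not be part of the Java pattern string, and the replace callback can be emulated by checking which group matched inside a `find()` loop. The class and method names (`RegexSentenceSplitter`, `split`) are made up for illustration, and only the sample text from the question has been considered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class RegexSentenceSplitter {
    // Same pattern as the JavaScript snippet, without the /.../g delimiters.
    // Group 1: a two-letter abbreviation such as "e.g." (not a boundary).
    // Group 2: sentence-ending punctuation followed by whitespace and a letter.
    private static final Pattern BOUNDARY =
            Pattern.compile("\\b(\\w\\.\\w\\.)|([.?!])\\s+(?=[A-Za-z])");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = BOUNDARY.matcher(text);
        int start = 0;
        while (m.find()) {
            if (m.group(1) != null) {
                continue; // abbreviation matched: keep it, do not split here
            }
            // Keep the punctuation, drop the whitespace that follows it.
            sentences.add(text.substring(start, m.end(2)));
            start = m.end();
        }
        if (start < text.length()) {
            sentences.add(text.substring(start)); // trailing sentence
        }
        return sentences;
    }

    public static void main(String[] args) {
        String content = "This is a long string with some numbers 123.456,78 or 100.000 and e.g. some abbreviations in it, which shouldn't split the sentence. Sometimes there are problems, i.e. in this one. here and abbr at the end x.y.. cool.";
        for (String sentence : split(content)) {
            System.out.println(sentence);
        }
    }
}
```

As in the JavaScript version, the last sentence keeps its trailing period ("cool."), because no whitespace follows it. On Java 9 and later, `Matcher.replaceAll` with a function argument would be a more literal port of the replace callback; the loop above avoids the intermediate "\r" marker and works on older versions too.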

