Split a string into sentences

Question

I have written this piece of code, which splits a string and stores the pieces in a string array:

String[] sSentence = sResult.split("[a-z]\\.\\s+");

I added [a-z] because I wanted to deal with some abbreviation issues. But then my result looks like this:

Furthermore, when Everett tried to teach them basic mathematics, they proved unresponsiv

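I see that I lose the pattern specified in the split function. Losing the period is fine with me, but losing the last letter of the word distorts its meaning.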

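Could someone help me solve this? In addition, could someone help me deal with abbreviations? Because I split the string on periods, I do not want to lose the abbreviations.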

Answer 1

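Parsing sentences is far from a trivial task, even for languages written in the Latin alphabet, such as English. A naive approach like the one you outline in your question will fail often enough that it will prove useless in practice.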

A better approach is to use a BreakIterator configured with the right Locale.

BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
iterator.setText(source);
int start = iterator.first();
for (int end = iterator.next();
    end != BreakIterator.DONE;
    start = end, end = iterator.next()) {
  System.out.println(source.substring(start,end));
}

This yields the following result:

This is a test.
This is a T.L.A. test.
Now with a Dr. in it.
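The BreakIterator loop above can be wrapped in a small helper that returns a list. BreakIterator has no abbreviation dictionary, so it may still break after a title such as "Mr." when a capitalized word follows. The sketch below re-joins such pieces using a tiny abbreviation set. The class name SentenceSplitter, the method split, and the ABBREVIATIONS set are hypothetical names introduced for illustration only, and the list is far from complete.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SentenceSplitter {

    // Hypothetical, deliberately tiny abbreviation list; extend it for real text.
    private static final Set<String> ABBREVIATIONS = Set.of("Mr.", "Mrs.", "Dr.");

    /** Splits text with BreakIterator, then re-joins pieces that end in a known abbreviation. */
    public static List<String> split(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);

        List<String> sentences = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE; start = end, end = iterator.next()) {
            pending.append(text, start, end);
            // Keep accumulating while the piece ends in an abbreviation.
            if (!endsWithAbbreviation(pending)) {
                addIfNotBlank(sentences, pending);
                pending.setLength(0);
            }
        }
        // Flush whatever is left, e.g. when the text itself ends in an abbreviation.
        addIfNotBlank(sentences, pending);
        return sentences;
    }

    private static void addIfNotBlank(List<String> sentences, CharSequence piece) {
        String trimmed = piece.toString().trim();
        if (!trimmed.isEmpty()) {
            sentences.add(trimmed);
        }
    }

    private static boolean endsWithAbbreviation(CharSequence piece) {
        String trimmed = piece.toString().trim();
        String lastWord = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
        return ABBREVIATIONS.contains(lastWord);
    }

    public static void main(String[] args) {
        String source = "My friend, Mr. Jones, has a new dog. This is a test.";
        for (String sentence : split(source)) {
            System.out.println(sentence);
        }
    }
}
```

The trade-off is that a sentence that really ends in a listed abbreviation gets merged with the sentence after it, so this only helps when such endings are rare in your text.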

Answer 2

It is hard to make a regular expression work in all cases, but to fix your immediate problem you can use a lookbehind:

String sResult = "This is a test. This is a T.L.A. test.";
String[] sSentence = sResult.split("(?<=[a-z])\\.\\s+");

Result:

This is a test
This is a T.L.A. test.

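Note that there are abbreviations that do not end with a capital letter, such as abbrev., Mr., etc... And there are also sentences that do not end with a period!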
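To make the difference concrete, here is a minimal sketch comparing the pattern from the question with a lookbehind variant. As an assumption beyond the answer above, the variant also moves the period into the lookbehind, so that only the whitespace is consumed and each sentence keeps its period. The class and method names are hypothetical.

```java
import java.util.Arrays;

public class LookbehindSplitDemo {

    /** Pattern from the question: the letter and the period belong to the delimiter, so both are dropped. */
    static String[] consumingSplit(String text) {
        return text.split("[a-z]\\.\\s+");
    }

    /** Lookbehind on letter plus period: only the whitespace is the delimiter, so the period stays. */
    static String[] keepPeriodSplit(String text) {
        return text.split("(?<=[a-z]\\.)\\s+");
    }

    public static void main(String[] args) {
        String sResult = "This is a test. This is a T.L.A. test.";
        System.out.println(Arrays.toString(consumingSplit(sResult)));
        System.out.println(Arrays.toString(keepPeriodSplit(sResult)));
        // Lowercase abbreviations still break the heuristic: this splits after "Mr."
        System.out.println(Arrays.toString(keepPeriodSplit("My friend, Mr. Jones, has a new dog.")));
    }
}
```

The last call shows the limitation mentioned above: "Mr." ends in a lowercase letter followed by a period, so the text is split in the middle of the sentence.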

Answer 3

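If you can, use a natural language processing tool such as LingPipe. There are many subtleties that are very hard to catch with regular expressions, e.g. (e.g. :-)), Mr., abbreviations, ellipsis (...), et cetera.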

The LingPipe website has a very easy-to-follow tutorial on sentence detection.

Answer 4

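A late response, but for future visitors like me who arrive after a long search: using OpenNLP models was my best choice. It works on all the text samples here, including the critical one mentioned by @nbz in the comments,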

My friend, Mr. Jones, has a new dog. This is a test. This is a T.L.A. test. Now with a Dr. in it.

Separated by line breaks:

My friend, Mr. Jones, has a new dog.
This is a test.
This is a T.L.A. test.
Now with a Dr. in it.

You need to import the .jar library into your project, along with the trained model en-sent.bin.

Here is a tutorial that gets you set up and running quickly:

https://www.tutorialkart.com/opennlp/sentence-detection-example-in-opennlp/

And another one for setting it up in Eclipse:

https://www.tutorialkart.com/opennlp/how-to-setup-opennlp-java-project/

This is what the code looks like:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
 
import opennlp.tools.util.InvalidFormatException;
 
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
 
/**
* Sentence Detection Example in openNLP using Java
* @author tutorialkart
*/
public class SentenceDetectExample {
 
    public static void main(String[] args) {
        try {
            new SentenceDetectExample().sentenceDetect();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
 
    /**
     * This method is used to detect sentences in a paragraph/string
     * @throws InvalidFormatException
     * @throws IOException
     */
    public void sentenceDetect() throws InvalidFormatException, IOException {
        String paragraph = "This is a statement. This is another statement. Now is an abstract word for time, that is always flying.";
 
        // refer to model file "en-sent.bin", available at link http://opennlp.sourceforge.net/models-1.5/
        // try-with-resources closes the stream even if loading the model fails
        try (InputStream is = new FileInputStream("en-sent.bin")) {
            SentenceModel model = new SentenceModel(is);

            // feed the model to SentenceDetectorME class
            SentenceDetectorME sdetector = new SentenceDetectorME(model);

            // detect sentences in the paragraph
            String[] sentences = sdetector.sentDetect(paragraph);

            // print the sentences detected, to console
            for (String sentence : sentences) {
                System.out.println(sentence);
            }
        }
    }
}

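Because you bundle the library with your project, it also works offline, which is a big advantage. As the correct answer by @Julien Silland says, this is not a straightforward process, and letting a trained model do it for you is the best option.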

Answer 1: score 58, accepted, 2010-04-22 02:42:36
Answer 2: score 12, 2010-04-21 22:32:19
Answer 3: score 4, 2010-04-21 22:43:55
Answer 4: score 2, 2021-01-22 13:50:14
