How can I properly count the number of sentences from the file in Java using the split method?
I have a paragraph of text in my file. It contains 7 sentences, but my code reports 9.
String path = "C:/CT_AQA - Copy/src/main/resources/file.txt";
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path)));
String line;
int countWord = 0;
int sentenceCount = 0;
int characterCount = 0;
int paragraphCount = 0;
int countNotLetter = 0;
int letterCount = 0;
int wordInParagraph = 0;
List<Integer> wordsPerParagraph = new ArrayList<>();
while ((line = br.readLine()) != null) {
    if (line.equals("")) {
        paragraphCount++;
        wordsPerParagraph.add(wordInParagraph);
        System.out.printf("In %d paragraph there are %d words\n", paragraphCount, wordInParagraph);
        wordInParagraph = 0;
    } else {
        characterCount += line.length();
        String[] wordList = line.split("[\\s—]");
        countWord += wordList.length;
        wordInParagraph += wordList.length;
        String[] letterList = line.split("[^a-zA-Z]");
        countNotLetter += letterList.length;
        String[] sentenceList = line.split("[.:]");
        sentenceCount += sentenceList.length;
    }
    letterCount = characterCount - countNotLetter;
}
if (wordInParagraph != 0) {
    wordsPerParagraph.add(wordInParagraph);
}
br.close();
System.out.println("The amount of words are " + countWord);
System.out.println("The amount of sentences are " + sentenceCount);
System.out.println("The amount of paragraphs are " + paragraphCount);
System.out.println("The amount of letters are " + letterCount);
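Your code looks like it works, although it does not follow best practices everywhere.
I suspect the root cause of the wrong answer is that the regular expression used to detect sentence endings is inaccurate. Your code counts sentences that end with a dot or a colon. The problem is in this line: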
String[] sentenceList = line.split("[.:]");
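But a colon does not end a sentence, and sentences can also end with other characters (exclamation marks, question marks, ellipses). In my assessment this pattern is more accurate: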
"[!?.]+(?=$|\\s)"
Please also show the contents of the file for which you get the wrong result. Then my assumption can be confirmed or ruled out.
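To make the difference between the two patterns concrete, here is a small sketch that applies both to a few sample lines. The class name, method name, and sample strings are my own illustrations, not taken from the asker's file.

```java
class SplitPatternDemo {
    // Pattern from the question: splits on every dot or colon.
    static final String ORIGINAL = "[.:]";
    // Pattern suggested above: a run of '!', '?' or '.' that is followed
    // by whitespace or by the end of the line.
    static final String SUGGESTED = "[!?.]+(?=$|\\s)";

    // Number of pieces String.split produces for the given regex.
    // Trailing empty strings are dropped by split, so a terminator at the
    // very end of the line does not add an extra piece.
    static int countPieces(String line, String regex) {
        return line.split(regex).length;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines chosen to exercise the problem cases.
        String[] samples = {
            "Note: this is one sentence.",
            "Wait... Really?! Yes.",
            "Pi is about 3.14."
        };
        for (String s : samples) {
            System.out.printf("%-30s original=%d suggested=%d%n",
                    s, countPieces(s, ORIGINAL), countPieces(s, SUGGESTED));
        }
    }
}
```

For these samples the original pattern yields 2, 4 and 2 pieces, while the suggested pattern yields 1, 3 and 1. The colon, the ellipsis and the decimal point are each enough to inflate the original count.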
Complete code that counts only the number of sentences in the file:
int sentenceCount = 0;
while ((line = br.readLine()) != null) {
    if (!"".equals(line)) {
        String[] sentencesArray = line.split("[!?.]+(?=$|\\s)");
        sentenceCount += sentencesArray.length;
    }
}
br.close();
System.out.println("The amount of sentences are " + sentenceCount);
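The snippet above relies on the reader and the line variable from the question. A self-contained variant might look like the sketch below. The class and method names are hypothetical, the sample text is made up, and a StringReader stands in for the file so the example runs without one. It also skips whitespace-only lines, which is a small change from the snippet above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

class SentenceCounter {
    // Same pattern as above: sentence-ending punctuation followed by
    // whitespace or the end of the line.
    private static final String SENTENCE_END = "[!?.]+(?=$|\\s)";

    // Counts sentences line by line, skipping blank lines.
    // Like the original, it treats a sentence as contained in one line.
    static int countSentences(BufferedReader reader) {
        return reader.lines()
                .filter(line -> !line.trim().isEmpty())
                .mapToInt(line -> line.split(SENTENCE_END).length)
                .sum();
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample text: two paragraphs separated by an empty line.
        String text = "First sentence. Second one!\n\nIs this the third? Yes.";
        // try-with-resources closes the reader even if counting fails.
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            System.out.println("The number of sentences is " + countSentences(reader));
        }
    }
}
```

To read a real file, the StringReader could be replaced with java.nio.file.Files.newBufferedReader(Paths.get(path)), with path pointing at the asker's file.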
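You are probably picking up trailing whitespace in your sentences, which adds extra values to the array. You can use replaceAll("\\s+", "") to remove the whitespace from the line before you split it into sentences.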
The changed code looks like this:
String[] sentenceList = line.replaceAll("\\s+","").split("[.:]");
However, I did not change what you define as a sentence; ! and ? can obviously be sentence separators too.
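The effect of trailing whitespace can be reproduced with a short sketch. The class name, method names, and sample line are illustrative assumptions, not the asker's actual data.

```java
class TrailingWhitespaceDemo {
    // Split exactly as in the question.
    static int countRaw(String line) {
        return line.split("[.:]").length;
    }

    // Remove all whitespace first, as suggested above, then split.
    static int countStripped(String line) {
        return line.replaceAll("\\s+", "").split("[.:]").length;
    }

    public static void main(String[] args) {
        // Hypothetical line with a space after the final dot.
        String line = "One. Two. ";
        // The trailing " " is a non-empty piece, so split keeps it.
        System.out.println("raw      = " + countRaw(line));
        // With whitespace removed, the last piece is empty and is dropped.
        System.out.println("stripped = " + countStripped(line));
    }
}
```

For this sample line the raw count is 3 and the stripped count is 2. Note that removing all whitespace is only suitable for counting, since the resulting pieces no longer contain readable sentences.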