Regex To Split String Into Sentences
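I need to split a string containing sentences, for example:
"this is a sentence. this is another. Rawlings, G. stated foo and bar."
into
["this is a sentence.", "this is another.", "Rawlings, G. stated foo and bar."]
using a regular expression.
The other solutions I have found split the third sentence into "Rawlings, G." and "stated foo and bar.", which is not what I want.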
Regular expressions generally cannot solve this problem.
You need a sentence detection algorithm; OpenNLP has one.
It is simple to use:
String sentences[] = sentenceDetector.sentDetect(yourString);
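and it handles many tricky cases.
You can do this with nested lookbehinds.
Just split the input string on the regex below. It matches the boundary immediately after a period and checks the character before that period: the split happens only when the character preceding the period is not an uppercase letter.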
String s = "this is a sentence. this is another. Rawlings, G. stated foo and bar.";
String[] tok = s.split("(?<=(?<![A-Z])\\.)");
System.out.println(Arrays.toString(tok));
Output:
[this is a sentence., this is another., Rawlings, G. stated foo and bar.]
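One detail worth noting about the split above: because the boundary is zero-width, the space that follows each period stays attached to the start of the next token. A minimal sketch, assuming you want that whitespace dropped, is to append \s* to the same pattern so the split consumes it. The class and method names here are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;

class TrimmedSentenceSplit {
    // Same nested-lookbehind boundary as in the answer, followed by \s* so the
    // whitespace after the period is consumed by the split instead of kept.
    static String[] splitSentences(String text) {
        return text.split("(?<=(?<![A-Z])\\.)\\s*");
    }

    public static void main(String[] args) {
        String s = "this is a sentence. this is another. Rawlings, G. stated foo and bar.";
        // Print each token in brackets so leading or trailing spaces would be visible.
        for (String sentence : splitSentences(s)) {
            System.out.println("[" + sentence + "]");
        }
        System.out.println(Arrays.toString(splitSentences(s)));
    }
}
```

Keep in mind that this is still a heuristic: a sentence that ends in a capital letter, such as "I live in the USA. It is big.", is not split at all, because the period after "USA" is preceded by an uppercase letter.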
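Explanation:
(?<=(?<![A-Z])\\.)
matches the boundary immediately after a period, provided the character before that period is not an uppercase letter.
I tried this: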
import java.text.BreakIterator;
import java.util.Locale;

public class StringSplit {
    public static void main(String args[]) throws Exception {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        String source = "This is a sentence. This is another. Rawlings, G. stated foo and bar.";
        iterator.setText(source);
        int start = iterator.first();
        for (int end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            System.out.println(source.substring(start, end));
        }
    }
}
The output is:
This is a sentence.
This is another.
Rawlings, G. stated foo and bar.
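If you want the sentences as a collection rather than printed lines, the same BreakIterator loop can be wrapped in a small helper. This is only a sketch: the class name BreakIteratorSentences and the method sentences are hypothetical, and the trim() call is an assumption that you do not want any whitespace a segment may carry at its edges.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BreakIteratorSentences {
    // Walks the sentence boundaries reported by BreakIterator and collects
    // each segment, trimmed, into a list. Empty segments are skipped.
    static List<String> sentences(String source, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(source);
        List<String> result = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            String sentence = source.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String source = "This is a sentence. This is another. Rawlings, G. stated foo and bar.";
        for (String sentence : sentences(source, Locale.US)) {
            System.out.println(sentence);
        }
    }
}
```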