What is a regular expression for parsing out individual sentences?

Question

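I am looking for a good .NET regular expression that I can use to parse individual sentences out of a body of text.

It should be able to parse the following block of text into exactly six sentences: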

Hello world! How are you? I am fine.
This is a difficult sentence because I use I.D.

Newlines should also be accepted. Numbers should not cause  
sentence breaks, like 1.23.

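This is proving more challenging than I originally thought.

Any help would be greatly appreciated. I am going to use this to train the system on known bodies of text.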

Answer 1

Try this: @"(\S.+?[.!?])(?=\s+|$)"

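// Requires: using System; using System.Text.RegularExpressions;
string str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";

// A sentence starts at a non-whitespace character and ends at the first '.', '!' or '?'
// that is followed by whitespace or the end of the input.
Regex rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)");
foreach (Match match in rx.Matches(str)) {
    Console.WriteLine(match.Value);
}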

Result:

Hello world!
How are you?
I am fine.
This is a difficult sentence because I use I.D.
Newlines should also be accepted.
Numbers should not cause sentence breaks, like 1.23.
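
One thing to watch: the sample string above keeps every sentence on a single line, whereas the question's text breaks its last sentence across two lines. Without RegexOptions.Singleline, `.` does not match `\n`, so a sentence that wraps would be cut short. A minimal sketch, assuming the same pattern with that option added (the class and method names are illustrative only):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SinglelineSketch
{
    // Same pattern as above; Singleline lets '.' match '\n' as well.
    static readonly Regex Rx =
        new Regex(@"(\S.+?[.!?])(?=\s+|$)", RegexOptions.Singleline);

    public static List<string> Sentences(string text)
    {
        var result = new List<string>();
        foreach (Match m in Rx.Matches(text))
        {
            // Collapse whitespace runs (including line breaks) to one space.
            result.Add(Regex.Replace(m.Value, @"\s+", " "));
        }
        return result;
    }

    static void Main()
    {
        // The question's text, with the line break inside the last sentence.
        string text = "Hello world! How are you? I am fine.\n" +
                      "This is a difficult sentence because I use I.D.\n\n" +
                      "Newlines should also be accepted. Numbers should not cause  \n" +
                      "sentence breaks, like 1.23.";

        foreach (string s in Sentences(text))
            Console.WriteLine(s);
    }
}
```

With the option set, the wrapped sentence comes back as one match, and the whitespace cleanup is only there to make the printed output readable.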

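Of course, for anything complex you will need a real parser such as SharpNLP or NLTK. Mine is just a quick and dirty one.

Here is the SharpNLP info and its features:

SharpNLP is a collection of natural language processing tools written in C#. Currently it provides the following NLP tools:

a sentence splitter
a tokenizer
a part-of-speech tagger
a chunker (used to "find non-recursive syntactic annotations such as noun phrase chunks")
a parser
a name finder
a coreference tool
an interface to the WordNet lexical database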

Answer 2

var str = @"Hello world! How are you? I am fine. This is a difficult sentence because I use I.D.
Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23.";

Regex.Split(str, @"(?<=[.?!])\s+").Dump();

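I tested this in LINQPad. Dump() is a LINQPad extension method, so in a regular program you would print the array elements yourself.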
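
For use outside LINQPad, here is a minimal console sketch of the same split, run on the question's original multi-line text. The class and method names are illustrative, and the Trim() call is an addition so that trailing whitespace does not produce an empty last element:

```csharp
using System;
using System.Text.RegularExpressions;

static class SplitSketch
{
    public static string[] Sentences(string text)
    {
        // Split on whitespace that directly follows '.', '?' or '!'.
        // Whitespace elsewhere (even a line break) stays inside the sentence.
        return Regex.Split(text.Trim(), @"(?<=[.?!])\s+");
    }

    static void Main()
    {
        string text = "Hello world! How are you? I am fine.\n" +
                      "This is a difficult sentence because I use I.D.\n\n" +
                      "Newlines should also be accepted. Numbers should not cause  \n" +
                      "sentence breaks, like 1.23.";

        foreach (string s in Sentences(text))
            Console.WriteLine("[" + s + "]");
    }
}
```

Because the split only happens after a terminator, the line break inside the last sentence is kept as part of that sentence rather than treated as a boundary.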

Answer 3

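It is impossible to parse natural language with regular expressions. What is the end of a sentence? A period can occur in many places (e.g. "e.g."). You should use a natural language parsing toolkit such as OpenNLP or NLTK. Unfortunately there are very few, if any, offerings in C#. You may therefore have to create a web service or otherwise link into C#.

Note that relying on the exact whitespace in "I.D." will cause problems in the future. You will soon find examples that break your regex. For example, most people put spaces after their initials.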
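
To make the point about initials concrete, here is a small sketch that runs the pattern from Answer 1 on a sentence with spaced initials. The sample text and the class name are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class InitialsSketch
{
    // Pattern from Answer 1: a terminator followed by whitespace ends a sentence.
    static readonly Regex Rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)");

    public static List<string> Sentences(string text)
    {
        var result = new List<string>();
        foreach (Match m in Rx.Matches(text))
            result.Add(m.Value);
        return result;
    }

    static void Main()
    {
        // Two sentences for a human reader, but the initials are followed by spaces.
        foreach (string s in Sentences("My name is J. R. Smith. I live here."))
            Console.WriteLine(s);
        // Prints three pieces:
        //   My name is J.
        //   R. Smith.
        //   I live here.
    }
}
```

The pattern has no way to tell that the period after "J" is part of a name, which is exactly the kind of input that keeps turning up once real text is fed in.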

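There is a good summary of open and commercial offerings on Wikipedia ( http://en.wikipedia.org/wiki/Natural_language_processing_toolkits ). We have used several of them. It is worth the effort.

[You use the word "train". This is normally associated with machine learning (which is one approach to NLP and has been used for sentence splitting). Indeed, the toolkits I mentioned include machine learning. I suspect that is not what you meant, but rather that you would evolve your expression through heuristics. Don't!]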

Answer 4

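This is impossible with regular expressions alone, unless you know exactly which "difficult" tokens you have, such as "I.D.", "Mr.", etc. For example, how many sentences is "Please show your I.D., Mr. Bond."? I am not familiar with any C# implementations, but I have used NLTK's Punkt tokenizer. It probably would not be too hard to re-implement.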
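
If the "difficult" tokens are known up front, one possible approach is to split with a simple pattern first and then glue back any piece that ends in a listed abbreviation. The sketch below assumes a small hypothetical abbreviation list and reuses the pattern from Answer 1. It is not Punkt, and it would wrongly merge a sentence that really does end in one of the listed abbreviations:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class AbbreviationSketch
{
    // Hypothetical list; extend it with the tokens that occur in your own text.
    static readonly HashSet<string> Abbreviations =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Mr.", "Mrs.", "Ms.", "Dr." };

    // Candidate pattern taken from Answer 1.
    static readonly Regex Candidate = new Regex(@"(\S.+?[.!?])(?=\s+|$)");

    public static List<string> Sentences(string text)
    {
        var sentences = new List<string>();
        string pending = "";
        foreach (Match m in Candidate.Matches(text))
        {
            // Re-join with a single space; the original spacing is not preserved.
            pending = pending.Length == 0 ? m.Value : pending + " " + m.Value;

            // Last space-separated token of what has been collected so far.
            string lastToken = pending.Substring(pending.LastIndexOf(' ') + 1);
            if (!Abbreviations.Contains(lastToken))
            {
                sentences.Add(pending);
                pending = "";
            }
        }
        if (pending.Length > 0) sentences.Add(pending); // text ended on an abbreviation
        return sentences;
    }

    static void Main()
    {
        foreach (string s in Sentences("Please show your I.D., Mr. Bond. Dr. No was not amused."))
            Console.WriteLine(s);
    }
}
```

On the sample in Main this yields two sentences, with "Mr. Bond." and "Dr. No" kept intact. "I.D." is deliberately left out of the list, since in the question's text it ends a sentence.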

Answer 5

I used the suggestions posted here and came up with a regex that seems to achieve what I want to do:

(?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)

I used Expresso to come up with:

//  using System.Text.RegularExpressions;
/// <summary>
///  Regular expression built for C# on: Sun, Dec 27, 2009, 03:05:24 PM
///  Using Expresso Version: 3.0.3276, http://www.ultrapico.com
///  
///  A description of the regular expression:
///  
///  [Sentence]: A named capture group. [\S.+?(?<Terminator>[.!?]|\Z)]
///      \S.+?(?<Terminator>[.!?]|\Z)
///          Anything other than whitespace
///          Any character, one or more repetitions, as few as possible
///          [Terminator]: A named capture group. [[.!?]|\Z]
///              Select from 2 alternatives
///                  Any character in this class: [.!?]
///                  End of string or before new line at end of string
///  Match a suffix but exclude it from the capture. [\s+|\Z]
///      Select from 2 alternatives
///          Whitespace, one or more repetitions
///          End of string or before new line at end of string
///  
///
/// </summary>
public static Regex regex = new Regex(
      "(?<Sentence>\\S.+?(?<Terminator>[.!?]|\\Z))(?=\\s+|\\Z)",
    RegexOptions.CultureInvariant
    | RegexOptions.IgnorePatternWhitespace
    | RegexOptions.Compiled
    );


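// This is the replacement string. The pattern above defines no Day, Month or
// Year groups, so this string does not apply to the sentence regex.
public static string regexReplace = 
      "$& [${Day}-${Month}-${Year}]";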


//// Replace the matched text in the InputText using the replacement pattern
// string result = regex.Replace(InputText,regexReplace);

//// Split the InputText wherever the regex matches
// string[] results = regex.Split(InputText);

//// Capture the first Match, if any, in the InputText
// Match m = regex.Match(InputText);

//// Capture all Matches in the InputText
// MatchCollection ms = regex.Matches(InputText);

//// Test to see if there is a match in the InputText
// bool IsMatch = regex.IsMatch(InputText);

//// Get the names of all the named and numbered capture groups
// string[] GroupNames = regex.GetGroupNames();

//// Get the numbers of all the named and numbered capture groups
// int[] GroupNumbers = regex.GetGroupNumbers();
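
As a rough illustration of how the two named groups could be read back, here is a minimal sketch using the same pattern. The class name, method name and sample text are illustrative only. Note that the `\Z` alternative lets a final sentence without punctuation match, in which case Terminator is empty:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NamedGroupSketch
{
    static readonly Regex Rx = new Regex(
        @"(?<Sentence>\S.+?(?<Terminator>[.!?]|\Z))(?=\s+|\Z)",
        RegexOptions.CultureInvariant);

    public static List<(string Sentence, string Terminator)> Parse(string text)
    {
        var result = new List<(string Sentence, string Terminator)>();
        foreach (Match m in Rx.Matches(text))
        {
            // Named groups are looked up by name on each match.
            result.Add((m.Groups["Sentence"].Value, m.Groups["Terminator"].Value));
        }
        return result;
    }

    static void Main()
    {
        foreach (var item in Parse("First one. Second one without end"))
            Console.WriteLine("\"" + item.Sentence + "\" terminator: \"" + item.Terminator + "\"");
    }
}
```

An empty Terminator is a convenient signal that the text was cut off or that the last sentence is unpunctuated.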

Answer 6

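Most people have recommended using SharpNLP, and you should do so unless you want to hand your QA department a pile of bugs.

But since you are probably up against some kind of pressure, here is another attempt at handling words like "Dr." and "X.". However, it will fail on a sentence that ends in "it".

Hello world! How are you? I am fine. This is a difficult sentence because I use I.D. Newlines should also be accepted. Numbers should not cause sentence breaks, like 1.23. See Dr. B. or Mr. FooBar for an evaluation of H. pylori in the cardia.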

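    // 'input' holds the sample text above. Requires System.Linq and System.Text.RegularExpressions.
    // The lookbehind rejects a period that follows a word of 1 to 3 letters (e.g. "Dr.", "B.").
    // The period is escaped; an unescaped '.' would also reject endings such as "you?".
    var rx = new Regex(@"(\S.+?[.!?])(?=\s+|$)(?<!\s[A-Za-z]{1,3}\.)");
    var result = rx.Matches(input).Cast<Match>().Select(m => m.Value).ToArray();
    foreach (var sentence in result) 
    {
        Console.WriteLine(sentence);
    }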
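
The limitation mentioned above can be seen in a short sketch. It assumes the period in the lookbehind is escaped as `\.`, and the class name and sample sentences are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ShortWordSketch
{
    // A period right after a 1-3 letter word is never accepted as a sentence end.
    static readonly Regex Rx =
        new Regex(@"(\S.+?[.!?])(?=\s+|$)(?<!\s[A-Za-z]{1,3}\.)");

    public static List<string> Sentences(string text)
    {
        var result = new List<string>();
        foreach (Match m in Rx.Matches(text))
            result.Add(m.Value);
        return result;
    }

    static void Main()
    {
        // Abbreviations are handled: prints two sentences.
        foreach (string s in Sentences("See Dr. B. today. It went well."))
            Console.WriteLine(s);

        // A sentence ending in a short word is not: prints one merged line.
        foreach (string s in Sentences("I saw it. Then I left quickly."))
            Console.WriteLine(s);
    }
}
```

The same rule that protects "Dr." and "B." also swallows the period after "it", so the two sentences in the second sample come back as a single match.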

6 answers

Answer 1: score 39, accepted, 2009-12-20 17:20:11
Answer 2: score 5, 2009-12-20 17:24:08
Answer 3: score 5, 2009-12-20 17:29:47
Answer 4: score 2, 2009-12-20 17:23:38
Answer 5: score 0, 2009-12-27 13:07:19
Answer 6: score 0, 2016-01-05 19:44:00
