I feel like I'm pretty close with this one, but as soon as I move the punctuation capture to the end of the sentence it miscaptures.
The sentence scenarios are below:
This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. This is a sentence with odd spacing. This is one with lots of exclamation marks at the end!!!!This is another with a decimal 10.00 in the middle. Why is it so hard to find sentence endings?Last sentence without a space at the start.
This should result in captures of:
This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it.
This is a sentence with odd spacing.
This is one with lots of exclamation marks at the end!!!!
This is another with a decimal 10.00 in the middle.
Why is it so hard to find sentence endings?
Last sentence without a space at the start.
This is the expression as I have it:
.*?(?:[!?.;]+)((?<!(Mr|Mrs|Dr|Rev).?)(?=\D|\s+|$)(?:[^!?.;\d]|\d*\.?\d+)*)(?=(?:[!?.;]+))
There are two problems as it stands:
The data going into this will be somewhat normalised, so we know that it will end in a full stop and be on a single line, but any pointers are welcome.
I agree with @spender that a proper parser is the recommended way to do this, since a regex cannot cover every punctuation rule.
However, the following will work for your scenarios:
// s holds the input text. Group 1 is one sentence including its closing punctuation.
foreach (Match m in Regex.Matches(s, @"(.*?(?<!(?:\b[A-Z]|Mrs?|Dr|Rev|\d))[!?.;]+)\s*"))
    Console.WriteLine(m.Groups[1].Value);
Output
This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it.
This is a sentence with odd spacing.
This is one with lots of exclamation marks at the end!!!!
This is another with a decimal 10.00 in the middle.
Why is it so hard to find sentence endings?
Last sentence without a space at the start.
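In case it helps to see how the pieces fit together, here is a rough, self-contained sketch of the same idea. The class and method names (`SentenceSplitter`, `Split`, `Demo`) are my own placeholders, and the pattern is meant to be equivalent to the one-liner above, only spread out with `RegexOptions.IgnorePatternWhitespace` so each part can carry a comment.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class SentenceSplitter
{
    private static readonly Regex SentencePattern = new Regex(@"
        (                   # group 1: one sentence with its closing punctuation
          .*?               # as few characters as possible
          (?<!              # the punctuation must NOT be preceded by
            (?:\b[A-Z]      #   a lone capital letter (an initial)
              |Mrs?|Dr|Rev  #   or one of these titles
              |\d)          #   or a digit (keeps decimals together)
          )
          [!?.;]+           # one or more closing punctuation marks
        )
        \s*                 # swallow whitespace before the next sentence
        ", RegexOptions.IgnorePatternWhitespace);

    // Returns each sentence (group 1 of every match) in order of appearance.
    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        foreach (Match m in SentencePattern.Matches(text))
            sentences.Add(m.Groups[1].Value);
        return sentences;
    }
}

public static class Demo
{
    public static void Main()
    {
        // Assumed sample input, shortened from the question.
        string s = "Meet Mr. D. Smith and Mr J. Smith. Lots of marks!!!!A decimal 10.00 here. Why?Last one.";
        foreach (string sentence in SentenceSplitter.Split(s))
            Console.WriteLine(sentence);
    }
}
```

Two caveats with this approach: the lookbehind also blocks genuine sentence endings that happen to follow a digit or a lone capital letter (for example "It happened in 2010." or "Take vitamin C."), so those would be merged with the next sentence. The title list is also case-sensitive and only covers the four titles shown. That is part of why a parser is the safer route for arbitrary text.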