Regex to match sentence with decimals and names

I feel like I'm pretty close with this one, but as soon as I move the punctuation capture to the end of the sentence it miscaptures.

The sentence scenarios are below:

This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. This is a  sentence      with odd   spacing. This is one with lots of exclamation marks at the end!!!!This is another with a decimal 10.00 in the middle. Why is it so hard to find sentence endings?Last sentence without a space at the start.

This should result in captures of:

This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. 
This is a  sentence      with odd   spacing. 
This is one with lots of exclamation marks at the end!!!!
This is another with a decimal 10.00 in the middle. 
Why is it so hard to find sentence endings?
Last sentence without a space at the start.

This is the expression as I have it:

.*?(?:[!?.;]+)((?<!(Mr|Mrs|Dr|Rev).?)(?=\D|\s+|$)(?:[^!?.;\d]|\d*\.?\d+)*)(?=(?:[!?.;]+))

There are two problems as it stands:

  1. The punctuation is at the start
  2. It correctly handles one name per sentence but not two. (For bonus points I'd like it to correctly capture "Mr DJ Smith", but I can't work out how to do that without also breaking sentences that end with a single letter.)

The data going into this will be somewhat normalised, so we know that it will end in a full stop and be on a single line, but any pointers are welcome.

I agree with @spender that a proper parser is the recommended way to do this, since it can account for all the punctuation rules.

However, the following will work for your scenarios.

// s holds the input text; needs using System.Text.RegularExpressions;
foreach (Match m in Regex.Matches(s, @"(.*?(?<!(?:\b[A-Z]|Mrs?|Dr|Rev|\d))[!?.;]+)\s*"))
    Console.WriteLine(m.Groups[1].Value);

Output

This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. 
This is a  sentence      with odd   spacing. 
This is one with lots of exclamation marks at the end!!!!
This is another with a decimal 10.00 in the middle. 
Why is it so hard to find sentence endings?
Last sentence without a space at the start.
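
For anyone who wants to run this locally, here is a minimal self-contained sketch of the same approach with the pattern annotated piece by piece. The class and method names (SentenceSplitter, Split) are my own and purely illustrative; only the regex itself comes from the snippet above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SentenceSplitter
{
    // Same pattern as in the snippet above. Reading it left to right:
    //   .*?          lazily consume the body of the sentence
    //   (?<! ... )   refuse to stop here if the text just before the terminator
    //                is a lone capital letter (\b[A-Z]), a title (Mr, Mrs, Dr, Rev)
    //                or a digit (\d), so "Mr.", "D." and "10.00" do not split
    //   [!?.;]+      one or more terminators, kept inside group 1
    //   \s*          trailing whitespace, matched but left out of group 1
    private static readonly Regex SentencePattern =
        new Regex(@"(.*?(?<!(?:\b[A-Z]|Mrs?|Dr|Rev|\d))[!?.;]+)\s*");

    // Hypothetical helper, not part of the original answer.
    public static List<string> Split(string text)
    {
        var sentences = new List<string>();
        foreach (Match m in SentencePattern.Matches(text))
            sentences.Add(m.Groups[1].Value);
        return sentences;
    }
}

static class Program
{
    static void Main()
    {
        string s = "This is a sentence with a name like Mr. D. Smith and Mr J. Smith in it. "
                 + "This is a  sentence      with odd   spacing. "
                 + "This is one with lots of exclamation marks at the end!!!!"
                 + "This is another with a decimal 10.00 in the middle. "
                 + "Why is it so hard to find sentence endings?"
                 + "Last sentence without a space at the start.";

        // Brackets make it visible that group 1 carries no trailing whitespace.
        foreach (string sentence in SentenceSplitter.Split(s))
            Console.WriteLine("[" + sentence + "]");
    }
}
```

Two limitations follow directly from the lookbehind, so treat this as a sketch rather than a general solution: a sentence that genuinely ends in a digit or a single capital letter ("It costs 10." or "We went with plan B.") will not be split at that point and gets merged with the next one, and any text after the last terminator is not matched at all. That is part of why a real parser is the safer choice for arbitrary input.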

