
How can I separate a string that contains whitespace between words and punctuation into sentences?

I have the following string:

string = "Mr . john bought greatsite . com for 1 . 5 million dollars , i . e . he paid a lot for it . Did he mind ? Steve jones jr . thinks he didn't . In any case , this isn't true ... Well , with a probability of  . 9 it isn't . What a great site ! I really loved it !!! Did you ???"

I need to split it into sentences like this:

Mr . john bought greatsite . com for 1 . 5 million dollars , i . e . he paid a lot for it . 
Did he mind ? 
Steve jones jr . thinks he didn't .
In any case , this isn't true ...
Well , with a probability of  . 9 it isn't . 
What a great site !
I really loved it !!!
Did you ???

and save them into a list of sentences.

I used the following code:

import re

sents = re.split(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s", input_doc2)
print(sents)

The output I get is:

    ['mr .', 'smith bought cheapsite .', 'com for 1 .', '5 million dollars , i .', 'e .', 'he paid a lot for it .', 'did he mind ?', 'adam jones jr .', "thinks he didn't .", "in any case , this isn't true ...", 'well , with a probability of  .', "9 it isn't .", 'what a great movie !', 'i loved it .', 'i loved it !!!', 'did you ???', 'i did .!?', 'not really it was bad !', '']

This is wrong: the text is split after every period, including the ones inside abbreviations, domain names and numbers. Is there a way to fix this?

Thanks in advance.

(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s(?=[A-Z])

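Try this; see the demo. The only change is the trailing lookahead (?=[A-Z]), which allows a split only when the whitespace is followed by an uppercase letter.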

https://regex101.com/r/sH8aR8/3

import re

sents = re.split(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s(?=[A-Z])", input_doc2)
print(sents)
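
As a quick check, here is a minimal sketch that applies the pattern above to the sample string from the question. The names SENTENCE_BOUNDARY and split_sentences are made up for illustration; only the pattern itself and the standard re module come from the post.

```python
import re

# Pattern from the answer: split on a whitespace character that follows
# '.', '?' or '!' and is immediately followed by an uppercase letter.
SENTENCE_BOUNDARY = re.compile(
    r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s(?=[A-Z])"
)


def split_sentences(text):
    """Hypothetical helper: return the pieces of text between boundaries."""
    return SENTENCE_BOUNDARY.split(text)


# Sample string copied from the question.
string = (
    "Mr . john bought greatsite . com for 1 . 5 million dollars , "
    "i . e . he paid a lot for it . Did he mind ? Steve jones jr . "
    "thinks he didn't . In any case , this isn't true ... Well , with a "
    "probability of  . 9 it isn't . What a great site ! I really loved "
    "it !!! Did you ???"
)

sents = split_sentences(string)
for sentence in sents:
    print(sentence)
```

On this input the sketch yields the eight sentences asked for in the question. It relies on every sentence starting with a capital letter, though: fully lowercased text is never split, and a capitalized word right after a spaced abbreviation (for example "Mr . Smith") is still treated as the start of a new sentence.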
