
Regex excluding words

Hi all, I am new to regex.

I have a string in which "etc." is treated as the end of a sentence. How can I change the existing regex so that "etc." is not treated as the end of a sentence?

sentence: 'hello how are you, can you pass me pen, book etc. I am going to travel abroad. I am going on vacation. Let me know if anything needs to be done in something.com.'; 
regex: (/(.*?(?:\.|\?|!))(?: |$)/g);

Current Output :

  • ["hello how are you, can you pass me pen, book etc. ", "I am going to travel abroad. ", "I am going on vacation. ", "Let me know if anything needs to be done in something.com."]

Expected Output:

  • ["hello how are you, can you pass me pen, book etc. I am going to travel abroad. ", "I am going on vacation. ", "Let me know if anything needs to be done in something.com."]


The example is an exceptionally difficult case, because "etc." followed by a capital letter would also be a valid end of a sentence.

Looking ahead not only for the end of the line but also for a capital letter as the next character would catch most cases:

var sentences = stringSentence.match(/(.*?(?:[.?!])\s*)(?=([A-Z])|$)/g);

In this example, though, it would still break, because "I" is a capital letter. If a comma and/or a word such as "because" were added after "etc.", the match would work (and the sentence would be grammatically more correct).
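To make the behaviour concrete, here is a small sketch that runs the lookahead pattern on the sentence from the question, once as written and once with a hypothetical rewording ("etc., and I am") that is my own assumption, not part of the question. The capture groups are dropped because only the whole match is used.

```javascript
// Sample text from the question.
const text =
  "hello how are you, can you pass me pen, book etc. I am going to travel abroad. " +
  "I am going on vacation. Let me know if anything needs to be done in something.com.";

// A sentence ends at . ? or ! only when a capital letter or the end of input follows.
const sentenceRe = /.*?[.?!]\s*(?=[A-Z]|$)/g;

// As written, "etc. I" still splits, because "I" is a capital letter.
const asWritten = text.match(sentenceRe);

// Hypothetical rewording: a comma after "etc." removes the capital-letter boundary.
const reworded = text.replace("etc. I am", "etc., and I am").match(sentenceRe);

console.log(asWritten.length); // 4 sentences
console.log(reworded.length);  // 3 sentences
console.log(reworded[0]);
```

Note that "something.com." stays in one piece in both runs, since the dot inside the domain is followed by a lowercase letter.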

If that is not enough, exceptions could be added for known abbreviations. The problem is that an abbreviation can itself be at the end of a sentence. For example, "I am going on vacation to relax etc." should match as a complete sentence.
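One way to sketch the exception idea is to keep the regex from the question and merge any piece that ends in a known abbreviation with the piece that follows it. The `ABBREVIATIONS` list and the `splitSentences` helper are hypothetical names chosen for this illustration.

```javascript
// Hypothetical list of abbreviations that should not end a sentence on their own.
const ABBREVIATIONS = ["etc.", "e.g.", "i.e."];

function splitSentences(text) {
  // Same idea as the question's regex: split at . ? or ! followed by a space or end of input.
  const pieces = text.match(/.*?[.?!](?: |$)/g) || [];
  const sentences = [];
  let pending = "";
  for (const piece of pieces) {
    pending += piece;
    const endsWithAbbreviation = ABBREVIATIONS.some((abbr) =>
      pending.trimEnd().endsWith(" " + abbr)
    );
    // Hold the piece back when it ends in an abbreviation, so the next piece is appended to it.
    if (!endsWithAbbreviation) {
      sentences.push(pending);
      pending = "";
    }
  }
  // An abbreviation at the very end of the text still closes the last sentence.
  if (pending) sentences.push(pending);
  return sentences;
}

const input =
  "hello how are you, can you pass me pen, book etc. I am going to travel abroad. " +
  "I am going on vacation. Let me know if anything needs to be done in something.com.";
console.log(splitSentences(input));
```

This produces the three sentences from the expected output. It only handles the final-abbreviation case at the very end of the text: in "I am going on vacation to relax etc. Then we left." the two sentences would be merged, which is exactly the ambiguity described above.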

The easiest way would be to write .. or ... after etc. in the text itself. However, if you can't do that, I would add a specific matching case for etc., since it is indeed a special case. Try looking at this:

http://regexone.com/lesson/matching_characters (Look at the solution to get an idea)

One possible solution would be this:

(?<![\w\d])etc(?![\w\d])

This matches etc only when no word character comes directly before or after it, so periods around it are fine. I believe it would still accept .etc, though, if that is a problem.
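A quick check of that pattern, assuming a JavaScript engine with lookbehind support (ES2018 or later); the test strings are made up for illustration:

```javascript
// "etc" with no word character directly before or after it.
const etcRe = /(?<![\w\d])etc(?![\w\d])/;

const standalone = etcRe.test("pen, book etc. I am"); // "etc" between a space and a period
const insideWord = etcRe.test("fetch the book");      // "etc" inside "fetch"
const prefixOnly = etcRe.test("etcetera");            // "etc" followed by a letter
const afterDot = etcRe.test("file.etc");              // "etc" preceded by a period

console.log(standalone, insideWord, prefixOnly, afterDot); // true false false true
```

The last case confirms that .etc is accepted, because a period is not a word character.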

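You could also try this:

/([a-zA-Z0-9\ \,]+(?!\ etc)\.)/g

Note that you said not to match "etc.", but this pattern does not exclude it: the lookahead (?!\ etc) sits directly before \. and the next character there is always a period, so it never rejects anything and the first match still ends at "etc.". With this regexp the domain name is also split, because there is a dot between something and com.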
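Running that last pattern on the sentence from the question shows both effects. This is only a demonstration of the regex as posted, with variable names of my own choosing:

```javascript
const sample =
  "hello how are you, can you pass me pen, book etc. I am going to travel abroad. " +
  "I am going on vacation. Let me know if anything needs to be done in something.com.";

// Runs of letters, digits, spaces and commas, ending at a period.
const lastRe = /([a-zA-Z0-9\ \,]+(?!\ etc)\.)/g;

const parts = sample.match(lastRe);
console.log(parts.length); // 5 pieces
console.log(parts[0]);     // ends at "etc."
console.log(parts[4]);     // "com."
```

The first piece still stops at "etc.", and "something.com." ends up as two pieces, so this pattern would need further work before it yields the expected output.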
