How to avoid splitting on a dot (.) that has no space after it? [Regular Expression]

This is my code: [^\\.!\\?]+[!\\?\\.]

I want to split a post cleanly into sentences using a JavaScript regex. The problem is that a dot (.) sitting between characters, with no space after it, is also treated as a sentence end, so text that should stay together gets split.

For example: " Apa yang terjadi? Aku terkena musibah! Uang saya 90.000 dicuri maling. "

Uang saya 90.

and

000 dicuri maling.

should merge into

Uang saya 90.000 dicuri maling.

See the attached picture below.

[Image: regular expression tester]
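
The split described in the question can be reproduced with a short sketch. It assumes the pattern is used as a global regex with `String.prototype.match`, and it writes the pattern without the redundant escapes; the names `text`, `naive` and `parts` are illustrative only.

```javascript
// Sample text from the question.
const text = "Apa yang terjadi? Aku terkena musibah! Uang saya 90.000 dicuri maling.";

// Assumed equivalent of the question's pattern: a run of characters that are
// not sentence terminators, followed by exactly one terminator.
const naive = /[^.!?]+[!?.]/g;

// Trim the leading space that each match after the first one carries.
const parts = text.match(naive).map((s) => s.trim());

console.log(parts);
// [ 'Apa yang terjadi?', 'Aku terkena musibah!', 'Uang saya 90.', '000 dicuri maling.' ]
```

Because the character class excludes every dot, the match has to stop at the dot inside 90.000, whatever follows it.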

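Try splitting on ([.!?])\s, that is, a sentence terminator followed by whitespace. The capture group keeps each terminator in the resulting array, so it can be joined back onto its sentence:

 let str = "Apa yang terjadi? Test test test. Aku terkena musibah! Uang saya 90.000 dicuri maling.";
 // Split on a terminator followed by whitespace; the dot in 90.000 is left alone.
 let parts = str.split(/([.!?])\s/);
 let res = [];
 // Even indexes hold sentence text, odd indexes hold the captured terminator.
 for (let i = 0; i < parts.length; i += 2) {
   let terminator = i + 1 < parts.length ? parts[i + 1] : '';
   res.push(parts[i] + terminator);
 }
 console.log(res);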

This should work in most cases:

(?=[^ ]|^).+?[?!.](?= |$|\n)

Checked here: https://regexr.com/
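
As a rough illustration of how this pattern could be applied in JavaScript, here is a minimal sketch that assumes the global flag and `String.prototype.match`; the names `text`, `sentenceRe` and `sentences` are made up for the example.

```javascript
// Sample text from the question.
const text = "Apa yang terjadi? Aku terkena musibah! Uang saya 90.000 dicuri maling.";

// A match must start on a non-space character (or at the start of the input)
// and ends at the first terminator that is followed by a space, a newline,
// or the end of the input.
const sentenceRe = /(?=[^ ]|^).+?[?!.](?= |$|\n)/g;

const sentences = text.match(sentenceRe) ?? [];

console.log(sentences);
// [ 'Apa yang terjadi?', 'Aku terkena musibah!', 'Uang saya 90.000 dicuri maling.' ]
```

The dot in 90.000 is followed by a digit, so the trailing lookahead rejects it and the lazy `.+?` keeps extending to the next terminator.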

Even better, you can use the following pattern, which accepts several spaces or other whitespace characters after the sentence-ending character and does not include the leading whitespace in the extracted string:

[^\s].+?[?!.](?=\s+|$)

Limitations:

  • 10 BC and other abbreviations will be detected as a sentence...
  • strings with no space after the terminator, like terkena musibah!Uang saya 90.000 dicuri maling. will be detected as one sentence...
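
The whitespace handling and the second limitation above can be checked with a small sketch. The two input strings are variations on the question's text invented for this example, and the variable names are hypothetical.

```javascript
// Whitespace-tolerant pattern from this answer, with the global flag added.
const sentenceRe = /[^\s].+?[?!.](?=\s+|$)/g;

// Sentences separated by two spaces plus a tab, and by a newline.
const spaced = "Apa yang terjadi?  \tAku terkena musibah!\nUang saya 90.000 dicuri maling.";
const spacedResult = spaced.match(sentenceRe);
console.log(spacedResult);
// [ 'Apa yang terjadi?', 'Aku terkena musibah!', 'Uang saya 90.000 dicuri maling.' ]

// No whitespace after the "!": the lookahead never succeeds there,
// so everything up to the final dot is returned as a single match.
const glued = "terkena musibah!Uang saya 90.000 dicuri maling.";
const gluedResult = glued.match(sentenceRe);
console.log(gluedResult);
// [ 'terkena musibah!Uang saya 90.000 dicuri maling.' ]
```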

New version:

I have adapted the regex as follows to address the limitations of the patterns proposed so far:

[^\s.!?][a-zA-Z@#$%^&,;"':*()-_+=/\\|{}><()[\]\s\d]*?([?!]|((?<=[^A-Z])\.(?=[^0-9])))

and I have tested it on the following text:

 Apa ya{ng terjadi? Ak[u +10 BC ter,ke]na 10.3 mus}ibah.Uang say\\a 90!000 dic&uri ma|ling. Apa yang te*r(j)adi? Aku terkena mus%ibah! Uang sa^ya 90.000 dicuri maling. ter;ke|na mus-ibah?uang saya 90..000 dicuri m"aling. ter@kena mus+ibah!ua=ng say$a 90?000 dicuri ma'ling. terk\\ena mus#ibah.uang saya 90.000 dicuri maling. Apa yang terjadi? Aku 10 BC terke\\na mu/sibah.Uang saya 90!000 dicuri maling. Apa yang terjadi? Aku -10 BC terke\\na mu/sibah. Uang saya 90!000 dicuri maling. 

Advantages:

Abbreviations are preserved: Ak[u +10 BC ter,ke]na 10.3 mus}ibah. is seen as one sentence, preserving the BC.

Sentences with no space after the terminator are split: terkena musibah!Uang saya 90.000 dicuri maling. is separated into two sentences, terkena musibah! and Uang saya 90.000 dicuri maling.
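
To try the adapted pattern, a minimal sketch might look like the following. The sample string is assembled from the two fragments discussed above and is an assumption of this sketch, as are the names `adaptedRe`, `sample` and `result`; the lookbehind `(?<=...)` needs an engine with ES2018 support.

```javascript
// Adapted pattern from this answer, copied as-is, with the global flag added.
const adaptedRe = /[^\s.!?][a-zA-Z@#$%^&,;"':*()-_+=/\\|{}><()[\]\s\d]*?([?!]|((?<=[^A-Z])\.(?=[^0-9])))/g;

// Hypothetical sample built from the answer's fragments. It ends with a space
// on purpose: the (?=[^0-9]) lookahead needs a character after the final dot.
const sample = "Ak[u +10 BC ter,ke]na 10.3 mus}ibah. terkena musibah!Uang saya 90.000 dicuri maling. ";

const result = sample.match(adaptedRe);

console.log(result);
// [ 'Ak[u +10 BC ter,ke]na 10.3 mus}ibah.',
//   'terkena musibah!',
//   'Uang saya 90.000 dicuri maling.' ]
```

Two details are worth knowing when reusing it. A last sentence that ends exactly at the end of the input is not matched, because `(?=[^0-9])` requires some character after the dot. Also, `)-_` inside the character class is parsed as a range, and that range is what lets the dots in 10.3 and 90.000 be consumed as ordinary characters.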

Good luck!


 