
Split a (Bengali) string by a regex without consuming the delimiter, zero-width forward lookahead not working

Actually I can't get it to work in English either. I'm looking for an expression that combines two regexes:

const string = 'Joe shouted, \"The pitchforks are coming!\"';

zeroWidthSplitter = /(?=e)/g;  //splits before every "e"

string.split(zeroWidthSplitter)
    // → ["Jo", "e shout", "ed, \"Th", "e pitchforks ar", "e coming!\""]

wordRegex = /[a-zA-Z]+/g;    // matches all English letters, 
                             // discarding spaces and punctuation

string.match(wordRegex)
    // → ["Joe", "shouted", "The", "pitchforks", "are", "coming"]

What I want is a zeroWidthWordDelimiter that behaves like splitting on a word boundary, keeping spacing and punctuation separate from words:

string.split(/(?:\b)/gm);

// the string is split strictly into words and non-words

0: "Joe"
1: " "
2: "shouted"
3: ", \""
4: "The"
5: " "
6: "pitchforks"
7: " "
8: "are"
9: " "
10: "coming"
11: "!\""
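
For the English case, one possible zero-width delimiter is sketched below. It assumes an engine with lookbehind support (ES2018 or later) and defines a boundary as any position with a letter on exactly one side. The names sample, pieces and piecesViaCapture are illustrative only. The second variant avoids lookbehind: a capturing group in split() keeps the matched words in the result.

```javascript
// Sketch only; `sample`, `pieces` and `piecesViaCapture` are illustrative names.
const sample = 'Joe shouted, "The pitchforks are coming!"';

// Zero-width boundary: a position with an English letter on exactly one side.
const zeroWidthWordDelimiter =
  /(?<=[a-zA-Z])(?=[^a-zA-Z])|(?<=[^a-zA-Z])(?=[a-zA-Z])/g;

const pieces = sample.split(zeroWidthWordDelimiter);
console.log(pieces);

// Alternative without lookbehind: split() includes captured groups in its
// result, and filter(Boolean) drops the empty string produced at the start.
const piecesViaCapture = sample.split(/([a-zA-Z]+)/).filter(Boolean);
console.log(piecesViaCapture);
```

Unlike \b, this character class does not count digits or underscores as word characters, so the two approaches can differ on strings that contain them.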

However, I want to split a string of foreign (Bengali) characters, and JavaScript's \b does not treat these as word characters, so word boundaries do not work.
I can match the words, and match the gaps, by putting all the Bengali letters in a [character class]+.
Wiktor's suggestion, to separate the characters into a non-capturing group (?:a|b|c|d), is a big improvement. It matches words successfully, but loses punctuation.
So does Peter's even slicker regex /[^\p{Script=Bengali}]+/u, which uses Unicode property escapes.

// Double quotes are used because the text itself contains an apostrophe.
const BengaliString = "হঠাৎ একটা মেয়ে বাকি দু'জন কে কানে কানে বললো,“আমি যেমন টা করবো তোরা সেরকম আমার সাথে থাকবি ।”";

const BengaliRegex = /[ড়ঢ়ঁংঃঅআইঈউঊঋঌএঐওঔকখগঘঙচছজঝঞটঠডঢণতথদধনপফববভমমযরলশষসহািীুূৃৄেৈোৌ্ৎড়ঢ়য়]+/gm; // groups words
const BengaliGapsRegex = /[^ড়ঢ়ঁংঃঅআইঈউঊঋঌএঐওঔকখগঘঙচছজঝঞটঠডঢণতথদধনপফববভমমযরলশষসহািীুূৃৄেৈোৌ্ৎড়ঢ়য়]+/gm; // groups gaps
const BengaliDelimiter = /(?=[^ড়ঢ়ঁংঃঅআইঈউঊঋঌএঐওঔকখগঘঙচছজঝঞটঠডঢণতথদধনপফববভমমযরলশষসহািীুূৃৄেৈোৌ্ৎড়ঢ়য়]+)/gm; // zero width but breaks apart many words
const BengaliRegexWiktor = /(?:ড়|ঢ|়|ঁ|ং|ঃ|অ|আ|ই|ঈ|উ|ঊ|ঋ|ঌ|এ|ঐ|ও|ঔ|ক|খ|গ|ঘ|ঙ|চ|ছ|জ|ঝ|ঞ|ট|ঠ|ড|ঢ|ণ|ত|থ|দ|ধ|ন|প|ফ|ব|ব|ভ|ম|ম|য|র|ল|শ|ষ|স|হ|া|ি|ী|ু|ূ|ৃ|ৄ|ে|ৈ|ো|ৌ|্|ৎ|ড়|ঢ়|য়)+/mg; // groups words perfectly
const BengaliSplitterWiktor = /(?=(?:ড়|ঢ|়|ঁ|ং|ঃ|অ|আ|ই|ঈ|উ|ঊ|ঋ|ঌ|এ|ঐ|ও|ঔ|ক|খ|গ|ঘ|ঙ|চ|ছ|জ|ঝ|ঞ|ট|ঠ|ড|ঢ|ণ|ত|থ|দ|ধ|ন|প|ফ|ব|ব|ভ|ম|ম|য|র|ল|শ|ষ|স|হ|া|ি|ী|ু|ূ|ৃ|ৄ|ে|ৈ|ো|ৌ|্|ৎ|ড়|ঢ়|য়)+)/gm; // doesn't group multiple letters using +
const BengaliRegexPeter = /[^\p{Script=Bengali}]+/u; // beautiful, but doesn't keep punctuation and spacing

console.log("Bengali Gaps Regex. " + BengaliString.split(BengaliGapsRegex));
console.log("Regex Delimiter. " + BengaliString.split(BengaliDelimiter));
console.log("Bengali Regex Wiktor. " + BengaliString.match(BengaliRegexWiktor));
console.log("Bengali Splitter Wiktor. " + BengaliString.split(BengaliSplitterWiktor));
console.log("Regex Peter. " + BengaliString.split(BengaliRegexPeter));
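
A possible zero-width delimiter for the Bengali case is sketched below. It combines Peter's Unicode property escape with lookbehind and lookahead, and assumes an engine that supports both (ES2018 or later). The names bengaliBoundary, splitBengali, bengaliSample and bengaliPieces are hypothetical and do not come from the post.

```javascript
// Sketch only; all names here are illustrative, not from the original post.
// A boundary is any position with a Bengali-script character on exactly one side.
// \P{...} (capital P) is the negation of \p{...}; both require the u flag.
const bengaliBoundary =
  /(?<=\p{Script=Bengali})(?=\P{Script=Bengali})|(?<=\P{Script=Bengali})(?=\p{Script=Bengali})/gu;

function splitBengali(text) {
  // Zero-width matches consume nothing, so joining the pieces restores the input.
  return text.split(bengaliBoundary);
}

const bengaliSample = "কানে কানে বললো,“আমি যেমন টা করবো”";
const bengaliPieces = splitBengali(bengaliSample);
console.log(bengaliPieces);
```

Two caveats, as far as I can tell: an apostrophe inside a word (as in দু'জন) is not Bengali script, so it becomes its own piece, and zero-width joiners (U+200C, U+200D) have the Inherited script, so text containing them may need those code points added to the word side of the pattern.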


I ended up with this very ugly solution (but it works):

Use this regex pattern for detecting Bangla letters:

/[\u0981-\u0983\u0985-\u09ce\u09e6-\u09ef]/g
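
A code-point-range class like this can also be used to split while keeping the gaps, without lookbehind or property escapes. The sketch below is only an illustration: it assumes the whole Unicode Bengali block (U+0980 to U+09FF) is an acceptable approximation of "Bengali word characters", which is broader than the pattern above, and the names bengaliRun and splitKeepingGaps are hypothetical.

```javascript
// Assumption: U+0980-U+09FF (the Unicode Bengali block) approximates the set
// of Bengali word characters. This is broader than the answer's pattern.
const bengaliRun = /([\u0980-\u09FF]+)/;

function splitKeepingGaps(text) {
  // A capturing group makes split() include the matched runs in its result;
  // filter(Boolean) drops the empty strings produced at the edges.
  return text.split(bengaliRun).filter(Boolean);
}

console.log(splitKeepingGaps("আমি যেমন টা করবো।"));
```

The danda (।, U+0964) lies outside the Bengali block, so with this class it ends up in the punctuation pieces rather than in a word.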
