Split string in JavaScript using regex with zero-width lookbehind

I know JavaScript regular expressions have native lookaheads but not lookbehinds.

I want to split a string at points either beginning with any member of one set of characters or ending with any member of another set of characters.

Split before ໃ, ໄ, ໂ, ເ, ແ. Split after ະ.

In: ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ

Out: ເລື້ອຍໆມະ ຫັດສະ ຈັນ ເອກອັກຄະ ລັດຖະ ທູດ

I can achieve the "split before" part using zero-width lookahead:

'ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ'.split(/(?=[ໃໄໂເແ])/)

["ເລື້ອຍໆມະຫັດສະຈັນ", "ເອກອັກຄະລັດຖະທູດ"]

But I can't think of a general approach to simulating zero-width lookbehind.

I'm splitting strings of arbitrary Unicode text, so I don't want to substitute in special markers in a first pass, since I can't guarantee the absence of any string from my input.

Instead of splitting, you may consider using the match() method.

var s = 'ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ',
    r = s.match(/(?:(?!ະ).)+?(?:ະ|(?=[ໃໄໂເແ]|$))/g);

console.log(r); //=> [ 'ເລື້ອຍໆມະ', 'ຫັດສະ', 'ຈັນ', 'ເອກອັກຄະ', 'ລັດຖະ', 'ທູດ' ]
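Since the question asks for a general approach, here is a minimal sketch of how the match() idea might be wrapped in a reusable function. The names `escapeForClass` and `splitBeforeAfter` are hypothetical, and the pattern is a variation on the regex above rather than the same one: it uses `[\s\S]` instead of `.` so newlines are kept, and it lets a lone "after" character form its own piece so that no input is dropped. It assumes an engine that supports the `u` flag.

```javascript
// Hypothetical helper: escape characters that are special inside a character class.
function escapeForClass(chars) {
  return chars.replace(/[\\\]\[^-]/g, '\\$&');
}

// Hypothetical helper: cut `text` before any character in `beforeChars`
// and after any character in `afterChars`, using match() instead of split().
function splitBeforeAfter(text, beforeChars, afterChars) {
  var before = '[' + escapeForClass(beforeChars) + ']';
  var after = '[' + escapeForClass(afterChars) + ']';
  var re = new RegExp(
    // Either a lone "after" character...
    after +
    // ...or one character that is not an "after" character,
    '|(?:(?!' + after + ')[\\s\\S])' +
    // followed by characters that are neither "after" nor "before",
    '(?:(?!' + after + '|' + before + ')[\\s\\S])*' +
    // optionally closed by one "after" character.
    after + '?',
    'gu'
  );
  return text.match(re) || [];
}

var pieces = splitBeforeAfter('ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ', 'ໃໄໂເແ', 'ະ');
console.log(pieces);
//=> [ 'ເລື້ອຍໆມະ', 'ຫັດສະ', 'ຈັນ', 'ເອກອັກຄະ', 'ລັດຖະ', 'ທູດ' ]
```

Because every character ends up in exactly one piece, joining the result with an empty string gives back the original input, which is a cheap sanity check on arbitrary text.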

If you use capturing parentheses in the delimiter regex, the captured text is included in the returned array. So you can just split on /(ະ)/ and then concatenate each of the odd members of the resulting array to the preceding even member. Example:

"ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູ".split(/(ະ)/).reduce(function(arr,str,index) {
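   if (index % 2 === 0) {
     arr.push(str);               // even index: text between delimiters
   } else {
     arr[arr.length - 1] += str;  // odd index: the captured ະ, appended to the previous piece
   }
   return arr;
 }, [])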

Result: ["ເລື້ອຍໆມະ", "ຫັດສະ", "ຈັນເອກອັກຄະ", "ລັດຖະ", "ທູ"]

You can do another pass to split on the lookahead:

"ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູ".split(/(ະ)/).reduce(function(arr,str,index) {
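   if (index % 2 === 0) {
     arr.push(str);               // even index: text between delimiters
   } else {
     arr[arr.length - 1] += str;  // odd index: the captured ະ, appended to the previous piece
   }
   return arr;
 }, []).reduce(function(arr, str) { return arr.concat(str.split(/(?=[ໃໄໂເແ])/)); }, []);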

Result: ["ເລື້ອຍໆມະ", "ຫັດສະ", "ຈັນ", "ເອກອັກຄະ", "ລັດຖະ", "ທູ"]
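The two passes above emulate a lookbehind by hand. On engines that implement ES2018 lookbehind assertions (an assumption about your runtime; current Node.js and evergreen browsers do), the same split can probably be written directly. This is a sketch using the input string from the question, not part of the original answer:

```javascript
// Assumes the engine supports ES2018 lookbehind assertions, (?<=...).
var s = 'ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ';

// Split at any position that follows ະ, or that precedes one of ໃ ໄ ໂ ເ ແ.
var parts = s.split(/(?<=ະ)|(?=[ໃໄໂເແ])/);

console.log(parts);
//=> [ 'ເລື້ອຍໆມະ', 'ຫັດສະ', 'ຈັນ', 'ເອກອັກຄະ', 'ລັດຖະ', 'ທູດ' ]
```

Both alternatives are zero-width, so nothing is consumed and no marker characters need to be inserted into the text.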

You could try matching rather than splitting:

> var re = /((?:(?!ະ).)+(?:ະ|$))/g;
undefined
> var str = "ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ"
undefined
> var m;
undefined
> while ((m = re.exec(str)) != null) {
... console.log(m[1]);
... }
ເລື້ອຍໆມະ
ຫັດສະ
ຈັນເອກອັກຄະ
ລັດຖະ
ທູດ

Then split the elements in the array again using lookahead.
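That second pass might look like the following sketch. It reuses the regex from the session above (without the redundant outer capture group) and collects the matches with match() instead of an exec() loop; the variable names `chunks` and `result` are illustrative only.

```javascript
var re = /(?:(?!ະ).)+(?:ະ|$)/g;
var str = "ເລື້ອຍໆມະຫັດສະຈັນເອກອັກຄະລັດຖະທູດ";

// First pass: chunks that end with ະ (or at the end of the string).
var chunks = str.match(re) || [];

// Second pass: split each chunk before any of ໃ ໄ ໂ ເ ແ and flatten the result.
var result = chunks.reduce(function (acc, chunk) {
  return acc.concat(chunk.split(/(?=[ໃໄໂເແ])/));
}, []);

console.log(result);
//=> [ 'ເລື້ອຍໆມະ', 'ຫັດສະ', 'ຈັນ', 'ເອກອັກຄະ', 'ລັດຖະ', 'ທູດ' ]
```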
