
String replace regex match's capture group unless a different regex matches the same capture group in JS

This is about the simplest example I can come up with (and it is actually not far from the real use case) for which I anticipate people will not simply tell me to change the regex itself, but here goes.

Say I have these lovely nonsensical input words: sent seimk semek t͡ʃeno eint͡ɬi em t͡ʃeʃeimp t͡ɬent͡ɬien keinsen

And I want to search and replace all instances of ei OR e with a when they come before n or m, which (ei|e)(?:n|m) matches, UNLESS

  1. it is directly preceded (i.e. no negative lookbehinds, since they'll cause a false negative) by t͡ɬ, t͡ʃ, or d͡ʒ, i.e. if (?:t͡ɬ|t͡ʃ|d͡ʒ)(ei|e) matches, or

  2. the n/m is directly followed (i.e. no negative lookaheads, for the same reason) by p, t, or k, i.e. if (ei|e)(?:n|m)(?:p|t|k) matches

The desired output is therefore sent seimk samek t͡ʃeno ant͡ɬi am t͡ʃeʃeimp t͡ɬent͡ɬian kansan.

So if the "find and replace" function in JS is String.replace(RegExp pattern, String replacement), then you would have to compress all 3 regexes into 1 for the first argument, which 1) I don't think is possible? It would require negative non-capturing groups which... aren't a thing, right? And 2) in the actual use case, the regex patterns are generated by a text parser, not by hand, and I don't have enough faith in my ability to write a parser smart enough to optimize them.

The other way I thought about doing this was to simply cache all the matches of the first regex in a dictionary, and then remove everything that should be ruled out by the other two, but when you look at the output:

    let sEnv = "(ei|e)(?:n|m)";
    let reEnv = new RegExp(sEnv, "g");
    let sExc1 = "(?:t͡ɬ|t͡ʃ|d͡ʒ)(ei|e)";
    let reExc1 = new RegExp(sExc1, "g");
    let sExc2 = "(ei|e)(?:n|m)(?:p|t|k)";
    let reExc2 = new RegExp(sExc2, "g");
    let sWords = "sent seimk semek t͡ʃeno eint͡ɬi em t͡ʃeʃeimp t͡ɬent͡ɬien keinsen";
    let tWords = sWords.split(" ");
    for (i = 0; i < tWords.length; i++){
        let sCurrentWord = tWords[i];
        let result;
        let tMatches = {}
        // first cache all the matches (ignoring the exceptions)
        while (result = reEnv.exec(sCurrentWord)) {
            tMatches[result.index] = result[0].length;
        }
        // then remove the exceptions
        while (result = reExc1.exec(sCurrentWord)) {
            delete tMatches[result.index];
        }
        while (result = reExc2.exec(sCurrentWord)) {
            delete tMatches[result.index];
        }
        // then apply all remaining matches
        let sOutput = sCurrentWord;
        for (var index in tMatches){
            console.log(sCurrentWord+": starting at "+index+", "+tMatches[index]+" chars long");
        }
    }

It's... a bit confused. Not only is it apparently capturing the n despite it being in a non-capturing group (the printed result length keeps being 2 when the thing I want to capture is only 1 char long), but it also filtered out eint͡ɬi and kept t͡ʃeno as a match, which is the opposite of what it's supposed to do. And as the input list grows, this method must get quite slow.

How else can I go about this find-and-replace?
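
The symptoms described above can be reproduced in isolation. The following is only a minimal sketch (the variable names are made up for illustration): it shows that result[0] is the whole match, so it includes the n even though the group is non-capturing, that result.index is where the whole match starts, so the index reported by the first exception regex differs from the index cached for the same e, and that the second exception regex accepts the bare t at the start of t͡ɬ.

```javascript
// Index 0 of an exec() result is the WHOLE match; index 1 is capture group 1.
const envMatch = /(ei|e)(?:n|m)/.exec('keinsen');
console.log(envMatch[0]); // "ein" (the non-capturing group is still consumed)
console.log(envMatch[1]); // "ei"

// result.index is the start of the whole match, not of the capture group.
const word = 't͡ʃeno';
const envIndex = /(ei|e)(?:n|m)/.exec(word).index;        // start of "en"
const excIndex = /(?:t͡ɬ|t͡ʃ|d͡ʒ)(ei|e)/.exec(word).index; // start of "t͡ʃe"
console.log(envIndex === excIndex); // false, so the delete never hits the cached key

// The second exception matches the plain t that begins t͡ɬ.
const excludesAffricate = /(ei|e)(?:n|m)(?:p|t|k)/.test('eint͡ɬi');
console.log(excludesAffricate); // true
```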

You actually have one more hidden requirement here: a p, t or k counts as an exception only when it is not followed by a diacritic mark (otherwise the t in t͡ɬ would block the replacement in eint͡ɬi).

Your main mistake is thinking that lookarounds cannot be used to match locations immediately preceded or followed by some pattern. In fact, that is exactly what lookarounds always do: they match locations that are immediately preceded or followed by some pattern.
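
To make the difference concrete, here is a small sketch with a made-up sample string (not part of the question's data). A lookaround only tests the text next to the current position without consuming it, whereas a non-capturing group still becomes part of the match and is therefore replaced too.

```javascript
const sample = 't͡ʃen sen sent';

// Lookarounds: only the e is matched; t͡ʃ and n are checked but not consumed.
const withLookarounds = sample.replace(/(?<!t͡ʃ)e(?=n)/g, 'a');
console.log(withLookarounds); // "t͡ʃen san sant"

// Non-capturing group: the n is part of the match, so it is replaced as well.
const withGroup = sample.replace(/e(?:n)/g, 'a');
console.log(withGroup); // "t͡ʃa sa sat"
```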

In your case, you can use (assuming you are using an ECMAScript 2018 compliant RegExp engine):

text = text.replace(/(?<!t͡ɬ|t͡ʃ|d͡ʒ)(ei?)(?=[nm](?![ptk](?!\p{M})))/gu, 'a')

Details:

  • (?<!t͡ɬ|t͡ʃ|d͡ʒ) - a negative lookbehind that fails the match if there is t͡ɬ, t͡ʃ or d͡ʒ immediately to the left of the current location
  • (ei?) - ei or e
  • (?=[nm](?![ptk](?!\p{M}))) - a positive lookahead that matches a location immediately followed by n or m, where that n or m is not immediately followed by a p, t or k that is itself not immediately followed by a diacritic mark ( \p{M} ).
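
The effect of the innermost (?!\p{M}) can be checked by comparing the pattern with a variant that leaves it out. This is only an illustrative sketch: withoutGuard is a hypothetical variant written for the comparison, not something proposed as a solution.

```javascript
const withGuard = /(?<!t͡ɬ|t͡ʃ|d͡ʒ)(ei?)(?=[nm](?![ptk](?!\p{M})))/gu;
// Hypothetical variant without the diacritic check, for comparison only.
const withoutGuard = /(?<!t͡ɬ|t͡ʃ|d͡ʒ)(ei?)(?=[nm](?![ptk]))/gu;

const affricateWord = 'eint͡ɬi';

// The character after the t is a combining tie bar, which \p{M} matches.
console.log(/\p{M}/u.test(affricateWord[4])); // true

console.log(affricateWord.replace(withGuard, 'a'));    // "ant͡ɬi"
console.log(affricateWord.replace(withoutGuard, 'a')); // "eint͡ɬi" (wrongly left alone)

// A plain t still blocks the replacement in both versions.
console.log('sent'.replace(withGuard, 'a'));    // "sent"
console.log('sent'.replace(withoutGuard, 'a')); // "sent"
```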

JavaScript demo:

    const regex = /(?<!t͡ɬ|t͡ʃ|d͡ʒ)(ei?)(?=[nm](?![ptk](?!\p{M})))/gu;
    const text = 'sent seimk semek t͡ʃeno eint͡ɬi em t͡ʃeʃeimp t͡ɬent͡ɬien keinsen';
    console.log(text.replace(regex, 'a'));
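
If the three patterns have to stay separate because they are machine-generated, one possible alternative is to compare the position of capture group 1 rather than the position of the whole match. The sketch below is not part of the answer above. It assumes an engine that supports the d (hasIndices) flag from ECMAScript 2022, the helper name replaceUnlessExcluded is hypothetical, and the second exception has (?!\p{M}) appended to reflect the hidden diacritic requirement. Since matchAll returns non-overlapping matches, exception patterns whose matches can overlap each other would need extra care.

```javascript
// Hypothetical helper: replaces capture group 1 of `target` unless some
// exception pattern captures a group 1 starting at the same offset.
function replaceUnlessExcluded(input, target, exceptions, replacement) {
  const blocked = new Set();
  for (const re of exceptions) {
    for (const m of input.matchAll(re)) {
      blocked.add(m.indices[1][0]); // start offset of capture group 1
    }
  }
  let out = '';
  let last = 0;
  for (const m of input.matchAll(target)) {
    const [start, end] = m.indices[1];
    if (blocked.has(start)) continue; // ruled out by an exception
    out += input.slice(last, start) + replacement;
    last = end;
  }
  return out + input.slice(last);
}

const targetRe = /(ei|e)(?:n|m)/dgu;
const exceptionRes = [
  /(?:t͡ɬ|t͡ʃ|d͡ʒ)(ei|e)/dgu,
  /(ei|e)(?:n|m)(?:p|t|k)(?!\p{M})/dgu, // assumption: diacritic check added
];
const words = 'sent seimk semek t͡ʃeno eint͡ɬi em t͡ʃeʃeimp t͡ɬent͡ɬien keinsen';
const replaced = replaceUnlessExcluded(words, targetRe, exceptionRes, 'a');
console.log(replaced);
// sent seimk samek t͡ʃeno ant͡ɬi am t͡ʃeʃeimp t͡ɬent͡ɬian kansan
```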
