
Match comments unless the initiating character is surrounded by unescaped quotes

With a regex: how can I match comments which begin with a semicolon, unless the semicolon is surrounded on both sides by unescaped quotes, as shown below (the green blocks denote the matched comments)?

[Image: sample input and output]

Note that the dquotes can be escaped by doubling them up "". Such escaped dquotes behave as completely different characters, i.e. they do not have the ability to surround the semicolon and disable its comment-starting function.

Also, unbalanced dquotes are treated as escaped dquotes.

With Bubble's help, I have gotten as far as the regex below, which fails to correctly treat a trailing escaped dquote in the last test vector line.

^(?>(?:""[^""\n]*""|[^;""\n]+)*)""?[^"";\n]*(;.*)

See it run here.

Test vectors (the same as in the color-coded diagram above):

Peekaboo ; A comment starts with a semicolon and continues till the EOL
Unless the semicolon is surrounded by dquotes "Don’t do it ; here" ;but match me; once
Im not surrounded "so pay attention to me" ; "peekaboo"
Im not surrounded "so pay attention" to;me" ; "peekaboo"
Im not surrounded "so pay attention to me ; peekaboo
Dquote escapes a dquote so "dont pay attention to ""me;here"" buster" do it ; here
Don’t pay attention to  """me;here""" but do ""it;here""
and "dont do ""it;here"""  either ;peekaboo
but "pay attention to "it;here"" ;not here though
Simon said "I like goats" then he added "and sheep;" ;a good comment is "here
Simon said "I like goats" then he added "and sheep;" dont do it here
Simon said ""I like goats;"peekaboo
Simon said "I like goats;""peekaboo

The task is to find comments starting with a ; semicolon outside quotes, taking into account "" escaped quotes and a potential non-closed quote before it. This approach works for the test cases provided so far.

Updated pattern: A shorter and more efficient variant without alternation.

^((?>(?:(?:[^"\n;]*"[^"\n]*")+(?!"))?[^"\n;]*)"?[^"\n;]*);.*

New demo at regex101
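As a rough illustration, the updated pattern could be applied line by line in Python as sketched below. The helper name `find_comment` and the named group `comment` are made up for this example and are not part of the answer. Python's `re` module only accepts atomic groups `(?>...)` from version 3.11 on, so the sketch falls back to a lookahead plus backreference on older versions; that fallback is an assumption of this sketch, chosen because a lookahead is not backtracked into once it has matched.

```python
import re

# Body of the atomic group from the updated pattern.
BODY = r'(?:(?:[^"\n;]*"[^"\n]*")+(?!"))?[^"\n;]*'
# Updated pattern, with the comment wrapped in a named group (added for this sketch).
TEMPLATE = r'^({atomic}"?[^"\n;]*)(?P<comment>;.*)'

try:
    # Python 3.11+ accepts the atomic group as written in the answer.
    COMMENT_RE = re.compile(TEMPLATE.format(atomic='(?>' + BODY + ')'))
except re.error:
    # Older versions: capture inside a lookahead, then consume the captured
    # text with a backreference, which behaves like an atomic group.
    COMMENT_RE = re.compile(
        TEMPLATE.format(atomic='(?=(?P<atom>' + BODY + '))(?P=atom)'))


def find_comment(line):
    """Return the comment (from the semicolon on), or None if there is none."""
    m = COMMENT_RE.match(line)
    return m.group('comment') if m else None


for sample in ['Peekaboo ; a comment',
               'keep "this ; part" ;but match me; once',
               'he added "and sheep;" dont do it here']:
    print(repr(sample), '->', repr(find_comment(sample)))
```

Group 1 is still the text before the comment in both variants, so replacing the match with group 1 keeps working.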

This pattern works without alternation and uses a negative lookahead to check for the last valid double quote. In both patterns the atomic group mimics possessive quantifiers: it prevents any backtracking into the quoted parts that were already matched and so keeps the quotes balanced. Using possessive quantifiers, the pattern would look like this regex101 demo. [^";\n]*"?[^";\n]* is the part that allows an optional non-closed quote.
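To see why the atomic group matters, the following Python sketch compares the updated pattern with a variant where the atomic group is swapped for a plain non-capturing group. The names `LOOSE` and `STRICT` are invented for this example. To keep it runnable on Python versions before 3.11, the atomic behaviour is emulated here with a lookahead plus backreference, which is an assumption of this sketch and not the form used in the answer.

```python
import re

# Body of the atomic group from the updated pattern, and the part after it.
BODY = r'(?:(?:[^"\n;]*"[^"\n]*")+(?!"))?[^"\n;]*'
REST = r'"?[^"\n;]*);.*'

# Plain non-capturing group: the engine may backtrack into the quoted parts.
LOOSE = re.compile(r'^((?:' + BODY + r')' + REST)
# Atomic behaviour, emulated as (?=(BODY))\2 so that the matched body is fixed.
STRICT = re.compile(r'^((?=(' + BODY + r'))\2' + REST)

line = 'he added "and sheep;" dont do it here'

loose = LOOSE.match(line)
strict = STRICT.match(line)
print('loose :', loose.group(1) if loose else None)
print('strict:', strict.group(1) if strict else None)
```

On this line the plain group lets the engine give up the balanced quotes, treat the opening quote as non-closed and report a comment starting at the semicolon inside the quoted text. The atomic form cannot go back, so it reports no comment.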


Previous pattern: This has turned out to be reliable so far, but it is a little bit slower.

^((?>(?:(?:[^;"\n]*"(?>(?:[^"\n]+|"")*)")+)?)[^";\n]*"?[^";\n]*);.*

Old demo at regex101

"(([^"]+|"")*)" consumes either " ... " or "". This gets repeated any number of times, with any [^;"]* characters (neither ; nor ") in between. All of that is done inside an atomic group: once the quoted parts and the non-semicolons between them have been matched, there is no way back. After finally allowing an optional non-closed ", either a ; will be found or the match fails.


The first capturing group $1 contains the part up to the targeted ; comment-start. To remove the comment, replace the full match with the captured part. If needed, capture (.*) in a second group.
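A minimal Python sketch of that replacement is shown below, assuming the updated pattern and multi-line input. The function name `strip_comments` is hypothetical. The `PORTABLE` string is an assumed stand-in for Python versions before 3.11, which reject `(?>...)`; it emulates the atomic group with a lookahead and a backreference, and group 1 keeps the same meaning in both variants.

```python
import re

# Updated pattern as given in the answer (needs Python 3.11+ for the atomic group).
UPDATED = r'^((?>(?:(?:[^"\n;]*"[^"\n]*")+(?!"))?[^"\n;]*)"?[^"\n;]*);.*'
# Same pattern with the atomic group emulated by lookahead + backreference.
PORTABLE = r'^((?=((?:(?:[^"\n;]*"[^"\n]*")+(?!"))?[^"\n;]*))\2"?[^"\n;]*);.*'

try:
    COMMENT_RE = re.compile(UPDATED, re.MULTILINE)
except re.error:
    COMMENT_RE = re.compile(PORTABLE, re.MULTILINE)


def strip_comments(text):
    """Replace each full match with group 1, i.e. drop the comment."""
    return COMMENT_RE.sub(r'\1', text)


source = 'a ; b\nsay "x;y" ;c\nsay "x;y"'
print(strip_comments(source))
```

Whitespace in front of the semicolon belongs to group 1 and is therefore kept; trimming it would be a separate step.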

regex-part              matches
(?> ... )               denotes an atomic group, used to prevent any further backtracking
[^ ... ]                a negated character class matches a single character not in the list
( ... ) and (?: ... )   capturing and non-capturing groups (the latter for repetition or alternation)
quantifiers: ? * +      ? matches zero or one (optional), * any amount and + one or more

If replacements are done on single lines, all the \n newlines can be dropped from either pattern.
