Remove comments from some generated HTML which can be invalid with nested comments

I would like to remove HTML comments from some generated content. If I use the regex /<!--(.*?)-->/ (ungreedy because of the ?), it works for most cases, such as this example:

<!-- <h1> test </h1> --> not remove <!-- <h1> test 2 </h1> -->

It gets rid of the commented-out <h1> tags and leaves the " not remove " as desired.

But if the comments are nested, it does not handle them properly: it leaves the last comment closing tag '-->' behind. The workaround would be to use a greedy pattern, but then it no longer works for the first case, with multiple comments.

Example of nested comments (I know it's not valid HTML, but it's the backend which is generating it):

text <!-- something <!-- <p> test </p> --> need remove -->
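
The two failure modes can be reproduced with a short PHP sketch. It assumes the content is processed with preg_replace(); the variable names are only illustrative, and the s modifier is added so that . also matches newlines.

```php
<?php
// The two sample inputs from the question.
$multiple = '<!-- <h1> test </h1> --> not remove <!-- <h1> test 2 </h1> -->';
$nested   = 'text <!-- something <!-- <p> test </p> --> need remove -->';

$lazy   = '/<!--(.*?)-->/s'; // stops at the first "-->"
$greedy = '/<!--(.*)-->/s';  // runs to the last "-->"

// Lazy: correct for several separate comments, wrong for nested ones.
$lazyMultiple = preg_replace($lazy, '', $multiple);   // ' not remove '
$lazyNested   = preg_replace($lazy, '', $nested);     // 'text  need remove -->'

// Greedy: correct for nested comments, wrong for several separate ones.
$greedyMultiple = preg_replace($greedy, '', $multiple); // ''
$greedyNested   = preg_replace($greedy, '', $nested);   // 'text '
```

The lazy pattern leaves the trailing '-->' of the nested comment behind, while the greedy pattern swallows the " not remove " text between two separate comments.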

I've tried to find a solution, but I don't know how to solve this. Does anyone have an idea how to handle it?

As you mentioned, it's frustrating: the ungreedy rule solves one case and the greedy rule solves the other, but neither solves both at the same time. Well, you are lucky, because PHP's PCRE engine supports recursion :-)

So the problem can be solved with the magic of (?R), which acts a bit like "copy and paste the full pattern here", as I understand it.

The pattern will be: /<!--(?:(?!<!--|-->).|(?R))*-->/gs (the g flag is how regex101 asks for all matches; in PHP, drop it, since preg_replace() already replaces every match and does not accept g as a modifier).

You can test it here: https://regex101.com/r/fZK8VP/1

Explained:

  • <!-- matches the string "<!--".

  • (?: | )* is a non-capturing group which can be repeated any number of times and has two options:

    A) First option:

    • (?!<!--|-->) is a negative lookahead with two alternatives; it says don't match here if what follows is "<!--" or "-->".

    • . matches any character.

    B) Second option: (?R), which is the entire pattern (recursion).

  • --> matches the string "-->".

I've used the s pattern modifier so that . also matches newlines, in case some comments span several lines.
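
As a rough sketch of how this could look in PHP, the pattern can be wrapped in a small helper. The function name strip_html_comments is hypothetical, and the pattern is written in extended (x) mode with ~ as the delimiter only so that each part can carry a comment; it is meant to be equivalent to the one-line version above.

```php
<?php
// Hypothetical helper; not part of any library.
function strip_html_comments(string $html): string
{
    $pattern = '~
        <!--                  # opening delimiter
        (?:
            (?!<!--|-->) .    # any character that does not start a delimiter
          |
            (?R)              # or a complete nested comment (recursion)
        )*
        -->                   # closing delimiter
    ~xs';

    $result = preg_replace($pattern, '', $html);
    if ($result === null) {
        // preg_replace() returns null on failure, e.g. when a PCRE limit is hit.
        throw new RuntimeException('PCRE error code ' . preg_last_error());
    }
    return $result;
}

$multiple  = '<!-- <h1> test </h1> --> not remove <!-- <h1> test 2 </h1> -->';
$nested    = 'text <!-- something <!-- <p> test </p> --> need remove -->';
$multiline = "a<!-- x\n<!-- y -->\nz -->b";

$cleanMultiple  = strip_html_comments($multiple);  // ' not remove '
$cleanNested    = strip_html_comments($nested);    // 'text '
$cleanMultiline = strip_html_comments($multiline); // 'ab'
```

Both of the question's cases are handled by the same pattern. On very large inputs a recursive pattern like this may run into PCRE's backtracking or recursion limits, which is why the sketch checks for a null result instead of assuming success.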
