How to exclude a word from a regular expression

Question

I have a working regular expression, but I want it to drop any match that contains a specific word.

/\<meta[^\>]+(http\-equiv[^\>]+?refresh[^\>]+?(?<!\-)(?<!\d)[0-9]\d*[^\>]+?url[^\>]+?|(?<!\-)(?<!\d)[0-9]\d*[^\>]+?url[^\>]+?http\-equiv[^\>]+?refresh[^\>]+?)\/?\>/is

It matches the following (http-equiv and url in either order):

<meta http-equiv="refresh" content="21;URL='http://example.com/'" />
<meta content="21;URL='http://example.com/'" http-equiv="refresh" />

I want to exclude any URL that contains ?PageSpeed=noscript, for example:

a. <meta content="21;URL='http://example.com/?PageSpeed=noscript'" http-equiv="refresh" />
b. <meta content="21;URL='http://example.com/segment?PageSpeed=noscript&var=value'" http-equiv="refresh" />

Any ideas are much appreciated. Thanks.

Answer 1

You can use a DOM parser instead of a regular expression.

<?php

$meta = '<meta content="21;URL=\'http://example.com/\'" http-equiv="refresh" /> <meta content="21;URL=\'http://example.com/?PageSpeed=noscript\'" http-equiv="refresh" />';

$dom = new DOMDocument;
$dom->loadHTML($meta);
$noPageScripts = [];

foreach ($dom->getElementsByTagName('meta') as $tag) {
  $content = $tag->getAttribute('content');
  // Match the URL
  preg_match('/URL=["\']?([^"\'>]+)["\']?/i',$content,$matches);

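  // Keep only refresh tags whose URL does not contain the excluded query string
  if (strcasecmp($tag->getAttribute('http-equiv'), 'refresh') === 0 && isset($matches[1]) && stripos($matches[1], '?PageSpeed=noscript') === false) {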
    $noPageScripts[] = [
      'originalTag' => $dom->saveHTML($tag),
      'url' => $matches[1]
    ];
  }
}

var_dump($noPageScripts);

Here is a fiddle.

Answer 2

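In my take on this I rewrote the whole pattern for better performance, so it looks a bit different. The basic idea is to add a negative lookahead that rejects the disallowed content at a point where most of the match has already been done. For example, I put it right after http: http(?!\S*?pagespeed=noscript)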

\S*? lazily matches any number of non-whitespace characters. See the SO regex FAQ.
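
To see the negative lookahead in isolation, here is a minimal sketch. The reduced pattern and the variable names ($lookahead, $allowed, $blocked, $spaced) are made up for illustration and are not part of the full pattern:

```php
<?php
// Reduced pattern (illustration only): match "http..." unless the same run of
// non-whitespace characters contains "pagespeed=noscript" (case-insensitive).
$lookahead = '/http(?!\S*?pagespeed=noscript)\S*/i';

// No excluded word after "http": the lookahead passes and the URL matches.
$allowed = preg_match($lookahead, 'http://example.com/', $m1);

// Excluded word inside the URL: the lookahead fails, so there is no match.
$blocked = preg_match($lookahead, 'http://example.com/?PageSpeed=noscript', $m2);

// \S cannot cross the space, so text after it does not block the match.
$spaced = preg_match($lookahead, 'http://example.com/ pagespeed=noscript', $m3);

var_dump($allowed, $blocked, $spaced);
```

Because the lookahead only scans non-whitespace characters, it stops at the end of the URL and does not look into later attributes of the tag.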

The full pattern I tried:

/<meta\s(?=[^><]*?http-equiv[^\w><]+refresh)[^><]*?url=[\s\'\"]*(http(?!\S*?pagespeed=noscript)[^><\s\"\']*)[^><]*>/i

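Another addition is a positive lookahead that matches http-equiv...refresh, which makes the pattern independent of attribute order. It is similar to a regex pattern I posted in a comment on PHP.net a long time ago.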
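
As a rough check, the full pattern can be run against the four sample tags from the question. This is only a sketch: the helper name extractRefreshUrl and the variables $tags and $results are hypothetical and introduced here for illustration.

```php
<?php
// Pattern copied from the answer above. In a single-quoted PHP string only
// \' is unescaped by PHP; \" reaches PCRE unchanged and means a literal quote.
$pattern = '/<meta\s(?=[^><]*?http-equiv[^\w><]+refresh)[^><]*?url=[\s\'\"]*(http(?!\S*?pagespeed=noscript)[^><\s\"\']*)[^><]*>/i';

// Sample tags taken from the question.
$tags = [
    '<meta http-equiv="refresh" content="21;URL=\'http://example.com/\'" />',
    '<meta content="21;URL=\'http://example.com/\'" http-equiv="refresh" />',
    '<meta content="21;URL=\'http://example.com/?PageSpeed=noscript\'" http-equiv="refresh" />',
    '<meta content="21;URL=\'http://example.com/segment?PageSpeed=noscript&var=value\'" http-equiv="refresh" />',
];

// Hypothetical helper: returns the captured URL, or null when the tag is rejected.
function extractRefreshUrl(string $pattern, string $tag): ?string
{
    return preg_match($pattern, $tag, $m) === 1 ? $m[1] : null;
}

$results = [];
foreach ($tags as $tag) {
    $results[] = extractRefreshUrl($pattern, $tag);
}

var_dump($results);
```

The first two tags should yield http://example.com/ in capture group 1 regardless of attribute order, while the two PageSpeed=noscript tags should be rejected.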
