How to exclude a word from regex
I have a regex that works. However, I want it to drop matches that contain a specific word.
/\<meta[^\>]+(http\-equiv[^\>]+?refresh[^\>]+?(?<!\-)(?<!\d)[0-9]\d*[^\>]+?url[^\>]+?|(?<!\-)(?<!\d)[0-9]\d*[^\>]+?url[^\>]+?http\-equiv[^\>]+?refresh[^\>]+?)\/?\>/is
This matches the following (http-equiv and url in any order):
<meta http-equiv="refresh" content="21;URL='http://example.com/'" />
<meta content="21;URL='http://example.com/'" http-equiv="refresh" />
I want to exclude any URL that contains ?PageSpeed=noscript, for example:
a. <meta content="21;URL='http://example.com/?PageSpeed=noscript'" http-equiv="refresh" />
b. <meta content="21;URL='http://example.com/segment?PageSpeed=noscript&var=value'" http-equiv="refresh" />
Any ideas are much appreciated. Thanks.
You may use a DOM parser instead of regex.
<?php
// Two sample tags: one plain redirect and one carrying ?PageSpeed=noscript
$meta = '<meta content="21;URL=\'http://example.com/\'" http-equiv="refresh" /> <meta content="21;URL=\'http://example.com/?PageSpeed=noscript\'" http-equiv="refresh" />';

$dom = new DOMDocument;
$dom->loadHTML($meta);

$noPageScripts = [];
foreach ($dom->getElementsByTagName('meta') as $tag) {
    $content = $tag->getAttribute('content');
    // Extract the URL from the content attribute
    preg_match('/URL=["\']?([^"\'>]+)["\']?/i', $content, $matches);
    // Keep only refresh tags whose URL does not contain ?PageSpeed=noscript
    // (the check is narrowed to http-equiv="refresh", as in the question's pattern)
    if (strcasecmp($tag->getAttribute('http-equiv'), 'refresh') === 0
        && isset($matches[1])
        && stripos($matches[1], '?PageSpeed=noscript') === false) {
        $noPageScripts[] = [
            'originalTag' => $dom->saveHTML($tag),
            'url' => $matches[1]
        ];
    }
}
var_dump($noPageScripts);
In my approach I rewrote the whole pattern for better performance, so it behaves a bit differently. Basically, add a negative lookahead to prevent matching the disallowed text at a point where most of the matching is already done; e.g. I put it after http
-> http(?!\S*?pagespeed=noscript)
The \S*? matches any amount of non-whitespace characters lazily. See the SO regex FAQ.
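To see the negative lookahead on its own, here is a minimal sketch. The simplified pattern and the sample URLs (including the extra ?other=1 one) are assumptions made for illustration, not part of the original answer.

```php
<?php
// Negative lookahead placed right after "http": the match is rejected when
// "pagespeed=noscript" appears anywhere in the same run of non-whitespace
// characters that follows. The /i flag makes "PageSpeed" match as well.
$pattern = '/http(?!\S*?pagespeed=noscript)\S*/i';

$urls = [
    'http://example.com/',
    'http://example.com/?PageSpeed=noscript',
    'http://example.com/segment?PageSpeed=noscript&var=value',
    'http://example.com/?other=1', // hypothetical extra sample
];

$allowed = [];
foreach ($urls as $url) {
    // preg_match returns 1 when the pattern matches, 0 when it does not
    if (preg_match($pattern, $url) === 1) {
        $allowed[] = $url;
    }
}

print_r($allowed);
```

With these inputs, only the first and the last URL should end up in $allowed, because the other two contain the excluded parameter.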
And the full pattern I experimented with:
/<meta\s(?=[^><]*?http-equiv[^\w><]+refresh)[^><]*?url=[\s\'\"]*(http(?!\S*?pagespeed=noscript)[^><\s\"\']*)[^><]*>/i
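A rough way to try this pattern is to run it over the sample tags from the question with preg_match_all. This is only a sketch: the variable names $html and $m are illustrative, and the tags are simply concatenated into one string.

```php
<?php
// The pattern from the answer, written as a single-quoted PHP string.
// \' becomes a literal quote for PHP; all other escapes reach PCRE unchanged.
$pattern = '/<meta\s(?=[^><]*?http-equiv[^\w><]+refresh)[^><]*?url=[\s\'\"]*(http(?!\S*?pagespeed=noscript)[^><\s\"\']*)[^><]*>/i';

// Sample tags taken from the question: two plain redirects (both attribute
// orders) and two redirects that carry ?PageSpeed=noscript.
$html = <<<'HTML'
<meta http-equiv="refresh" content="21;URL='http://example.com/'" />
<meta content="21;URL='http://example.com/'" http-equiv="refresh" />
<meta content="21;URL='http://example.com/?PageSpeed=noscript'" http-equiv="refresh" />
<meta content="21;URL='http://example.com/segment?PageSpeed=noscript&var=value'" http-equiv="refresh" />
HTML;

// $m[0] holds the full tags that matched, $m[1] the captured URLs
preg_match_all($pattern, $html, $m);

print_r($m[1]);
```

Here only the first two tags should match, so $m[1] holds http://example.com/ twice, and the two PageSpeed tags are skipped.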
Another addition is to use a positive lookahead for matching http-equiv... so that the pattern is independent of attribute order. It is similar to this regex pattern, which I posted a long time ago in the comments on PHP.net.
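The order independence comes from the lookahead not consuming any characters. A small sketch of just that part, assuming a reduced pattern and made-up sample tags (neither is from the original answer):

```php
<?php
// Only the lookahead part of the idea: after "<meta ", assert that
// http-equiv ... refresh occurs somewhere before the closing ">".
// The lookahead consumes nothing, so the attribute may sit anywhere in the tag.
$pattern = '/<meta\s(?=[^><]*?http-equiv[^\w><]+refresh)[^><]*>/i';

$tags = [
    '<meta http-equiv="refresh" content="5;URL=http://example.com/">',
    '<meta content="5;URL=http://example.com/" http-equiv="refresh">',
    '<meta charset="utf-8">',
    '<meta http-equiv="content-type" content="text/html">',
];

// preg_grep keeps the entries that match; array_values reindexes the result
$refreshTags = array_values(preg_grep($pattern, $tags));

print_r($refreshTags);
```

Both refresh tags should be kept regardless of attribute order, while the charset and content-type tags are dropped.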