Regex Pattern (…)+ not matching multiple times

I'm building a PHP script that will sift through the HTML returned by a cURL request and match patterns for URLs so that I can add a GET parameter to track outbound links.

I have a Regex pattern that works, but I can't get it to match more than once; it won't even find a duplicate of the item it does match.

This is the sample HTML, in which only the first anchor tag is currently matched:

`<html><head>
 <title></title>
</head>
<body class="body class">
 <div>
   <a title="1hubwhrrstn" href="http://www.example.com?tag=9qgbc"></a>
   <a name=""></a>
   <a class="3hubwhbbsrstn" href="http://www.example.com?tag=uqgibc"></a>
   <a class="4whbihbw4bsetrrstn" href="http://www.example.com?tag=9uq4i"></a>
   <a href="http://www.example.com?tag=9uq4i" class="4whbihbstn"></a>
 </div></body>
</html>`

The Regex pattern I'm using is `(<a.*href=".*".*><\\/a>)+/im`, and it's only matching the first anchor instance.

Also, I can't find a way to tell it to match whether the anchors are on separate lines or all on one line. When several anchor tags sit on the same line, it gives me one match that runs them all together, even though I'm using a capturing group to limit the pattern to one anchor tag. So in this case it finds a single match for the two anchors on the same line:

`<html><head>
 <title></title>
</head>
<body class="body class">
 <div>
   <a title="1tn" href="http://www.example.com"></a><a class="3htn" href="http://www.example.com"></a>
   <a name=""></a>
   <a class="4whbihbw4bsetrrstn" href="http://www.example.com?tag=9uq4i"></a>
   <a href="http://www.example.com?tag=9uq4i" class="4whbihbstn"></a>
 </div></body>
</html>`

I've spent two hours tinkering and double-checking flags and quantifiers, testing on regex101.com as I go, and I can't figure out where I'm making a mistake.

Any help would be great. Thanks so much!

Your regex `(<a.*href=".*".*><\\/a>)+/im` is greedy. To make it less greedy, you can reject any match that has a `<` between the opening and closing anchor tags:

(<a.*href=".*".*>[^<]*<\/a>)+/im

This also addresses another potential problem: anchor tags with no content are unusual, and this pattern matches any content in the tag as long as it isn't another tag. (Of course, HTML allows other tags inside an anchor, so this solution may not be sufficient.)
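One caveat: the `.*` parts of the pattern are still greedy within a line, so two anchors on the same line can still come back as a single match. The sketch below compares that pattern with a tighter variant built from negated character classes. The `$tight` pattern and the variable names are my own assumptions, not part of the original answer, and the input is the same-line pair from the question.

```php
<?php
// Two anchors on one line, as in the second sample from the question.
$html = '<a title="1tn" href="http://www.example.com"></a>'
      . '<a class="3htn" href="http://www.example.com"></a>';

// Greedy: each .* runs to the end of the line and backtracks, so one match
// spans from the first <a to the last </a>.
$greedy = '/<a.*href=".*".*>[^<]*<\/a>/i';

// Tighter variant (an assumption, not the answer's pattern): [^>]* cannot
// leave the opening tag and [^"]* cannot leave the attribute value.
$tight = '/<a\b[^>]*\bhref="[^"]*"[^>]*>[^<]*<\/a>/i';

$greedyCount = preg_match_all($greedy, $html, $greedyMatches);
$tightCount  = preg_match_all($tight, $html, $tightMatches);

echo "greedy: $greedyCount, tight: $tightCount\n"; // greedy: 1, tight: 2
```

Like any regex over HTML, the tighter pattern is still fragile (single-quoted attributes and nested tags would slip past it), so treat it as an illustration of the greediness issue only.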

Also, I'm not so sure you need the `m` modifier at the end. In PCRE it only changes `^` and `$` so that they match at line breaks, and your pattern uses neither. Letting `.` match across line breaks is the job of the `s` modifier, and since each of your anchors sits on a single line you don't need that either.
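To illustrate how the modifiers behave: in PCRE, `m` affects only `^` and `$`, while `s` is what lets `.` match a newline. The sketch below runs the question's pattern against a hypothetical anchor whose attributes are split over two lines (this input is made up for the demonstration).

```php
<?php
// Hypothetical input: one anchor whose attributes span two lines.
$html = "<a class=\"x\"\n   href=\"http://www.example.com\"></a>";

$body = '<a.*href=".*".*><\/a>';

// With /m the dot still stops at "\n", so href is never reached.
$withM = preg_match('/' . $body . '/im', $html);

// With /s the dot matches "\n" too, so the anchor is found.
$withS = preg_match('/' . $body . '/is', $html);

var_dump($withM, $withS); // int(0), int(1)
```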

I'm guessing you are using `preg_match()`? Use `preg_match_all()` to do a global regex match, since PHP has no `g` modifier for `preg_match()`.
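A minimal sketch of the difference, using a trimmed copy of the sample HTML from the question. The variable names are illustrative, and the capturing group and `+` are dropped because they are not needed to collect each anchor.

```php
<?php
$html = <<<'HTML'
<div>
   <a title="1hubwhrrstn" href="http://www.example.com?tag=9qgbc"></a>
   <a name=""></a>
   <a class="3hubwhbbsrstn" href="http://www.example.com?tag=uqgibc"></a>
   <a href="http://www.example.com?tag=9uq4i" class="4whbihbstn"></a>
</div>
HTML;

$pattern = '/<a.*href=".*".*><\/a>/i';

// preg_match() stops after the first match.
preg_match($pattern, $html, $first);

// preg_match_all() keeps scanning; $all[0] holds every full match.
$count = preg_match_all($pattern, $html, $all);

echo $first[0], "\n";
echo "total: $count\n"; // total: 3 (the <a name=""> line has no href)
foreach ($all[0] as $anchor) {
    echo $anchor, "\n";
}
```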
