
Regex Pattern (…)+ not matching multiple times

I'm building a PHP script that will sift through the HTML contents of a cURL request and match patterns for URLs so that I can manipulate them and add a GET tag to track outbound links.

I have the Regex pattern that's working, but I can't get it to match more than once; it won't even find a duplicate of the item it does match.

This is the sample HTML, which is currently only matching the first Anchor tag:

`<html><head>
 <title></title>
</head>
<body class="body class">
 <div>
   <a title="1hubwhrrstn" href="http://www.example.com?tag=9qgbc"></a>
   <a name=""></a>
   <a class="3hubwhbbsrstn" href="http://www.example.com?tag=uqgibc"></a>
   <a class="4whbihbw4bsetrrstn" href="http://www.example.com?tag=9uq4i"></a>
   <a href="http://www.example.com?tag=9uq4i" class="4whbihbstn"></a>
 </div></body>
</html>`

The regex pattern I'm using is /(<a.*href=".*".*><\/a>)+/im , and it's only matching the first anchor instance.

Also, I can't find a way to tell it to match across new lines or all on one line. It gives me one match that runs multiple anchor tags together when they're on the same line, even though I'm using a capturing group to limit the pattern to one anchor tag. So in this case it's finding one match, even for the doubled anchors on the same line:

`<html><head>
 <title></title>
</head>
<body class="body class">
 <div>
   <a title="1tn" href="http://www.example.com"></a><a class="3htn" href="http://www.example.com"></a>
   <a name=""></a>
   <a class="4whbihbw4bsetrrstn" href="http://www.example.com?tag=9uq4i"></a>
   <a href="http://www.example.com?tag=9uq4i" class="4whbihbstn"></a>
 </div></body>
</html>`

I've spent two hours tinkering and double-checking flags and quantifiers, testing as I go on regex101.com, and I can't figure out where I'm making a mistake.

Any help would be great. Thanks so much!

Your regex /(<a.*href=".*".*><\/a>)+/im is greedy: each .* grabs as much of the line as it can. To make it less greedy, you can reject any match that has < between the opening and closing anchor tags:

/(<a.*href=".*".*>[^<]*<\/a>)+/im

This also addresses another potential problem: anchor tags with no content are unusual, and this pattern matches any content between the tags as long as it's not another tag. (Of course, nesting other tags inside an anchor is allowed in HTML, so this solution may not be sufficient.)
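As a rough check of the greediness problem, here is a minimal sketch that runs three patterns against the same-line sample from the question. The first two are the patterns quoted above. The third is an assumption of mine, not part of the answer: it swaps each .* for a negated character class ([^>]* and [^"]*) so a match cannot run past the end of its own tag. The array keys and variable names are made up for illustration.

```php
<?php
// The line from the question where two anchors sit next to each other.
$line = '<a title="1tn" href="http://www.example.com"></a>'
      . '<a class="3htn" href="http://www.example.com"></a>';

$patterns = [
    // Pattern from the question.
    'question' => '/(<a.*href=".*".*><\/a>)+/im',
    // Pattern suggested in the answer.
    'answer'   => '/(<a.*href=".*".*>[^<]*<\/a>)+/im',
    // Assumed alternative: negated classes cannot cross a tag boundary.
    'negated'  => '/<a\b[^>]*\bhref="[^"]*"[^>]*>[^<]*<\/a>/i',
];

$counts = [];
foreach ($patterns as $name => $pattern) {
    // preg_match_all() returns the number of full matches found.
    $counts[$name] = preg_match_all($pattern, $line, $matches);
    echo $name . ': ' . $counts[$name] . "\n";
}
```

On this input the first two patterns each report a single match that spans both anchors, because every .* still stretches to the last possible position on the line. Only the negated-class version reports two matches. A lazy quantifier (.*?) is another common way to rein in .*, but I have not worked through it against this sample.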

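Also, I'm not so sure you need the m modifier at the end. It only changes ^ and $ so that they match at the start and end of each line, and your pattern uses neither. If you wanted . to match across line breaks you would need the s modifier instead, but it seems your matching patterns are all on a single line.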
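To check what the m modifier does in PCRE, here is a small sketch using a made-up two-line string. It suggests that m changes where ^ and $ can match, while s is the modifier that lets . cross a newline. The variable names are illustrative only.

```php
<?php
// Hypothetical two-line subject, not taken from the question.
$text = "first\nsecond";

// Without m, ^ matches only at the very start of the subject.
$plain = preg_match_all('/^\w+/', $text, $a);

// With m, ^ also matches immediately after each newline.
$multi = preg_match_all('/^\w+/m', $text, $b);

// Without s, the dot refuses to match the newline between the words.
$noDotAll = preg_match('/first.second/', $text);

// With s, the dot matches the newline too.
$dotAll = preg_match('/first.second/s', $text);

echo "plain=$plain multi=$multi noDotAll=$noDotAll dotAll=$dotAll\n";
```

Since the pattern in the question contains neither ^ nor $, adding or removing m should not change what it matches.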

I'm guessing you are using preg_match()? It stops after the first match. Use preg_match_all() to do a global regex match, since PHP's preg functions have no g modifier.
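The difference between the two functions can be sketched as follows, using markup adapted from the second sample in the question. The pattern here is an assumption for illustration (negated character classes plus a capture group for the href value), not the pattern from the question or the answer.

```php
<?php
// Markup adapted from the question; two anchors share the first line.
$html = <<<'HTML'
<div>
  <a title="1tn" href="http://www.example.com"></a><a class="3htn" href="http://www.example.com"></a>
  <a name=""></a>
  <a class="4whbihbw4bsetrrstn" href="http://www.example.com?tag=9uq4i"></a>
  <a href="http://www.example.com?tag=9uq4i" class="4whbihbstn"></a>
</div>
HTML;

// Assumed pattern: each part is confined to one tag, group 1 holds the href.
$pattern = '/<a\b[^>]*\bhref="([^"]*)"[^>]*>[^<]*<\/a>/i';

// preg_match() returns after the first match (1 if found, 0 if not).
$first = preg_match($pattern, $html, $single);

// preg_match_all() keeps scanning and returns the total number of matches.
$count = preg_match_all($pattern, $html, $all);

echo "preg_match: $first\n";
echo "preg_match_all: $count\n";
print_r($all[1]); // the captured href values
```

With this sample, preg_match() returns 1 and preg_match_all() returns 4. The `<a name="">` anchor has no href, so it is skipped.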


 