What's wrong with my PHP regex?

Question

I'm trying to pull a specific link from a feed where all of the content is on one line and there are multiple links present. The one I want has the content of "[link]" in the A tag. Here's my example:

<a href="google.com/">test1</a> <a href="google.com/">test2</a> <a href="http://www.amazingpage.com/">[link]</a> <a href="google.com/">test3</a><a href="google.com/">test4</a>
... could be more links before and/or after

How do I isolate just the href with the content "[link]"?

This regex goes to the correct end of the block I want, but starts at the first link:

(?<=href\=\").*?(?=\[link\])

Any help would be greatly appreciated! Thanks.

Answer 1

Try this updated regex:

(?<=href\=\")[^<]*?(?=\">\[link\])

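The problem is that the dot matches any character, including <, so the match can start at the first href and run across tag boundaries. To get the right href, restrict the quantified part to [^<]*?, which cannot leave the current tag, and require "> directly before [link] in the lookahead.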
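
A minimal sketch of how this pattern could be applied from PHP, assuming the feed line from the question is held in a string. The variable names $feed, $pattern and $href are illustrative and not part of the original answer:

```php
<?php
// Sample input copied from the question; $feed is an illustrative name.
$feed = '<a href="google.com/">test1</a> <a href="google.com/">test2</a> '
      . '<a href="http://www.amazingpage.com/">[link]</a> '
      . '<a href="google.com/">test3</a><a href="google.com/">test4</a>';

// The lookbehind fixes the start right after href=", [^<]*? cannot cross
// into another tag, and the lookahead requires ">[link] to follow directly.
$pattern = '/(?<=href=")[^<]*?(?=">\[link\])/';

// preg_match() returns 1 on a match; $m[0] then holds the matched text.
$href = preg_match($pattern, $feed, $m) === 1 ? $m[0] : null;
echo $href, PHP_EOL;
```

Because both lookarounds are zero-width, the whole match is the URL itself, so no capture group is needed.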

Answer 2

Alternatively :)

This code:

$string = '<a href="google.com/">test1</a> <a href="google.com/">test2</a> <a href="http://www.amazingpage.com/">[link]</a> <a href="google.com/">test3</a><a href="google.com/">test4</a>';
$regex = '/href="([^"]*)">\[link\]/i';
$result = preg_match($regex, $string, $matches);
var_dump($matches);

Will return:

array(2) {
  [0] =>
  string(41) "href="http://www.amazingpage.com/">[link]"
  [1] =>
  string(27) "http://www.amazingpage.com/"
}

Answer 3

You can avoid regular expressions altogether and use the DOM instead.

// loadHTML() is an instance method, so create the document first
$doc = new DOMDocument();
$doc->loadHTML('
     <a href="google.com/">test1</a>
     <a href="google.com/">test2</a>
     <a href="http://www.amazingpage.com/">[link]</a>
     <a href="google.com/">test3</a>
     <a href="google.com/">test4</a>
');

foreach ($doc->getElementsByTagName('a') as $link) {
   if ($link->nodeValue == '[link]') {
     echo $link->getAttribute('href');
   }
}

Answer 4

With DOMDocument and XPath:

$dom = new DOMDocument();
$dom->loadHTML($yourHTML);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//a[. = "[link]"]/@href') as $node) {
    echo $node->nodeValue;
}

or if you are looking for only one result:

$dom = new DOMDocument();
$dom->loadHTML($yourHTML);
$xpath = new DOMXPath($dom);

$nodeList = $xpath->query('//a[. = "[link]"][1]/@href');
if ($nodeList->length) {
    echo $nodeList->item(0)->nodeValue;
}

xpath query details:

//a              # 'a' tag everywhere in the DOM tree
[. = "[link]"]   # (condition) which has "[link]" as value 
/@href           # "href" attribute

The reason your regex pattern doesn't work:

The regex engine walks the string from left to right and, at each position, tries to make the pattern succeed. So even with a non-greedy quantifier you always get the leftmost match: a lazy .*? only shortens where the match ends, it never moves where the match starts.
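
The leftmost-match behaviour can be observed directly. The following is a small sketch, assuming a shortened version of the question's input with just two links; $feed, $original, $start and $text are illustrative names:

```php
<?php
// Shortened version of the question's input; $feed is an illustrative name.
$feed = '<a href="google.com/">test1</a> '
      . '<a href="http://www.amazingpage.com/">[link]</a>';

// The pattern from the question: a lazy .*? between the two lookarounds.
$original = '/(?<=href=").*?(?=\[link\])/';
preg_match($original, $feed, $m, PREG_OFFSET_CAPTURE);

// With PREG_OFFSET_CAPTURE, $m[0] is [matched text, byte offset].
$text  = $m[0][0];
$start = $m[0][1];

// The match begins right after the FIRST href=" (offset 9) and extends up
// to [link], because the first start position that can succeed wins.
echo $start, ': ', $text, PHP_EOL;
```

The lazy quantifier stops at the earliest possible end, but the start was already fixed at the first href, which is why the result still contains the unwanted link.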

4 answers

solution1
3 ACCEPTED 2015-03-01 23:50:20

solution2
2 2015-03-01 23:55:32

solution3
1 2015-03-02 00:00:04

solution4
1 2015-03-02 00:12:27
