
Regex to get href value of links that do not have rel='nofollow'

I have a string that contains HTML link tags, and I need to use PHP's preg_match_all to get the href value of each tag, but only if the tag does not have a rel='nofollow' attribute. I found the following expression, which gets the href value of all the links.

$regex= "/<a\s[^>]*href=([\"\']??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU";

How can I modify it to only get the links I want? Here is what it should look like:

$string= "<a href='link1.php'>Link</a>";
$string.= "<a href='link2.php'>Link2</a>";
$string.= "<a href='link3.php' rel='nofollow'>Link3</a>";
$string.= "<a href='link4.php'>Link4</a>";

preg_match_all($regex, $string, $links);

so links should be:

$links[0] => 'link1.php';
$links[1] => 'link2.php';
$links[2] => 'link4.php';

I need the expression to pick up links that use either single or double quotes. A bonus would be to pick up ill-formatted but still valid links. If it's not possible to get just the links I want, then I need a way to find the links I don't want and remove them from the array. Note that the string is generated dynamically, may not have the same attribute order, and will contain other tags and characters besides the links.

@revo is correct: this is not a job for regular expressions. Use a proper HTML parser to deconstruct the HTML, and then an XPath query to find the information you need.

$html = <<<HTML
<html>
<head>
<title>Example</title>
</head>
<body>
<a href='link1.php'>Link</a>
<a href="link's 2.php" class="link">Link2</a>
<a class="link" href='link3.php' rel='nofollow'>Link3</a>
<a href='link4.php'><span>Link4</span></a>
</body>
</html>
HTML;

$doc = new DOMDocument();
$valid = $doc->loadHTML($html);
$result = [];
if ($valid) {
  $xpath = new DOMXPath($doc);
  // find any <a> elements that do not have a rel="nofollow" attribute,
  // then pick up their href attribute
  $elements = $xpath->query("//a[not(@rel='nofollow')]/@href");
  // query() returns false (not null) when the expression is invalid
  if ($elements !== false) {
    foreach ($elements as $element) {
      $result[] = $element->nodeValue;
    }
  }
}
print_r($result);
# => Array
#    (
#        [0] => link1.php
#        [1] => link's 2.php
#        [2] => link4.php
#    )
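
One caveat with the query above: @rel='nofollow' only matches when the attribute value is exactly nofollow, so a link with rel="nofollow noopener" or rel="NoFollow" would still be returned. The following is a minimal sketch of one way to handle that, assuming rel is treated as a case-insensitive, space-separated token list. The function name followableHrefs is hypothetical and not part of the original answer. It also uses libxml_use_internal_errors() so that sloppy markup does not produce PHP warnings.

```php
<?php
// Hypothetical helper (not from the original answer): returns the href values
// of <a> elements whose rel attribute does not contain the "nofollow" token.
function followableHrefs(string $html): array
{
    // DOMDocument::loadHTML() rejects an empty string, so bail out early.
    if ($html === '') {
        return [];
    }

    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return [];
    }

    $xpath = new DOMXPath($doc);
    // translate() lowercases the letters that occur in "nofollow";
    // padding the normalized value with spaces lets contains() match
    // "nofollow" only as a whole space-separated token.
    $query = "//a[@href][not(contains("
        . "concat(' ', normalize-space(translate(@rel, 'NOFLW', 'noflw')), ' '),"
        . " ' nofollow '))]/@href";

    $hrefs = [];
    foreach ($xpath->query($query) as $attr) {
        $hrefs[] = $attr->nodeValue;
    }
    return $hrefs;
}

// Usage with a variation of the string from the question:
$string = "<a href='link1.php'>Link</a>"
    . '<a href="link2.php" rel="noopener">Link2</a>'
    . "<a href='link3.php' rel='NoFollow noopener'>Link3</a>"
    . "<a href='link4.php'>Link4</a>";
print_r(followableHrefs($string));
```

With this input, link3.php is skipped because its rel value contains the nofollow token, while link2.php is kept because noopener alone does not exclude it.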

