Grabbing the href attribute of an A element

Question

I'm trying to find the links on a page.

My regex is:

/<a\s[^>]*href=(\"\'??)([^\"\' >]*?)[^>]*>(.*)<\/a>/

but it seems to fail on

<a title="this" href="that">what?</a>

How would I change my regex to deal with an href that is not placed first in the a tag?

Answer 1

Reliable regexes for HTML are difficult. Here is how to do it with DOM:

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
    echo $dom->saveHtml($node), PHP_EOL;
}

The above would find and output the "outerHTML" of all A elements in the $html string.

To get the text content of a node, you do:

echo $node->nodeValue;

To check whether the href attribute exists, you can do:

echo $node->hasAttribute( 'href' );

To get the href attribute, you'd do:

echo $node->getAttribute( 'href' );

To change the href attribute, you'd do:

$node->setAttribute('href', 'something else');

To remove the href attribute, you'd do:

$node->removeAttribute('href');
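Putting the individual calls above together, a minimal sketch that collects every link into an array might look like the following. The sample $html string and the $links array are assumptions made for illustration, and the libxml error handling is an addition that only keeps parser warnings for imperfect markup off the output.

```php
<?php
// Hypothetical sample input; any HTML string would do.
$html = '<p><a title="this" href="that">what?</a> <a name="top">no link</a></p>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$links = array();
foreach ($dom->getElementsByTagName('a') as $node) {
    // Skip anchors that carry no href attribute.
    if (!$node->hasAttribute('href')) {
        continue;
    }
    $links[] = array(
        'href' => $node->getAttribute('href'), // attribute value
        'text' => $node->nodeValue,            // text content of the element
    );
}

print_r($links);
```

Because the attribute is read through the DOM, its position inside the tag does not matter.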

You can also query for the href attribute directly with XPath:

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');
foreach($nodes as $href) {
    echo $href->nodeValue;                       // echo current attribute value
    $href->nodeValue = 'new value';              // set new attribute value
    $href->parentNode->removeAttribute('href');  // remove attribute
}
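The loop above shows reading, setting and removing in one place. As a read-only sketch, the same query can fill an array, and an XPath predicate can narrow the selection. The sample markup, the variable names, and the choice of starts-with() as a filter are assumptions for illustration.

```php
<?php
// Hypothetical sample input with one absolute link and one fragment link.
$html = '<p><a href="http://example.com/">absolute</a> <a href="#top">fragment</a></p>';

libxml_use_internal_errors(true);   // keep parser warnings off the output
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Every href attribute of every a element, in document order.
$all = array();
foreach ($xpath->query('//a/@href') as $attr) {
    $all[] = $attr->nodeValue;
}

// Only hrefs whose value starts with "http" (XPath 1.0 starts-with()).
$absolute = array();
foreach ($xpath->query('//a[starts-with(@href, "http")]/@href') as $attr) {
    $absolute[] = $attr->nodeValue;
}

print_r($all);
print_r($absolute);
```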

Also see:

Best methods to parse HTML
DOMDocument in php

On a side note: I am sure this is a duplicate and you can find the answer somewhere in here.

Answer 2

I agree with Gordon, you MUST use an HTML parser to parse HTML. But if you really want a regex, you can try this one:

/^<a.*?href=(["\'])(.*?)\1.*$/

This matches <a at the beginning of the string, followed by any number of any characters (non-greedy) .*?, then href= followed by the link surrounded by either " or '.

$str = '<a title="this" href="that">what?</a>';
preg_match('/^<a.*?href=(["\'])(.*?)\1.*$/', $str, $m);
var_dump($m);

Output:

array(3) {
  [0]=>
  string(37) "<a title="this" href="that">what?</a>"
  [1]=>
  string(1) """
  [2]=>
  string(4) "that"
}
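One consequence of the ^ anchor is worth seeing in practice: the pattern only applies when the subject string itself starts with <a. The sketch below reuses the pattern from this answer; the second subject string and the variable names are made up for illustration.

```php
<?php
$pattern = '/^<a.*?href=(["\'])(.*?)\1.*$/';

// The subject begins with the opening <a, so the pattern matches.
$atStart = preg_match($pattern, '<a title="this" href="that">what?</a>', $m1);

// Here the anchor is wrapped in a <p>, so ^ prevents any match.
$nested = preg_match($pattern, '<p><a title="this" href="that">what?</a></p>', $m2);

var_dump($atStart, $nested);
```

For a whole page rather than a single tag, the anchors would have to be dropped, or a parser used instead.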

Answer 3

The pattern you want to look for would be the link anchor pattern, something like:

$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";

Answer 4

Why don't you just match

"<a.*?href\s*=\s*['"](.*?)['"]"

<?php

$str = '<a title="this" href="that">what?</a>';

$res = array();

preg_match_all("/<a.*?href\s*=\s*['\"](.*?)['\"]/", $str, $res);

var_dump($res);

?>

then

$ php test.php
array(2) {
  [0]=>
  array(1) {
    [0]=>
    string(27) "<a title="this" href="that""
  }
  [1]=>
  array(1) {
    [0]=>
    string(4) "that"
  }
}

which works. I've just removed the first capture group.
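As a quick check of the same pattern against more than one link, here is a small sketch. The input string is invented for illustration; it mixes quote styles and puts spaces around the = sign, which the \s* parts of the pattern are there to tolerate.

```php
<?php
// Hypothetical input: two anchors, different quote styles, spaces around "=".
$str = '<a href="one">1</a> <a class=\'x\' href = \'two\'>2</a>';

$res = array();
preg_match_all("/<a.*?href\s*=\s*['\"](.*?)['\"]/", $str, $res);

// $res[1] holds the captured href values, one per matched anchor.
print_r($res[1]);
```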

Answer 5

For anyone who still hasn't found a solution, this is very easy and fast with SimpleXML:

$a = new SimpleXMLElement('<a href="www.something.com">Click here</a>');
echo $a['href']; // will echo www.something.com

It's working for me.
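One caveat to keep in mind is that SimpleXML parses XML, so the input has to be well-formed. The sketch below contrasts the fragment from this answer with a variant containing an unclosed <br>, which is common in HTML but is not valid XML. The second input and the variable names are assumptions for illustration.

```php
<?php
// Collect parse errors internally instead of emitting warnings.
libxml_use_internal_errors(true);

// Well-formed fragment: parsing succeeds and the attribute is readable.
$ok = simplexml_load_string('<a href="www.something.com">Click here</a>');
$href = ($ok === false) ? null : (string) $ok['href'];

// An unclosed <br> makes the fragment ill-formed XML, so parsing fails.
$bad = simplexml_load_string('<a href="www.something.com">Click<br>here</a>');

var_dump($href, $bad);
```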

Answer 6

I'm not sure what you're trying to do here, but if you're trying to validate the link then look at PHP's filter_var().
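A minimal sketch of that validation, assuming the goal is to tell absolute URLs apart from other href values. The candidate strings and the $results array are made up for illustration.

```php
<?php
// Hypothetical href values: one absolute URL, one bare relative value.
$candidates = array('http://example.com/page?id=1', 'that');

$results = array();
foreach ($candidates as $href) {
    // filter_var() returns the value on success and false on failure.
    $results[$href] = filter_var($href, FILTER_VALIDATE_URL) !== false;
}

var_dump($results);
```

Note that a relative href such as "that" is a perfectly usable link in HTML even though it does not pass FILTER_VALIDATE_URL.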

If you really need to use a regular expression, then check out this tool, it may help: http://regex.larsolavtorvik.com/

Answer 7

Using your regex, I modified it a bit to suit your need.

<a.*?href=("|')(.*?)("|').*?>(.*)<\/a>

I personally suggest you use an HTML parser.

EDIT: Tested

Answer 8

Quick test: <a\s+[^>]*href=(\"\'??)([^\1]+)(?:\1)>(.*)<\/a> seems to do the trick, with the first match being " or ', the second the href value 'that', and the third the 'what?'.

The reason I left the first match of " or ' in there is that you can backreference it later for the closing " or ', so the two are the same.
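To illustrate the backreference idea in isolation, here is a hedged variation rather than the exact pattern above. It captures the opening quote with (["\']), takes the value lazily with (.*?), and closes with \1 so that only the same kind of quote ends the value. It deliberately avoids [^\1], since, as far as I know, PCRE does not treat \1 inside a character class as a backreference. The sample string and variable names are assumptions.

```php
<?php
// Opening quote is captured in group 1 and required again via \1.
$pattern = '~<a\s[^>]*href=(["\'])(.*?)\1~';

// Hypothetical input: a double-quoted href that contains a single quote.
$str = '<a title="this" href="it\'s">what?</a>';

preg_match($pattern, $str, $m);

// $m[1] is the quote character, $m[2] the href value.
var_dump($m[1], $m[2]);
```

Because the closing quote must equal the opening one, the single quote inside the value does not end the match early.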

See a live example at: http://www.rubular.com/r/jsKyK2b6do

Answer 9

preg_match_all("/(<a[^>]*>)(.*?)(<\/a)/", $contents, $impmatches, PREG_SET_ORDER);

It is tested, and it fetches all a tags from any HTML code.

Answer 10

The following works for me and returns both the href and the value of the anchor tag.

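$urls = array();  // initialise so $urls exists even when nothing matches
preg_match_all("'\<a.*?href=\"(.*?)\".*?\>(.*?)\<\/a\>'si", $html, $match);
if ($match) {
    foreach ($match[0] as $k => $e) {
        $urls[] = array(
            'anchor' => $e,              // the whole matched <a ...>...</a>
            'href'   => $match[1][$k],   // value of the href attribute
            'value'  => $match[2][$k]    // inner content of the anchor
        );
    }
}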

The multidimensional array called $urls now contains associative sub-arrays that are easy to use.

10 answers

Answer 1: 207, accepted, 2010-09-29 10:35:53
Answer 2: 19, 2010-09-29 11:43:02
Answer 3: 5, 2010-09-29 10:22:23
Answer 4: 3, 2010-09-29 10:21:13
Answer 5: 3, 2016-08-26 11:17:59
Answer 6: 2, 2010-09-29 10:25:32
Answer 7: 2, 2010-09-29 10:25:36
Answer 8: 1, 2010-09-29 10:23:22
Answer 9: 0, 2016-07-06 05:23:10
Answer 10: 0, 2019-01-22 12:54:27
