Grabbing the href attribute of an A element
I'm trying to find the links on a page.
My regex is:
/<a\s[^>]*href=(\"\'??)([^\"\' >]*?)[^>]*>(.*)<\/a>/
but it seems to fail on
<a title="this" href="that">what?</a>
How would I change my regex to deal with an href that is not placed first in the a tag?
Reliable regexes for HTML are difficult. Here is how to do it with DOM:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
echo $dom->saveHtml($node), PHP_EOL;
}
The above would find and output the "outerHTML" of all A elements in the $html string.
To get all the text values of the node, you do
echo $node->nodeValue;
To check if the href attribute exists you can do
echo $node->hasAttribute( 'href' );
To get the href attribute you'd do
echo $node->getAttribute( 'href' );
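The calls above can be combined into one small routine. The following is only a sketch: the helper name extractLinks and the sample HTML are made up for illustration, and the libxml_use_internal_errors() wrapper is an addition that assumes you want parser warnings on untidy HTML kept quiet.

```php
<?php
// Hypothetical helper (not part of the answer): collect href and text of every <a>.
function extractLinks(string $html): array
{
    $dom = new DOMDocument;
    // Assumption: real-world HTML is often invalid, so buffer libxml warnings
    // instead of letting loadHTML() print them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $node) {
        // Skip anchors that carry no href at all (e.g. named anchors).
        if ($node->hasAttribute('href')) {
            $links[] = [
                'href' => $node->getAttribute('href'),
                'text' => $node->nodeValue,
            ];
        }
    }
    return $links;
}

$links = extractLinks('<p><a title="this" href="that">what?</a> <a name="top">no href</a></p>');
print_r($links); // one entry: href "that", text "what?"
```

Attribute order does not matter here, which is exactly the case the regex in the question trips over.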
To change the href attribute you'd do
$node->setAttribute('href', 'something else');
To remove the href attribute you'd do
$node->removeAttribute('href');
You can also query for the href attribute directly with XPath
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');
foreach($nodes as $href) {
echo $href->nodeValue; // echo current attribute value
$href->nodeValue = 'new value'; // set new attribute value
$href->parentNode->removeAttribute('href'); // remove attribute
}
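If you only need the attribute values, the XPath query can feed a plain array. This is a minimal sketch with made-up sample HTML; the variable name $hrefs is an assumption, not something from the answer.

```php
<?php
// Sample input (illustrative only): two anchors with href, one without.
$html = '<ul><li><a href="/one">One</a></li>'
      . '<li><a id="x">No link</a></li>'
      . '<li><a class="c" href="/two">Two</a></li></ul>';

$dom = new DOMDocument;
$previous = libxml_use_internal_errors(true); // keep parser warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// '//a/@href' selects the attribute nodes themselves, so anchors
// without an href simply do not appear in the result.
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs); // "/one" and "/two"
```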
On a side note: I am sure this is a duplicate and you can find the answer somewhere in here.
I agree with Gordon, you MUST use an HTML parser to parse HTML. But if you really want a regex, you can try this one:
/^<a.*?href=(["\'])(.*?)\1.*$/
This matches <a at the beginning of the string, followed by any number of any characters (non-greedy) .*?, then href= followed by the link surrounded by either " or '.
$str = '<a title="this" href="that">what?</a>';
preg_match('/^<a.*?href=(["\'])(.*?)\1.*$/', $str, $m);
var_dump($m);
Output:
array(3) {
[0]=>
string(37) "<a title="this" href="that">what?</a>"
[1]=>
string(1) """
[2]=>
string(4) "that"
}
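To see what the \1 backreference and the ^ anchor each do, here is a small sketch that runs the same pattern over a few inputs. The sample strings are invented for illustration.

```php
<?php
$pattern = '/^<a.*?href=(["\'])(.*?)\1.*$/';

$samples = [
    '<a title="this" href="that">what?</a>',   // href not first: still matches
    "<a href='say \"hi\"'>single quoted</a>",  // \1 must be ', so the inner " is kept
    'text before <a href="that">what?</a>',    // ^ anchor: tag is not at the start
];

$results = [];
foreach ($samples as $sample) {
    // Group 2 holds the link; null marks "no match".
    $results[] = preg_match($pattern, $sample, $m) ? $m[2] : null;
}

var_dump($results); // "that", 'say "hi"', NULL
```

Because of the ^ anchor, this pattern is suited to strings that are a single anchor tag, not to scanning a whole page.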
The pattern you want to look for would be the link anchor pattern, like (something):
$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";
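As a rough sketch of how this pattern behaves (the sample strings are assumptions for illustration): it captures the link and the text when href directly follows "<a ", but it does not match the tag from the question, where title comes first.

```php
<?php
$regex_pattern = "/<a href=\"(.*)\">(.*)<\/a>/";

// href is the first attribute: the pattern matches.
$simple = '<a href="http://example.com">Example</a>';
$matchedSimple = preg_match($regex_pattern, $simple, $m);
echo $m[1], ' => ', $m[2], PHP_EOL; // http://example.com => Example

// The tag from the question: another attribute precedes href, so no match.
$question = '<a title="this" href="that">what?</a>';
$matchedQuestion = preg_match($regex_pattern, $question);
var_dump($matchedQuestion); // int(0)
```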
Why don't you just match
"<a.*?href\s*=\s*['"](.*?)['"]"
<?php
$str = '<a title="this" href="that">what?</a>';
$res = array();
preg_match_all("/<a.*?href\s*=\s*['\"](.*?)['\"]/", $str, $res);
var_dump($res);
?>
then
$ php test.php
array(2) {
[0]=>
array(1) {
[0]=>
string(27) "<a title="this" href="that""
}
[1]=>
array(1) {
[0]=>
string(4) "that"
}
}
which works. I've just removed the first capturing parentheses.
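Since preg_match_all() collects every match, the same pattern can be tried on a string with several links. This is a sketch with an invented input; note that ['\"] on both sides does not force the closing quote to be the same kind as the opening one.

```php
<?php
// Illustrative input: one double-quoted and one single-quoted href.
$str = '<a href="one">1</a> <a class="x" href=\'two\'>2</a>';

$res = array();
preg_match_all("/<a.*?href\s*=\s*['\"](.*?)['\"]/", $str, $res);

// $res[1] holds the captured href values in document order.
print_r($res[1]); // "one" and "two"
```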
For anyone who still hasn't found a solution, here is a very easy and fast one using SimpleXML:
$a = new SimpleXMLElement('<a href="www.something.com">Click here</a>');
echo $a['href']; // will echo www.something.com
It's working for me.
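One caveat worth sketching: SimpleXML expects well-formed XML, so it suits fragments you control more than arbitrary HTML. In the example below the sample strings and variable names are assumptions; the second input has an unclosed br element and is rejected.

```php
<?php
// Well-formed fragment: attribute and text are available after a string cast.
$a = new SimpleXMLElement('<a title="this" href="that">what?</a>');
$href = (string) $a['href'];
$text = (string) $a;
echo $href, ' => ', $text, PHP_EOL; // that => what?

// Typical HTML that is not well-formed XML (unclosed <br>):
// the constructor throws instead of returning an element.
$previous = libxml_use_internal_errors(true); // keep parser warnings quiet
try {
    new SimpleXMLElement('<a href="that">unclosed <br></a>');
    $parsed = true;
} catch (Exception $e) {
    $parsed = false;
} finally {
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
}
var_dump($parsed); // bool(false)
```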
I'm not sure what you're trying to do here, but if you're trying to validate the link then look at PHP's filter_var().
If you really need to use a regular expression then check out this tool, it may help: http://regex.larsolavtorvik.com/
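For the validation route, a minimal sketch with filter_var() and FILTER_VALIDATE_URL might look like this. The candidate values are made up; note that relative hrefs, which are perfectly legal in HTML, do not pass this filter because they have no scheme.

```php
<?php
// Illustrative href values as they might come out of a page.
$candidates = [
    'http://example.com/page?x=1', // absolute URL
    'that',                        // the href from the question
    '/relative/path',              // relative link
];

$valid = [];
foreach ($candidates as $candidate) {
    // filter_var() returns the value on success and false on failure.
    $valid[$candidate] = filter_var($candidate, FILTER_VALIDATE_URL) !== false;
}

var_dump($valid); // true, false, false
```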
Using your regex, I modified it a bit to suit your need.
<a.*?href=("|')(.*?)("|').*?>(.*)<\/a>
I personally suggest you use an HTML parser.
EDIT: Tested
Quick test: <a\s+[^>]*href=(\"\'??)([^\1]+)(?:\1)>(.*)<\/a>
seems to do the trick, with the 1st match being " or ', the second the 'href' value 'that', and the third the 'what?'.
The reason I left the first match of "/' in there is that you can use it as a backreference later for the closing "/', so it's the same.
See live example on: http://www.rubular.com/r/jsKyK2b6do
preg_match_all("/(<a[^>]*>)(.*?)(<\/a)/", $contents, $impmatches, PREG_SET_ORDER);
It is tested, and it fetches all a tags from any HTML code.
The following is working for me and returns both the href and the value of the anchor tag.
preg_match_all("'\<a.*?href=\"(.*?)\".*?\>(.*?)\<\/a\>'si", $html, $match);
if($match) {
foreach($match[0] as $k => $e) {
$urls[] = array(
'anchor' => $e,
'href' => $match[1][$k],
'value' => $match[2][$k]
);
}
}
The multidimensional array called $urls now contains associative sub-arrays that are easy to use.
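As a usage sketch, here is the same pattern run end to end on an invented two-link input, followed by a loop over the resulting $urls array. Initialising $urls up front is an addition so the loop is safe when nothing matches.

```php
<?php
// Illustrative input; the second anchor uses upper case to exercise the i flag.
$html = '<p><a title="this" href="that">what?</a> and <A HREF="other">more</A></p>';

$urls = []; // assumption: start with an empty list so later loops never fail
if (preg_match_all("'\<a.*?href=\"(.*?)\".*?\>(.*?)\<\/a\>'si", $html, $match)) {
    foreach ($match[0] as $k => $e) {
        $urls[] = array(
            'anchor' => $e,             // the full <a ...>...</a> match
            'href'   => $match[1][$k],  // first capture group
            'value'  => $match[2][$k],  // second capture group
        );
    }
}

foreach ($urls as $url) {
    echo $url['href'], ' => ', $url['value'], PHP_EOL;
}
// that => what?
// other => more
```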