Scrape anchor (<a>) HTML tags
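I need to scrape <a> tags from HTML.
My goal is to capture only the tags that have a valid link in their href attribute.
I think I am very close to the answer. This is the regular expression I wrote: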
<a .*href=("|').*\.asp("|').*?>.*?<\/a>
First problem:
Result:
<a id='topnavbtn_tutorials' href='javascript:void(0);' onclick='w3_open_nav("tutorials")' title='Tutorials'>TUTORIALS <i class='fa fa-caret-down'></i><i class='fa fa-caret-up' style='display:none'></i></a><a id='topnavbtn_references' href='javascript:void(0);' onclick='w3_open_nav("references")' title='References'>REFERENCES <i class='fa fa-caret-down'></i><i class='fa fa-caret-up' style='display:none'></i></a><a id='topnavbtn_examples' href='javascript:void(0);' onclick='w3_open_nav("examples")' title='Examples'>EXAMPLES <i class='fa fa-caret-down'></i><i class='fa fa-caret-up' style='display:none'></i></a><a href='/forum/default.asp'>FORUM</a>
I only need:
<a href='/forum/default.asp'>FORUM</a>
Second problem:
Result:
<a href='/html/default.asp' class='w3-hide-small' title='HTML Tutorial'>HTML</a><a href='/css/default.asp' class='w3-hide-small' title='CSS Tutorial'>CSS</a><a href='/js/default.asp' class='w3-hide-small' title='JavaScript Tutorial'>JAVASCRIPT</a><a href='/sql/default.asp' class='w3-hide-small' title='SQL Tutorial'>SQL</a><a href='/php/default.asp' class='w3-hide-small' title='PHP Tutorial'>PHP</a><a href='/bootstrap/default.asp' class='w3-hide-small' title='Bootstrap Tutorial'>BOOTSTRAP</a><a href='/jquery/default.asp' class='w3-hide-small' title='jQuery Tutorial'>JQUERY</a><a href='/angular/default.asp' class='w3-hide-small' title='Angular Tutorial'>ANGULAR</a><a href='/xml/default.asp' class='w3-hide-small' title='XML Tutorial'>XML</a>
I need them as separate results:
<a href='/html/default.asp' class='w3-hide-small' title='HTML Tutorial'>HTML</a>
<a href='/css/default.asp' class='w3-hide-small' title='CSS Tutorial'>CSS</a>
<a href='/js/default.asp' class='w3-hide-small' title='JavaScript Tutorial'>JAVASCRIPT</a>
and so on...
Update: see below.
If you have the HTML as a string, you can do the following:
// split the string up by anchor tags
// nested anchor tags are illegal, so this seems feasible:
var anchorArray = str.replace(/><a/g, '>¶<a').split('¶'); // ¶ is a placeholder to split on
var matches = [];
// no g flag here: a global regex remembers lastIndex between test() calls
// and would skip some of the elements
var re = /<a .*href=["'].*\.asp["'].*?>.*?<\/a>/;
// filter out the anchor elements with actual links from the final HTML
var remaining = anchorArray.filter(function(element) {
  if (re.test(element)) {
    matches.push(element); // keep the match in an array (2nd condition)
    return false;
  }
  return true;
});
// filter() returns a new array, so join that one rather than anchorArray
var returnedHTML = remaining.join(''); // HTML w/o actual links (1st condition)
Note that the preferred way to parse HTML is with an HTML parser, not with regular expressions.
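As a rough illustration of the parser-based route, here is a minimal sketch that assumes a browser environment where DOMParser is available. The function name extractAspAnchors is made up for this example, and "valid link" is assumed to mean an href ending in .asp, as in the question's regular expression.

```javascript
// Hypothetical helper: parse the HTML string and keep only anchors whose
// href ends in ".asp" (an assumption taken from the question's regex).
function extractAspAnchors(html) {
  // DOMParser is a browser API; it is not available in plain Node.js.
  var doc = new DOMParser().parseFromString(html, 'text/html');
  // The [href$=".asp"] selector matches attribute values ending in ".asp".
  var anchors = doc.querySelectorAll('a[href$=".asp"]');
  return Array.prototype.map.call(anchors, function (a) {
    return { href: a.getAttribute('href'), html: a.outerHTML };
  });
}

// Only run the demo where a DOM is present.
if (typeof DOMParser !== 'undefined') {
  var sample = "<a href='javascript:void(0);'>MENU</a><a href='/forum/default.asp'>FORUM</a>";
  console.log(extractAspAnchors(sample));
}
```

Because the parser handles quoting, attribute order, and nested markup, no manual splitting is needed. Note that outerHTML is re-serialized by the browser, so quote style may differ from the input.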
This should help:
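var matches = [];
// replace() is used only to iterate over the matches; its return value is ignored.
// This pattern expects a double-quoted href as the only attribute and a one-word link text.
input_content.replace(/[^<]*(<a href="([^"]+)">\w*<\/a>)/g, function () {
  // arguments[1] is the whole tag, arguments[2] is the href value
  matches.push(Array.prototype.slice.call(arguments, 1, 3));
});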
All matches end up in the matches array, one [tag, href] pair per anchor.
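For input like the question's, where href values are single-quoted and anchors carry extra attributes, a single-pass pattern built on negated character classes may be easier to reason about. The sketch below is only an illustration: findAspAnchors is a hypothetical name, and it assumes that no attribute value contains a ">" character and that String.prototype.matchAll (ES2020) is available.

```javascript
// Hypothetical helper: return every <a>...</a> whose href ends in ".asp".
function findAspAnchors(html) {
  // [^>]* cannot run past the end of the opening tag and [^"']* cannot run
  // past the closing quote, so a match never spills into the next anchor.
  // \1 requires the closing quote to be the same character as the opening one.
  const re = /<a\s[^>]*?href=(["'])[^"']*\.asp\1[^>]*>[\s\S]*?<\/a>/gi;
  return Array.from(html.matchAll(re), (m) => m[0]);
}

const navHtml =
  "<a id='topnavbtn_tutorials' href='javascript:void(0);' title='Tutorials'>TUTORIALS <i class='fa fa-caret-down'></i></a>" +
  "<a href='/forum/default.asp'>FORUM</a>";

const menuHtml =
  "<a href='/html/default.asp' class='w3-hide-small' title='HTML Tutorial'>HTML</a>" +
  "<a href='/css/default.asp' class='w3-hide-small' title='CSS Tutorial'>CSS</a>";

console.log(findAspAnchors(navHtml));  // only the FORUM anchor
console.log(findAspAnchors(menuHtml)); // two separate anchors
```

The greedy .* in the original pattern is what let one match swallow several anchors; restricting each part to characters that cannot leave the tag or the attribute value avoids that without splitting the string first.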
$string = "<a href='/html/default.asp' class='w3-hide-small' title='HTML Tutorial'>HTML</a><a href='/css/default.asp' class='w3-hide-small' title='CSS Tutorial'>CSS</a><a href='/js/default.asp' class='w3-hide-small' title='JavaScript Tutorial'>JAVASCRIPT</a><a href='/sql/default.asp' class='w3-hide-small' title='SQL Tutorial'>SQL</a><a href='/php/default.asp' class='w3-hide-small' title='PHP Tutorial'>PHP</a><a href='/bootstrap/default.asp' class='w3-hide-small' title='Bootstrap Tutorial'>BOOTSTRAP</a><a href='/jquery/default.asp' class='w3-hide-small' title='jQuery Tutorial'>JQUERY</a><a href='/angular/default.asp' class='w3-hide-small' title='Angular Tutorial'>ANGULAR</a><a href='/xml/default.asp' class='w3-hide-small' title='XML Tutorial'>XML</a>";
preg_match_all('%<a href=\'/.*?\'>.*?</a>%s', $string, $matches, PREG_PATTERN_ORDER);
foreach ($matches[0] as $match) {
    echo $match, PHP_EOL; // print one anchor per line
}
Output:
<a href='/html/default.asp' class='w3-hide-small' title='HTML Tutorial'>HTML</a>
<a href='/css/default.asp' class='w3-hide-small' title='CSS Tutorial'>CSS</a>
<a href='/js/default.asp' class='w3-hide-small' title='JavaScript Tutorial'>JAVASCRIPT</a>
<a href='/sql/default.asp' class='w3-hide-small' title='SQL Tutorial'>SQL</a>
<a href='/php/default.asp' class='w3-hide-small' title='PHP Tutorial'>PHP</a>
<a href='/bootstrap/default.asp' class='w3-hide-small' title='Bootstrap Tutorial'>BOOTSTRAP</a>
<a href='/jquery/default.asp' class='w3-hide-small' title='jQuery Tutorial'>JQUERY</a>
<a href='/angular/default.asp' class='w3-hide-small' title='Angular Tutorial'>ANGULAR</a>
<a href='/xml/default.asp' class='w3-hide-small' title='XML Tutorial'>XML</a>