
Scrape anchor (<a>) HTML tags

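I need to scrape <a> tags from HTML.

My goal is to scrape only the tags that have a valid link in their href attribute.

I think I'm very close to the answer. This is the regular expression I wrote: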

<a .*href=("|').*\.asp("|').*?>.*?<\/a>

http://regexr.com/3d989

First issue:

Result:

<a id='topnavbtn_tutorials' href='javascript:void(0);' onclick='w3_open_nav("tutorials")' title='Tutorials'>TUTORIALS <i class='fa fa-caret-down'></i><i class='fa fa-caret-up' style='display:none'></i></a><a id='topnavbtn_references' href='javascript:void(0);' onclick='w3_open_nav("references")' title='References'>REFERENCES <i class='fa fa-caret-down'></i><i class='fa fa-caret-up' style='display:none'></i></a><a id='topnavbtn_examples' href='javascript:void(0);' onclick='w3_open_nav("examples")' title='Examples'>EXAMPLES <i class='fa fa-caret-down'></i><i class='fa fa-caret-up' style='display:none'></i></a><a href='/forum/default.asp'>FORUM</a>

I only need:

<a href='/forum/default.asp'>FORUM</a>

Second issue:

Result:

<a href='/html/default.asp' class='w3-hide-small' title='HTML Tutorial'>HTML</a><a href='/css/default.asp' class='w3-hide-small' title='CSS Tutorial'>CSS</a><a href='/js/default.asp' class='w3-hide-small' title='JavaScript Tutorial'>JAVASCRIPT</a><a href='/sql/default.asp' class='w3-hide-small' title='SQL Tutorial'>SQL</a><a href='/php/default.asp' class='w3-hide-small' title='PHP Tutorial'>PHP</a><a href='/bootstrap/default.asp' class='w3-hide-small' title='Bootstrap Tutorial'>BOOTSTRAP</a><a href='/jquery/default.asp' class='w3-hide-small' title='jQuery Tutorial'>JQUERY</a><a href='/angular/default.asp' class='w3-hide-small' title='Angular Tutorial'>ANGULAR</a><a href='/xml/default.asp' class='w3-hide-small' title='XML Tutorial'>XML</a>

I need them as separate results:

<a href='/html/default.asp' class='w3-hide-small' title='HTML Tutorial'>HTML</a>

<a href='/css/default.asp' class='w3-hide-small' title='CSS Tutorial'>CSS</a>

<a href='/js/default.asp' class='w3-hide-small' title='JavaScript Tutorial'>JAVASCRIPT</a>

And so on...

Updated. See below.

If you have the HTML as a string, you can do the following:

// Split the string at anchor boundaries.
// Nested anchor tags are invalid HTML, so this seems feasible:
var anchorArray = str.replace(/><a/g, '>¶<a').split('¶'); // ¶ is a placeholder to split on

var matches = [];
// No g flag here: a global regex remembers lastIndex between test() calls
// and would then skip some elements.
var re = /<a .*href=["'].*\.asp["'].*?>.*?<\/a>/;

// Separate the anchors with actual links from the rest of the HTML.
// filter() returns a new array, so its result has to be kept.
var remaining = anchorArray.filter(function (element) {
    if (re.test(element)) {
        matches.push(element); // keep the match in an array (2nd condition)
        return false;
    }
    return true;
});

var returnedHTML = remaining.join('');  // HTML without the actual links (1st condition)

Note that the preferred way to parse HTML is with an HTML parser, not with regular expressions.
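As a rough illustration of the parser-based approach, here is a minimal sketch that assumes the code runs in a browser, where DOMParser is available (it is not a Node.js global). The function name extractAspAnchors is made up for this example. Note that outerHTML re-serializes each tag, so attribute quoting may differ from the original markup.

```javascript
// Sketch: collect <a> elements whose href ends in ".asp" with the browser's
// built-in HTML parser instead of a regular expression.
// Assumption: a browser environment (DOMParser is not defined in Node.js).
function extractAspAnchors(html) {
    var doc = new DOMParser().parseFromString(html, 'text/html');
    // The attribute selector [href$=".asp"] matches hrefs ending in ".asp"
    var anchors = doc.querySelectorAll('a[href$=".asp"]');
    return Array.prototype.map.call(anchors, function (a) {
        return a.outerHTML; // one string per anchor element
    });
}

// Usage, e.g. in a browser console:
// extractAspAnchors(str);
```

Because the parser knows where each element starts and ends, every anchor comes back as its own result and javascript: links are excluded by the selector.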

This should help:

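var matches = [];

// Escapes repaired, assuming \w* and <\/a> were intended (the slash must be escaped in a regex literal)
input_content.replace(/[^<]*(<a href="([^"]+)">\w*<\/a>)/g, function () {
    // arguments: full match, group 1 (the tag), group 2 (the href), offset, input
    matches.push(Array.prototype.slice.call(arguments, 1, 3));
});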

This collects every match in the matches variable, as an array.
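If the anchors carry other attributes or use single quotes, as in the question's sample, a pattern built on negated character classes may be more forgiving. The following is only a sketch: findAspLinks and the shortened sample string are made up for illustration, and it assumes attribute values never contain a literal ">".

```javascript
// Hypothetical helper: returns one { tag, href, text } object per .asp link.
function findAspLinks(html) {
    // [^>]* cannot cross a tag's closing ">", so a match never spans two anchors.
    // (["']) captures the opening quote and \1 requires the same closing quote.
    var re = /<a\b[^>]*\bhref=(["'])([^"']*\.asp)\1[^>]*>(.*?)<\/a>/g;
    var links = [];
    var m;
    while ((m = re.exec(html)) !== null) {
        links.push({ tag: m[0], href: m[2], text: m[3] });
    }
    return links;
}

// Shortened sample modelled on the markup in the question
var sample =
    "<a id='topnavbtn_examples' href='javascript:void(0);' title='Examples'>EXAMPLES</a>" +
    "<a href='/forum/default.asp'>FORUM</a>" +
    "<a href='/html/default.asp' class='w3-hide-small' title='HTML Tutorial'>HTML</a>";

var links = findAspLinks(sample);
console.log(links.map(function (l) { return l.href; }));
// logs the two .asp hrefs: /forum/default.asp and /html/default.asp
```

The javascript:void(0) anchor is skipped because its href does not end in .asp, and each remaining anchor is returned as a separate result.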

$string = "<a href='/html/default.asp' class='w3-hide-small' title='HTML Tutorial'>HTML</a><a href='/css/default.asp' class='w3-hide-small' title='CSS Tutorial'>CSS</a><a href='/js/default.asp' class='w3-hide-small' title='JavaScript Tutorial'>JAVASCRIPT</a><a href='/sql/default.asp' class='w3-hide-small' title='SQL Tutorial'>SQL</a><a href='/php/default.asp' class='w3-hide-small' title='PHP Tutorial'>PHP</a><a href='/bootstrap/default.asp' class='w3-hide-small' title='Bootstrap Tutorial'>BOOTSTRAP</a><a href='/jquery/default.asp' class='w3-hide-small' title='jQuery Tutorial'>JQUERY</a><a href='/angular/default.asp' class='w3-hide-small' title='Angular Tutorial'>ANGULAR</a><a href='/xml/default.asp' class='w3-hide-small' title='XML Tutorial'>XML</a>";

// Lazy quantifiers (.*?) keep each match inside a single anchor
preg_match_all('%<a href=\'/.*?\'>.*?</a>%s', $string, $matches, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($matches[0]); $i++) {
    echo $matches[0][$i], "\n"; // one anchor per line
}

Output:

<a href='/html/default.asp' class='w3-hide-small' title='HTML Tutorial'>HTML</a>
<a href='/css/default.asp' class='w3-hide-small' title='CSS Tutorial'>CSS</a>
<a href='/js/default.asp' class='w3-hide-small' title='JavaScript Tutorial'>JAVASCRIPT</a>
<a href='/sql/default.asp' class='w3-hide-small' title='SQL Tutorial'>SQL</a>
<a href='/php/default.asp' class='w3-hide-small' title='PHP Tutorial'>PHP</a>
<a href='/bootstrap/default.asp' class='w3-hide-small' title='Bootstrap Tutorial'>BOOTSTRAP</a>
<a href='/jquery/default.asp' class='w3-hide-small' title='jQuery Tutorial'>JQUERY</a>
<a href='/angular/default.asp' class='w3-hide-small' title='Angular Tutorial'>ANGULAR</a>
<a href='/xml/default.asp' class='w3-hide-small' title='XML Tutorial'>XML</a>

Demo:

https://ideone.com/eFHU8n
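The PHP pattern works because of its lazy quantifiers. To show the difference in the question's own language, here is a small JavaScript sketch that compares a greedy and a lazy version of the same pattern. The two-anchor html string is a shortened stand-in for the real input.

```javascript
var html =
    "<a href='/html/default.asp' title='HTML Tutorial'>HTML</a>" +
    "<a href='/css/default.asp' title='CSS Tutorial'>CSS</a>";

// Greedy: .* runs to the end of the input and backtracks to the LAST </a>,
// so both anchors end up in a single match.
var greedy = html.match(/<a href='\/.*'>.*<\/a>/g);

// Lazy: .*? stops at the FIRST position where the rest of the pattern matches,
// so each anchor is matched on its own.
var lazy = html.match(/<a href='\/.*?'>.*?<\/a>/g);

console.log(greedy.length); // 1
console.log(lazy.length);   // 2
```

This is the same effect the question describes: the greedy .* in the original expression is what glues adjacent anchors into one result.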


Note:

Why you shouldn't use regular expressions to parse HTML
